Project

SPIDAL: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science

Goal: Many scientific problems depend on the ability to analyze and compute on large amounts of data. This analysis often does not scale well; its effectiveness is hampered by the increasing volume, variety and rate of change (velocity) of big data. This project will design, develop and implement building blocks that enable a fundamental improvement in the ability to support data intensive analysis on a broad range of cyberinfrastructure, including that supported by NSF for the scientific community. The project will integrate features of traditional high-performance computing, such as scientific libraries, communication and resource management middleware, with the rich set of capabilities found in the commercial Big Data ecosystem. The latter includes many important software systems such as Hadoop, available from the Apache open source community. A collaboration between university teams at Arizona, Emory, Indiana (lead), Kansas, Rutgers, Virginia Tech, and Utah provides the broad expertise needed to design and successfully execute the project. The project will engage scientists and educators with annual workshops and activities at discipline-specific meetings, both to gather requirements for and feedback on its software. It will include under-represented communities with summer experiences, and will develop curriculum modules that include demonstrations built as 'Data Analytics as a Service.'

The project will design and implement a software Middleware for Data-Intensive Analytics and Science (MIDAS) that will enable scalable applications with the performance of HPC (High Performance Computing) and the rich functionality of the commodity Apache Big Data Stack. Further, this project will design and implement a set of cross-cutting high-performance data-analysis libraries; SPIDAL (Scalable Parallel Interoperable Data Analytics Library) will support new programming and execution models for data-intensive analysis in a wide range of science and engineering applications. The project addresses major data challenges in seven different communities: Biomolecular Simulations, Network and Computational Social Science, Epidemiology, Computer Vision, Spatial Geographical Information Systems, Remote Sensing for Polar Science, and Pathology Informatics. The project libraries will have the same beneficial impact on data analytics that scientific libraries such as PETSc, MPI and ScaLAPACK have had for supercomputer simulations. These libraries will be implemented to be scalable and interoperable across a range of computing systems including clouds, clusters and supercomputers.


Project log

Geoffrey Charles Fox
added a research item
Multidimensional scaling (MDS) has long played a vital role in analysing gene sequence data to identify clusters and patterns. However, the computational complexity and memory requirements of state-of-the-art multidimensional scaling algorithms make it infeasible to scale them to large datasets. In this paper we present an autoencoder-based dimension reduction model which can easily scale to datasets containing millions of gene sequences, while attaining results comparable to state-of-the-art MDS algorithms with minimal resource requirements. The model also supports out-of-sample data points with 99.5%+ accuracy in our experiments. The proposed model is evaluated against DAMDS on a real-world fungal gene sequence dataset. The presented results showcase the effectiveness of the autoencoder-based dimension reduction model and its advantages.
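The paper's model is not reproduced here; the sketch below only illustrates the general idea under stated assumptions: gene sequences are assumed to be pre-converted into fixed-length numeric feature vectors (e.g., k-mer counts), a small dense autoencoder is trained to reconstruct them, and the bottleneck layer provides low-dimensional coordinates comparable to an MDS embedding. All names, sizes and the random data are illustrative.

```python
import numpy as np
from tensorflow.keras import layers, models

# Illustrative sizes: each sequence is a 512-dimensional feature vector,
# embedded into 3 dimensions (comparable to a 3D MDS projection).
n_features, latent_dim = 512, 3

inputs = layers.Input(shape=(n_features,))
h = layers.Dense(128, activation="relu")(inputs)
latent = layers.Dense(latent_dim, name="embedding")(h)   # bottleneck = coordinates
h = layers.Dense(128, activation="relu")(latent)
outputs = layers.Dense(n_features)(h)

autoencoder = models.Model(inputs, outputs)
encoder = models.Model(inputs, latent)
autoencoder.compile(optimizer="adam", loss="mse")

# X stands in for an (n_sequences, n_features) matrix of sequence features.
X = np.random.default_rng(0).random((1000, n_features))
autoencoder.fit(X, X, epochs=5, batch_size=128, verbose=0)

coords = encoder.predict(X)            # in-sample embedding
new_coords = encoder.predict(X[:10])   # out-of-sample points need only a forward pass
```

Because the encoder is a plain feed-forward map, embedding out-of-sample sequences is a single forward pass, which is what makes this approach attractive compared to recomputing a full MDS solution.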
Geoffrey Charles Fox
added an update
We have just released Cylon v0.2.0. This release has the following new features along with many bug fixes and improvements.
C++
  • Aggregate and group-by APIs
  • Major performance improvements in the existing C++ API
  • C++ API refactoring
  • Creating tables using Columns
Python (PyCylon)
  • Extended Cython API for development of other Cython/Python libraries
  • Aggregate and group-by additions
  • Column name-based relational algebra operations and aggregate/group-by operations
  • Major performance improvements in the existing Python API
Java (JCylon)
  • Performance improvements
The GitHub release can be found at.
Regards, Cylon Team
See papers
and SMDS Presentation
 
Geoffrey Charles Fox
added an update
This is a major release of Twister2 and includes support for Cylon.
You can download source code from Github
Features of this release
  1. Fault tolerance enhancements: automated fault detection and recovery
  2. Table API (experimental), based on Cylon
  3. TSet API improvements: pipe capability and TSetEnvironment
Minor features
Apart from these, we have done many code improvements and bug fixes.
Next Release
In the next release, we are working to:
  • Improve and release the Table API
  • TSQL: adding SQL support
 
Geoffrey Charles Fox
added an update
Cylon Release 0.1.0
Who should use Cylon?
  • Users of Pandas dataframes or SQL interface
  • Those needing parallel data engineering
  • Those needing Python C++ Java interoperability
  • HPC Python (Dask) and Big Data (Kubernetes) environments
Major Features in v0.1.0
  • Introducing Cylon C++ engine based on Apache Arrow.
  • Cylon C++, Python (PyCylon) and Java language bindings
  • Seamless integration with Pandas and NumPy
  • Distributed operations using MPI
  • Local and distributed operations (Select, Project, Joins, Intersection, Union, Subtract); see the sketch after this list
  • Jupyter notebook support and experimental Google Colab support
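The PyCylon API is not reproduced here; as a rough illustration of the relational operations listed above (written against plain pandas only), the sketch below shows what Select, Project, Join, Union and Subtract compute on two small tables. Cylon provides equivalent operations that run locally or distributed over MPI and interoperate with pandas and NumPy.

```python
import pandas as pd

# Two small illustrative tables standing in for distributed Cylon tables.
left = pd.DataFrame({"id": [1, 2, 3, 4], "x": [0.1, 0.2, 0.3, 0.4]})
right = pd.DataFrame({"id": [3, 4, 5], "y": ["a", "b", "c"]})

selected = left[left["x"] > 0.15]                    # Select (row filter)
projected = left[["id"]]                             # Project (column subset)
joined = left.merge(right, on="id", how="inner")     # Join
union = pd.concat([left[["id"]], right[["id"]]]).drop_duplicates()  # Union
subtract = left[~left["id"].isin(right["id"])]       # Subtract (anti-join)

print(joined)
```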
 
Geoffrey Charles Fox
added a research item
The dataflow model is gradually becoming the de facto standard for big data applications. While many popular frameworks are built around this model, very little research has been done on understanding its inner workings, which in turn has led to inefficiencies in existing frameworks. Understanding the relationship between dataflow and HPC building blocks allows us to address and alleviate many of these fundamental inefficiencies by learning from the extensive research literature in the HPC community. In this paper, we present TSets, the dataflow abstraction of Twister2, a big data framework designed for high-performance dataflow and iterative computations. We discuss the dataflow model adopted by TSets and the rationale behind implementing iteration handling at the worker level. Finally, we evaluate TSets to show the performance of the framework and the importance of the worker-level iteration model.
Geoffrey Charles Fox
added 4 research items
Support Vector Machines (SVM) is a widely used lightweight machine learning algorithm that can do efficient training on smaller data sets. In this research, we focus on a highly scalable gradient descent based approach. In providing a scalable solution, we propose to use a high-performance computing model and a big data computing model (dataflow). Designing this algorithm with an MPI-like programming model has been widely used. In this paper, our objective is to enhance a training model designed by us with math kernels and to analyze how the C++ and Java programming languages can be used to design optimized algorithms. We also discuss the overheads in the applications and the optimization techniques used to improve performance. In this research, our objective is to build this algorithm with the support of multiple dataflow design mechanisms involving iterative and ensemble training models. For this purpose, we use Twister2, a big data toolkit which provides the basic infrastructure to address this kind of problem. We also compare the performance of the Twister2 APIs with Spark RDD and MPI. In our research, we showcase how the high-performance computing stack and the big data programming stack can be used to optimize the training of the SGD-based SVM algorithm in distributed environments.
We present a taxonomy of research on Machine Learning (ML) applied to enhance simulations together with a catalog of some activities. We cover eight patterns for the link of ML to the simulations or systems plus three algorithmic areas: particle dynamics, agent-based models and partial differential equations. The patterns are further divided into three action areas: Improving simulation with Configurations and Integration of Data, Learn Structure, Theory and Model for Simulation, and Learn to make Surrogates.
We recently outlined the vision of "Learning Everywhere", which captures the possibility and impact of coupling learning methods and traditional HPC methods. A primary driver of such coupling is the promise that Machine Learning (ML) will give major performance improvements for traditional HPC simulations. Motivated by this potential, the ML around HPC class of integration is of particular significance. In a related follow-up paper, we provided an initial taxonomy for integrating learning around HPC methods. In this paper, which is part of the Learning Everywhere series, we discuss how learning methods and HPC simulations are being integrated to enhance the effective performance of computations. This paper identifies several modes (substitution, assimilation, and control) in which learning methods integrate with HPC simulations and provides representative applications in each mode. We discuss some open research questions that we hope will motivate and clear the ground for MLaroundHPC benchmarks.
Geoffrey Charles Fox
added an update
Twister2-0.4.0 Release - Oct 04, 2019
  1. Connected DataFlow (classic coarse-grain orchestration)
  2. A fully compliant BEAM integration (will be sent to Google for certification)
  3. Fault tolerance complete: individual worker restart and automatic restart
  4. Initial Python version (supports Keras and TensorFlow)
  5. Set of integration tests to run nightly on the Delta cluster
  6. More unit tests
  7. Java API Documentation
Twister2-0.4.1 Release - Nov 01, 2019
  1. BEAM portable runner based implementation
  2. The second version of Python API
Twister2-0.5.0 Release - Dec 01, 2019
  1. The third version of Python API
 
Geoffrey Charles Fox
added an update
Twister2 Release 0.3.0-rc1
This is a major release of Twister2.
The GitHub release can be found at.
The release note can be found at,
Source code can be downloaded from https://github.com/DSC-SPIDAL/twister2/archive/0.3.0-rc1.zip and updated Twister2 documentation can be found at https://twister2.org/docs/introduction
Features of this release
In this release we moved to OpenMPI 4.0.1 and Python 3. We also tested Twister2 with JDK 11.
  1. The initial version of Apache BEAM integration
  2. Fully functioning TSet API
  3. Simulator for writing applications with IDE
  4. Organize the APIs to facilitate easy creation of applications
  5. Improvements to performance including a new routing algorithm for shuffle operations
  6. Improved batch task scheduler (new batch scheduler)
  7. Inner joins and outer joins
  8. Support for reading HDFS files through TSet API
  9. The initial version of fault tolerance with manual restart
  10. Configuration structure improvements
  11. Nomad scheduler improvements
  12. New documentation website
Minor features
Apart from these, we have done many code improvements and bug fixes.
Next Release
In the next release we are working to consolidate the Apache BEAM integration and improve fault tolerance (automatic restart of workers). It will include the first release of the Python API.
Components in Twister2
We support the following components in Twister2
  1. Resource provisioning component to bring up and manage parallel workers in cluster environments
    1. Standalone
    2. Kubernetes
    3. Mesos
    4. Slurm
    5. Nomad
  2. Parallel and Distributed Operators in HPC and Cloud Environments
    1. Twister2:Net - a data level dataflow operator library for streaming and large scale batch analysis
    2. Harp - a BSP (Bulk Synchronous Processing) innovative collective framework for parallel applications and machine learning at message level
    3. OpenMPI (HPC Environments only) at message level
  3. Task System
    1. Task Graph
      • Create dataflow graphs for streaming and batch analysis including iterative computations
    2. Task Scheduler - Schedule the task graph into cluster resources supporting different scheduling algorithms
      • Datalocality Scheduling
      • Roundrobin scheduling
      • First fit scheduling
    3. Executor - Execution of task graph
      • Batch executor
      • Streaming executor
  4. TSet for distributed data representation (Similar to Spark RDD, Flink DataSet and Heron Streamlet)
    1. Iterative computations
    2. Data caching
  5. APIs for streaming and batch applications
    1. Operator API
    2. Task Graph based API
    3. TSet API
  6. Support for storage systems
    1. HDFS
    2. Local file systems
    3. NFS for persistent storage
  7. Web UI for monitoring Twister2 Jobs
  8. Apache Storm Compatibility API
  9. Apache BEAM API
  10. Connected DataFlow (Experimental)
    1. Supports creation of multiple dataflow graphs executing in a single job
 
Geoffrey Charles Fox
added 2 research items
High-Performance Computing (HPC) and Cyberinfrastructure have played a leadership role in computational science ever since the start of the NSF computing centers program. Thirty years ago parallel computing was a centerpiece of computer science research. Naively, Big Data surely requires HPC to be processed, and transformational Big Data technologies such as Hadoop and Spark exploit parallelism successfully. Nevertheless, the HPC community does not appear to be thriving as a leader in Data Science, while parallel computing is no longer a centerpiece. Some reasons for this are the dominant presence of Industry in technology futures and the universal fascination with Artificial Intelligence and Machine Learning. Maybe the pendulum will swing back a bit, but I expect the "AI first" philosophy to dominate in the foreseeable future. Thus I describe a future where HPC thrives in collaboration with Industry and AI. In particular, I discuss the promise of MLforHPC (AI for systems) and HPCforML (systems for AI).
Saliya Ekanayake
added 2 research items
We focus on two classes of problems in graph mining: (1) finding trees and (2) anomaly detection in complex networks using scan statistics. These are fundamental problems in a broad class of applications. Most of the parallel algorithms for such problems are either based on heuristics, which do not scale very well, or use techniques like color coding, which have a high memory overhead. In this paper, we develop a novel approach for parallelizing both these classes of problems, using an algebraic representation of subgraphs as monomials; this methodology involves detecting multilinear terms in multivariate polynomials. Our algorithms show good scaling over a large regime, and they run on networks with close to half a billion edges. The resulting parallel algorithm for trees is able to scale to subgraphs of size 18, which has not been done before, and it significantly outperforms the best prior color coding based method (FASCIA) by more than two orders of magnitude. Our algorithm for network scan statistics is the first such parallelization, and it is able to handle a broad class of scan statistics functions with the same approach.
Geoffrey Charles Fox
added an update
b) The paper Perspectives on High-Performance Computing in a Big Data World
has been updated with full text
c) The paper Learning Everywhere Resource for BDEC General Links
has been updated with a new full text
d) The paper Contributions to High Performance Big Data Computing
has been updated with a new full text
 
Geoffrey Charles Fox
added a research item
Bare metal servers are widely available on public clouds, providing direct access to hardware, and system configurations with high performance storage and network devices are well suited for big data applications. A highly optimized server with a larger CPU core count and dense storage may lead to better performance on certain workloads and help ensure responsiveness of deployed services. Recent work on Hadoop ecosystems has addressed the performance improvement of scale-up machines configured with SSD storage and increased network bandwidth. This paper evaluates big data processing on dedicated clusters and provides a performance analysis of the NVMe devices and SSD block storage options available on the Amazon, Google, Microsoft, and Oracle clouds. We show benchmark results along with system performance tests to demonstrate the compute resource requirements for large-scale applications. The system capacity and limits of the underlying servers are described, along with a cost analysis of scaling workloads on these platforms.
Geoffrey Charles Fox
added a research item
The performance of biomolecular molecular dynamics (MD) simulations has steadily increased on modern high performance computing (HPC) resources but acceleration of the analysis of the output trajectories has lagged behind so that analyzing simulations is increasingly becoming a bottleneck. To close this gap, we studied the performance of parallel trajectory analysis with MPI and the Python MDAnalysis library on three different XSEDE supercomputers where trajectories were read from a Lustre parallel file system. We found that strong scaling performance was impeded by stragglers, MPI processes that were slower than the typical process and that therefore dominated the overall run time. Stragglers were less prevalent for compute-bound workloads, thus pointing to file reading as a crucial bottleneck for scaling. However, a more complicated picture emerged in which both the computation and the ingestion of data exhibited close to ideal strong scaling behavior whereas stragglers were primarily caused by either large MPI communication costs or long times to open the single shared trajectory file. We improved overall strong scaling performance by two different approaches to file access, namely subfiling (splitting the trajectory into as many trajectory segments as number of processes) and MPI-IO with Parallel HDF5 trajectory files. Applying these strategies, we obtained near ideal strong scaling on up to 384 cores (16 nodes). We summarize our lessons-learned in guidelines and strategies on how to take advantage of the available HPC resources to gain good scalability and potentially reduce trajectory analysis times by two orders of magnitude compared to the prevalent serial approach.
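The benchmark codes themselves are not shown here; the sketch below only outlines the basic MPI pattern the paper studies, assuming mpi4py and MDAnalysis are installed and using hypothetical file names, an illustrative atom selection and a per-frame RMSD as the analysis kernel: every rank opens the shared trajectory, reads and analyzes its own block of frames, and the results are gathered at the end.

```python
from mpi4py import MPI
import numpy as np
import MDAnalysis as mda
from MDAnalysis.analysis import rms

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Hypothetical input files; each rank opens the shared trajectory independently.
u = mda.Universe("topology.psf", "trajectory.dcd")
protein = u.select_atoms("protein and name CA")
ref = protein.positions.copy()

# Static block decomposition of frames across MPI ranks.
n_frames = len(u.trajectory)
block = np.array_split(np.arange(n_frames), size)[rank]

local = []
for i in block:
    u.trajectory[i]                      # move to frame i (file read happens here)
    local.append((int(i), rms.rmsd(protein.positions, ref, superposition=True)))

# Gather per-rank partial results on rank 0 and restore frame order.
results = comm.gather(local, root=0)
if rank == 0:
    rmsd = sorted(r for part in results for r in part)
    print(rmsd[:5])
```

In this simple form every rank opens the same shared file, which is exactly the situation in which the paper observed stragglers; the subfiling and MPI-IO/HDF5 strategies it describes change how that file access is organized.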
Geoffrey Charles Fox
added an update
Geoffrey Fox, "Perspectives on High-Performance Computing in a Big Data World", http://dsc.soic.indiana.edu/presentations/HPDC%20Presentation.pptx, ACM HPDC 2019 The 28th International Symposium on High-Performance Parallel and Distributed Computing, Phoenix, Arizona, USA - June 27, 2019. There is a Google Slides version https://docs.google.com/presentation/d/1HQ1QPKO6QPEHEg9-h7y3wUZgKEspzvTqMasRObL6NF4/edit?usp=sharing and a YouTube video https://www.youtube.com/playlist?list=PLy0VLh_GFyz8VEJa1syIwzsY4zI61kKYq (short) or longer version https://www.youtube.com/playlist?list=PLy0VLh_GFyz8QWhdAO0IKfc5Mjc62dK5i both in 5 parts.
 
Geoffrey Charles Fox
added a research item
A collection of papers and activities in MLforHPC and related areas.
Geoffrey Charles Fox
added a research item
Understanding the bottlenecks in implementing a stochastic gradient descent (SGD)-based distributed support vector machine (SVM) algorithm is important for training on larger data sets. The communication time needed to synchronize the model across the parallel processes is the main bottleneck that causes inefficiency in the training process. Model synchronization is directly affected by the mini-batch size of data processed before the global synchronization. To produce an efficient distributed model, the communication time spent on model synchronization has to be as low as possible while retaining high testing accuracy. The effect of model synchronization frequency on the convergence of the algorithm and the accuracy of the generated model must be well understood to design an efficient distributed model. In this research, we identify the bottlenecks in model synchronization in the parallel stochastic gradient descent (PSGD)-based SVM algorithm with respect to the training model synchronization frequency (MSF). Our research shows that by optimizing the MSF on the data sets that we used, a reduction of 98% in communication time can be gained (a 16x-24x speedup) with respect to high-frequency model synchronization. The training model optimization discussed in this paper guarantees a higher accuracy than the sequential algorithm along with faster convergence.
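As a minimal sketch of the MSF idea (not the authors' implementation), the snippet below runs mini-batch SGD for a linear hinge-loss SVM independently on each MPI rank's data partition and averages the model across ranks only every `msf` mini-batches, so a lower synchronization frequency directly reduces communication time. The synthetic data, hyperparameters and averaging scheme are illustrative assumptions.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Hypothetical local data partition: features X and labels y in {-1, +1}.
rng = np.random.default_rng(rank)
X = rng.normal(size=(10_000, 64))
y = np.where(X[:, 0] + 0.1 * rng.normal(size=10_000) > 0, 1.0, -1.0)

w = np.zeros(X.shape[1])
lr, lam = 0.01, 1e-4
batch, msf = 128, 16            # synchronize the model every `msf` mini-batches

for step in range(1, 2001):
    idx = rng.integers(0, len(X), size=batch)
    Xb, yb = X[idx], y[idx]
    margin = yb * (Xb @ w)
    grad = lam * w                                # regularization term
    viol = margin < 1                             # hinge-loss violators
    if viol.any():
        grad -= (Xb[viol] * yb[viol, None]).mean(axis=0)
    w -= lr * grad
    if step % msf == 0:                           # model synchronization step
        w = comm.allreduce(w, op=MPI.SUM) / size  # average weights across ranks
```

Increasing `msf` trades communication time against how quickly the ranks' models agree, which is the convergence-versus-communication balance the paper quantifies.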
Geoffrey Charles Fox
added 2 research items
The dataflow model is slowly becoming the de facto standard for big data applications. While many popular frameworks are built around the dataflow model, very little research has been done on understanding its inner workings, which has led to many inefficiencies in existing frameworks. Understanding the relation between dataflow and HPC building blocks allows us to address and alleviate many of the fundamental inefficiencies in dataflow by learning from the extensive research literature in the HPC community. In this paper, we present TSets, the dataflow abstraction of Twister2, a big data framework designed for high-performance dataflow and iterative computations. We discuss the dataflow model adopted by TSets and the rationale behind implementing iteration handling at the worker level. Finally, we evaluate TSets to show the performance of the framework.
Geoffrey Charles Fox
added an update
Twister2 0.2.0 is the second open source public release of Twister2. We are excited to bring another release of our high performance data analytics hosting environment that can work in both cloud and HPC environments.
You can download source code from Github
Major Features
This release includes the core components of realizing the above goals.
  1. Resource provisioning component to bring up and manage parallel workers in cluster environments
    1. Standalone
    2. Kubernetes
    3. Mesos
    4. Slurm
    5. Nomad
  2. Parallel and Distributed Communications in HPC and Cloud Environments
    1. Twister2:Net - a data level dataflow communication library for streaming and large scale batch analysis
    2. Harp - a BSP (Bulk Synchronous Processing) innovative collective framework for parallel applications and machine learning at message level
    3. OpenMPI (HPC Environments only) at message level
  3. Task System
    1. Task Graph
      • Create dataflow graphs for streaming and batch analysis including iterative computations
    2. Task Scheduler - Schedule the task graph into cluster resources supporting different scheduling algorithms
      • Datalocality Scheduling
      • Roundrobin scheduling
      • First fit scheduling
    3. Executor - Execution of task graph
      • Batch executor
      • Streaming executor
  4. API for creating Task Graph and Communication
    1. Communication API
    2. Task based API
    3. Data API (TSet API)
  5. Support for storage systems
    1. HDFS
    2. Local file systems
    3. NFS for persistent storage
  6. Web UI for monitoring Twister2 Jobs
  7. Apache Storm Compatibility API
These features translate to running the following types of applications natively with high performance.
  1. Streaming computations
  2. Data operations in batch mode
  3. Iterative computations
 
Geoffrey Charles Fox
added a research item
The convergence of HPC and data intensive methodologies provides a promising approach to major performance improvements. This paper provides a general description of the interaction between traditional HPC and ML approaches and motivates the "Learning Everywhere" paradigm for HPC. We introduce the concept of "effective performance" that one can achieve by combining learning methodologies with simulation based approaches, and distinguish it from traditional performance as measured by benchmark scores. To support the promise of integrating HPC and learning methods, this paper examines specific examples and opportunities across a series of domains. It concludes with a series of open computer science and cyberinfrastructure questions and challenges that the Learning Everywhere paradigm presents.
Geoffrey Charles Fox
added 2 research items
This presentation describes a Big Data Systems Environment of the Global AI Modelling and Simulation Supercomputer. We follow with the Twister2 approach to this, covering both Machine Learning for HPC and HPC for Machine Learning.
This paper describes opportunities at the interface between large-scale simulations, experiment design and control, machine learning (ML including deep learning DL) and High-Performance Computing. We describe both the current status and possible research issues in allowing machine learning to pervasively enhance computational science. How should one do this and where is it valuable? We focus on research challenges on computing for science and engineering (as opposed to commercial) use cases for both big data and big simulation problems.
Geoffrey Charles Fox
added 2 research items
We explore the idea of integrating machine learning with simulations to enhance the performance of the simulation and improve its usability for research and education. The idea is illustrated using hybrid openMP/MPI parallelized molecular dynamics simulations designed to extract the distribution of ions in nanoconfinement. We find that an artificial neural network based regression model successfully learns the desired features associated with the output ionic density profiles and rapidly generates predictions that are in excellent agreement with the results from explicit molecular dynamics simulations. The results demonstrate that the performance gains of parallel computing can be further enhanced by using machine learning.
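As a rough sketch of this ML-around-simulation idea (not the paper's network or data), the snippet below fits a small scikit-learn regressor that maps simulation input parameters to a binned density profile; once trained, its predictions stand in for new MD runs. The synthetic data generator is purely a placeholder for stored simulation outputs, and all names and sizes are illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

# Placeholder data: rows of X are simulation inputs (e.g., confinement width,
# ion concentration, ion diameter); rows of Y are binned ionic density profiles.
rng = np.random.default_rng(0)
X = rng.uniform(size=(2000, 3))
grid = np.linspace(0.0, 1.0, 50)
Y = np.array([(1 + x[2]) * np.exp(-((grid - x[0]) ** 2) / (0.05 + 0.1 * x[1])) for x in X])

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)

# A small multilayer perceptron acts as the surrogate regression model.
surrogate = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
surrogate.fit(X_tr, Y_tr)
print("held-out R^2:", surrogate.score(X_te, Y_te))

# A trained surrogate answers parameter sweeps without launching new MD runs.
profile = surrogate.predict(np.array([[0.5, 0.3, 0.7]]))
```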
Recently, with sharp or only slightly blurred images, convolutional neural networks classify objects at around 90 percent classification rates, even for variable-sized images. However, small object regions or cropped images make object detection or classification difficult and decrease the detection rates. In many methods related to convolutional neural networks (CNNs), bilinear or bicubic algorithms are commonly used to interpolate regions of interest. To overcome the limitations of these algorithms, we introduce a super-resolution method applied to the cropped regions or candidates, which improves recognition rates for object detection and classification. Large object candidates comparable in size to the full image give good results for object detection using many popular conventional methods. However, for smaller region candidates, using our super-resolution preprocessing allows a CNN to outperform conventional methods in the number of detected objects when tested on the VOC2007 and MSO datasets.
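For context on the preprocessing step being improved, this sketch shows only the conventional bicubic baseline: a small candidate region is cropped and interpolated up to the classifier's input size. The paper replaces this interpolation with a learned super-resolution model; the image, box and target size here are illustrative assumptions.

```python
import numpy as np
from PIL import Image

def upscale_roi(image, box, target=(224, 224)):
    """Crop a candidate region (left, upper, right, lower) and resize it bicubically."""
    return image.crop(box).resize(target, Image.BICUBIC)

# Hypothetical usage with a placeholder image and a small detection candidate.
img = Image.fromarray((np.random.default_rng(0).random((240, 320, 3)) * 255).astype("uint8"))
patch = upscale_roi(img, box=(30, 40, 70, 90))
x = np.asarray(patch, dtype=np.float32)[None] / 255.0   # batch of one for a CNN classifier
```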
Geoffrey Charles Fox
added an update
Geoffrey Fox for the Twister2 Team, "Big Data Overview for Twister2 Tutorial", http://dsc.soic.indiana.edu/presentations/BigDataTutorialJan2019.pptx , January 10-11 2019, 5th International Winter School on Big Data BigDat2019, http://bigdat2019.irdta.eu/ Cambridge, United Kingdom - January 7-11, 2019
Geoffrey Fox for the Twister2 Team, "Twister2 Tutorial", https://twister2.gitbook.io/twister2/tutorial
 
Geoffrey Charles Fox
added an update
Twister2 0.1.0 is the first open source public release of Twister2. We are excited to bring a high performance data analytics hosting environment that can work in both cloud and HPC environments. This is the first step towards building a complete end to end high performance solution for data analytics ranging from streaming to batch analysis to machine learning applications. Our vision is to make the system work seamlessly both in cloud and HPC environments ranging from single machines to large clusters.
You can download source code from https://github.com/DSC-SPIDAL/twister2/releases
As of October 5, 2018, Twister2 comprises 64,000 lines of code, and it invokes Harp, which adds a further 69,000 lines.
Major Features
This release includes the core components of realizing the above goals.
Resource provisioning component to bring up and manage parallel workers in cluster environments
  1. Standalone
  2. Kubernetes
  3. Mesos
  4. Slurm
  5. Nomad
Parallel and Distributed Communications in HPC and Cloud Environments
  1. Twister2:Net - a data level dataflow communication library for streaming and large scale batch analysis
  2. Harp - a BSP (Bulk Synchronous Processing) innovative collective framework for parallel applications and machine learning at message level
  3. OpenMPI (HPC Environments only) at message level
Task Graph - Create dataflow graphs for streaming and batch analysis including iterative computations
Task Scheduler - Schedule the task graph into cluster resources supporting different scheduling algorithms
  1. Datalocality Scheduling
  2. Roundrobin scheduling
  3. First fit scheduling
Executor - Execution of task graph
  1. Batch executor
  2. Streaming executor
API for creating Task Graph and Communication
  1. Communication API
  2. Task based API
Support for storage systems
  1. HDFS
  2. Local file systems
  3. NFS for persistent storage
These features translate to running the following types of applications natively with high performance.
  1. Streaming computations
  2. Data operations in batch mode
  3. Iterative computations
Examples
With this release we include several examples to demonstrate various features of Twister2.
  1. A Hello World example
  2. Communication examples - how to use communications for streaming and batch
  3. Task examples - how to create task graphs with different operators for streaming and batch
  4. K-Means
  5. Sorting of records
  6. Word count
  7. Iterative examples
  8. Harp example
Road map
We have started working on our next major release that will connect the core components we have developed into a full data analytics environment. In particular it will focus on providing APIs around the core capabilities of Twister2 and integration of applications in a single dataflow.
Next release (End of December 2018)
  1. Hierarchical task scheduling - Ability to run different types of jobs in a single dataflow
  2. Fault tolerance
  3. Data API including DataSet similar to Spark RDD, Flink DataSet and Heron Streamlet
  4. Supporting different API's including Storm, Spark, Beam
  5. Heterogeneous resources allocations
  6. Web UI for monitoring Twister2 Jobs
  7. More resource managers - Pilot Jobs, Yarn
  8. More example applications
Beyond next release
  1. Implementing core parts of Twister2 with C/C++ for high performance
  2. Python APIs
  3. Direct use of RDMA
  4. FaaS APIs
  5. SQL interface
  6. Native MPI support for cloud deployments
 
Geoffrey Charles Fox
added a research item
Our project is at the interface of Big Data and HPC (High-Performance Big Data Computing), and this paper describes a collaboration between seven universities: Arizona State, Indiana (lead), Kansas, Rutgers, Stony Brook, Virginia Tech, and Utah. It addresses the intersection of High-performance and Big Data computing with several different application areas or communities driving the requirements for software systems and algorithms. We describe the base architecture, including the HPC-ABDS (High-Performance Computing enhanced Apache Big Data Stack), and an application use case study identifying key features that determine software and algorithm requirements. We summarize middleware including the Harp-DAAL collective communication layer, the Twister2 Big Data toolkit and pilot jobs. Then we present the SPIDAL Scalable Parallel Interoperable Data Analytics Library and our work for it in core machine learning, image processing and the application communities: Network Science, Polar Science, Biomolecular Simulations, Pathology and Spatial Systems. We describe basic algorithms and their integration in end-to-end use cases.
Geoffrey Charles Fox
added a research item
Over the past four years, the Big Data and Exascale Computing (BDEC) project organized a series of five international workshops that aimed to explore the ways in which the new forms of data-centric discovery introduced by the ongoing revolution in high-end data analysis (HDA) might be integrated with the established, simulation-centric paradigm of the high-performance computing (HPC) community. Based on those meetings, we argue that the rapid proliferation of digital data generators, the unprecedented growth in the volume and diversity of the data they generate, and the intense evolution of the methods for analyzing and using that data are radically reshaping the landscape of scientific computing. The most critical problems involve the logistics of wide-area, multistage workflows that will move back and forth across the computing continuum, between the multitude of distributed sensors, instruments and other devices at the network's edge, and the centralized resources of commercial clouds and HPC centers. We suggest that the prospects for the future integration of technological infrastructures and research ecosystems need to be considered at three different levels. First, we discuss the convergence of research applications and workflows that establish a research paradigm that combines both HPC and HDA, where ongoing progress is already motivating efforts at the other two levels. Second, we offer an account of some of the problems involved with creating a converged infrastructure for peripheral environments, that is, a shared infrastructure that can be deployed throughout the network in a scalable manner to meet the highly diverse requirements for processing, communication, and buffering/storage of massive data workflows of many different scientific domains. Third, we focus on some opportunities for software ecosystem convergence in big, logically centralized facilities that execute large-scale simulations and models and/or perform large-scale data analytics. We close by offering some conclusions and recommendations for future investment and policy review.
Supun Kamburugamuve
added a research item
Streaming processing and batch data processing are the dominant forms of big data analytics today, with numerous systems such as Hadoop, Spark, and Heron designed to process the ever-increasing explosion of data. Generally, these systems are developed as single projects with aspects such as communication, task management, and data management integrated together. By contrast, we take a component-based approach to big data by developing the essential features of a big data system as independent components with polymorphic implementations to support different requirements. Consequently, we recognize the requirements of both dataflow used in popular Apache Systems and the Bulk Synchronous Processing communication style common in High-Performance Computing(HPC) for different applications. Message passing interface implementations are dominant in HPC but there are no such standard libraries available for big data. Twister:Net is a stand-alone, highly optimized dataflow style parallel communication library which can be used by big data systems or advanced users. Twister:Net can work both in cloud environments using TCP or HPC environments using Message Passing Interface implementations. This paper introduces Twister:Net and compares it with existing systems to highlight its design and performance.
Geoffrey Charles Fox
added a research item
Although the “big data” revolution first came to public prominence (circa 2010) in online enterprises like Google, Amazon, and Facebook, it is now widely recognized as the initial phase of a watershed transformation that modern society generally—and scientific and engineering research in particular—are in the process of undergoing. Responding to this disruptive wave of change, over the past four years, the Big Data and Exascale Computing (BDEC) project organized a series of five international workshops that aimed to explore the ways in which the new forms of data-centric discovery introduced by this revolution might be integrated with the established, simulation-centric paradigm of the high-performance computing (HPC) community. These BDEC workshops grew out of the prior efforts of the International Exascale Software Project (IESP)—a collaboration of US, EU, and Japanese HPC communities that produced an influential roadmap for achieving exascale computing early in the next decade. It also shared the IESP’s mission to foster the co-design of shared software infrastructure for extreme-scale science that draws on international cooperation and supports a broad spectrum of major research domains. However, as we argue in more detail in this report, subsequent reflections on the content and discussions of the BDEC workshops make it evident that the rapid proliferation of digital data generators, the unprecedented growth in the volume and diversity of the data they generate, and the intense evolution of the methods for analyzing and using that data are radically reshaping the landscape of scientific computing.
Geoffrey Charles Fox
added a research item
Deep learning methods have surpassed the performance of traditional techniques on a wide range of problems in computer vision, but nearly all of this work has studied consumer photos, where precisely correct output is often not critical. It is less clear how well these techniques may apply on structured prediction problems where fine-grained output with high precision is required, such as in scientific imaging domains. Here we consider the problem of segmenting echogram radar data collected from the polar ice sheets, which is challenging because segmentation boundaries are often very weak and there is a high degree of noise. We propose a multi-task spatiotemporal neural network that combines 3D ConvNets and Recurrent Neural Networks (RNNs) to estimate ice surface boundaries from sequences of tomographic radar images. We show that our model outperforms the state-of-the-art on this problem by (1) avoiding the need for hand-tuned parameters, (2) extracting multiple surfaces (ice-air and ice-bed) simultaneously, (3) requiring less non-visual metadata, and (4) being about 6 times faster.
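The authors' architecture is not reproduced here; the Keras sketch below only illustrates the general combination described above, under illustrative assumptions about shapes and sizes: 3D convolutions over a short sequence of radar slices, a recurrent layer to propagate context along the flight path, and two per-column regression heads for the ice-air and ice-bed boundaries.

```python
from tensorflow.keras import layers, models

# Illustrative input: a sequence of T radar slices, each H x W with one channel.
T, H, W = 8, 64, 64
inputs = layers.Input(shape=(T, H, W, 1))

# 3D convolutions capture spatiotemporal structure across neighbouring slices.
x = layers.Conv3D(16, (3, 3, 3), padding="same", activation="relu")(inputs)
x = layers.Conv3D(16, (3, 3, 3), padding="same", activation="relu")(x)

# Flatten each slice and let a recurrent layer carry context along the sequence.
x = layers.TimeDistributed(layers.Flatten())(x)
x = layers.GRU(64, return_sequences=True)(x)

# Two regression heads: one predicted boundary row per image column, per slice.
ice_air = layers.TimeDistributed(layers.Dense(W), name="ice_air")(x)
ice_bed = layers.TimeDistributed(layers.Dense(W), name="ice_bed")(x)

model = models.Model(inputs, [ice_air, ice_bed])
model.compile(optimizer="adam", loss="mse")
model.summary()
```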
Geoffrey Charles Fox
added an update
Tutorial at 4th International Winter School http://grammars.grlmc.com/BigDat2018/index.php on Big Data, Timişoara, Romania, January 22-26, 2018
Geoffrey Fox gcfexchange@gmail.com
Short Panel Talk on Requirements and Jobs https://drive.google.com/open?id=16K961AY6SRMRvCbaNFtwAUT2aFMslRG6
Overview of Tutorial
General principles
Presentation on HPC-ABDS, Cloud Status, and Ogres Application Analysis; HPC-Cloud and Data-Simulation convergence
Twister2 Tutorial -- Initial Version
  1. Streaming word count
  2. Batch word count
  3. Install - https://github.com/DSC-SPIDAL/twister2/blob/master/INSTALL.md
  4. Examples - https://github.com/DSC-SPIDAL/twister2/blob/master/docs/examples.md
Harp-DAAL Tutorial (using Docker)
  1. Video: https://www.youtube.com/watch?v=prfPewgMrRQ
  2. This is built around a standalone Docker image (available) and covers Kmeans in detail
  3. https://github.com/DSC-SPIDAL/harp/blob/master/Hands-on-kmeans.md
  4. https://github.com/DSC-SPIDAL/harp/blob/master/Hands-on-NaiveBayes.md
  5. https://github.com/DSC-SPIDAL/harp/blob/master/Hands-on-MFSGD.md
  1. There is a video on the use of Google Cloud https://drive.google.com/open?id=1wl_4kLXDqGXJFYf4qtam4Cc45wRhY1om
  2. Compared to this site, there are a few additional instructions in the direct instructions that can help a user running in a resource-constrained environment like a laptop. These instructions are not present on the tutorial website.
  3. We can still use the interactive website as it contains lots of explanations, examples, etc. The instructions on the website are valid, but users may encounter some problems because of the resource limitations of a laptop.
  4. The Harp-DAAL video only covers the K-Means example. If one can follow that example, the other two examples are straightforward.
  • The SC17 tutorial consists of
  1. Setting up the Docker image that contains Harp-DAAL and Hadoop
  2. K-Means algorithm with an exercise on filling the blanks
  3. NB algorithm
  4. MFSGD algorithm
SPIDAL Tutorial (on Linux -- tested on Ubuntu)
  1. This goes through the use of SPIDAL Clustering and Dimension Reduction as well as the WebPlotViz online visualization system
  • The tutorial consists of installation of the SPIDAL software (including OpenMPI) and the following examples:
  1. Fungi sequence clustering
  2. Pathology data
  • The tutorial material can be found at the links below (all the materials are ready; we are working on the video)
  1. Video: https://youtu.be/ZpYFKGYQ1Uk
  2. https://dsc-spidal.github.io/tutorials/
Original Abstract
Level: Intermediate
Abstract:
We discuss high performance big data computing that supports hardware, algorithms and software allowing the use of the rich functionality of big data systems, such as Apache Hadoop, Spark, HBase, Flink, Heron, and HDFS, on compute architectures ranging from commodity clouds to hybrid HPC clouds and supercomputers, possibly with customized accelerators (e.g., FPGA, GPU, TPU), having performance and security that scales and fully exploits the specialized features (communication, memory, energy, I/O, accelerator) of each different architecture, for applications ranging from pleasingly parallel and MapReduce jobs to classical machine learning (e.g., random forest, SVM, clustering and dimension reduction), deep learning, LDA, and large graph analysis tasks. We expect this area to be of growing importance, and this tutorial covers three aspects of it.
General principles
  • We introduce HPC-ABDS, the High-Performance Computing (HPC) enhanced Apache Big Data Stack (ABDS), which uses the major open source Big Data software environment but develops the principles allowing the use of HPC software and hardware to achieve good performance. We present several big data performance studies.
  • We introduce the Ogres as an approach to classifying big-data applications and use this to explain problem classes that need particular hardware and software support.
  • We present our analysis of the convergence between simulations and big-data applications as well as selected research about managing the convergence between HPC, Cloud, and Edge platforms.
Harp-Daal and SPIDAL (Scalable Parallel Interoperable Data Analytics Library)
  • We introduce a novel HPC-Cloud convergence framework, Harp-DAAL, and demonstrate that the combination of Big Data (Hadoop) and HPC (a Harp plugin for collective communication and DAAL for computation kernels) can simultaneously achieve productivity and performance on large scale data analytics. Harp is a distributed Java-based framework that orchestrates efficient node synchronization. Harp uses DAAL, Intel's Data Analytics Acceleration Library, for its highly optimized kernels on Intel Haswell and KNL architectures. This way, the high-level interfaces of big data tools can be combined with intra-node fine-grained parallelism that is properly optimized for different HPC nodes.
  • Harp-DAAL supports the high performance SPIDAL machine learning library, which currently has 20 members that are being packaged for wide distribution.
  • The tutorial covers both SPIDAL and Harp-DAAL with several examples.
Twister2 Big Data Programming environment
  • We look again at Big Data Programming environments such as Hadoop, Spark, Flink, Heron, Pregel; HPC concepts such as MPI and Asynchronous Many-Task runtimes and Cloud/Grid/Edge ideas such as event-driven computing, serverless computing, workflow and Services.
  • These cross many research communities including distributed systems, databases, cyberphysical systems and parallel computing which sometimes have inconsistent worldviews.
  • There are many common capabilities across these systems which are often implemented differently in each packaged environment. For example, communication can be bulk synchronous processing or data flow; scheduling can be dynamic or static; state and fault-tolerance can have different models; execution and data can be streaming or batch, distributed or local.
  • We suggest that one can usefully build a toolkit (called Twister2 by us) that supports these different choices and allows fruitful customization for each application area. We illustrate the design of Twister2 by several point studies.
  • We describe the status of Twister2, which is an open source project with an Apache 2.0 license. We go through the different parts of Twister2 and how we integrate ideas present in existing HPC and Big Data systems.
  • Twister2 is positioned as an appropriate software environment to support high performance big data computing and includes Harp-DAAL as a critical component to support scalable machine learning.
 
Andre Luckow
added 2 research items
An increasing number of scientific applications rely on stream processing for generating timely insights from data feeds of scientific instruments, simulations, and Internet-of-Things (IoT) sensors. The development of streaming applications is a complex task and requires the integration of heterogeneous, distributed infrastructure, frameworks, middleware and application components. Different application components are often written in different languages using different abstractions and frameworks. Often, additional components, such as a message broker (e.g. Kafka), are required to decouple data production and consumption and to avoid issues such as back-pressure. Streaming applications may be extremely dynamic due to factors such as variable data rates caused by the data source, adaptive sampling techniques or network congestion, and variable processing loads caused by the use of different machine learning algorithms. As a result, application-level resource management that can respond to changes in any of these factors is critical. We propose Pilot-Streaming, a framework for supporting streaming frameworks, applications and their resource management needs on HPC infrastructure. Pilot-Streaming is based on the Pilot-Job concept and enables developers to manage distributed computing and data resources for complex streaming applications. It enables applications to dynamically respond to resource requirements by adding/removing resources at runtime. This capability is critical for balancing complex streaming pipelines. To address the complexity of developing and characterizing streaming applications, we present the Streaming Mini-App framework, which supports different pluggable algorithms for data generation and processing, e.g., for reconstructing light source images using different techniques. We utilize the Mini-App framework to conduct an evaluation of the Pilot-Streaming capabilities.
Different frameworks for implementing parallel data analytics applications have been proposed by the HPC and Big Data communities. In this paper, we investigate three frameworks, Spark, Dask and RADICAL-Pilot, with respect to their ability to support data analytics requirements on HPC resources. We investigate the data analysis requirements of Molecular Dynamics (MD) simulations, which are significant consumers of supercomputing cycles and produce immense amounts of data: a typical large-scale MD simulation of physical systems of O(100,000) atoms can produce from O(10) GB to O(1000) GB of data. We propose and evaluate different approaches for the parallelization of a representative set of MD trajectory analysis algorithms, in particular the computation of path similarity and the identification of connected atoms. We evaluate Spark, Dask and RADICAL-Pilot with respect to the provided abstractions and runtime engine capabilities to support these algorithms. We provide a conceptual basis for comparing and understanding the different frameworks that enables users to select the optimal system for their application. Further, we provide a quantitative performance analysis of the different algorithms across the three frameworks using different high-performance computing resources.
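None of the paper's benchmark code is reproduced here; the sketch below only shows the general task-parallel pattern with Dask, under the assumption that each trajectory has been reduced to a frames-by-features array and using a simple frame-wise distance as a stand-in for the path similarity metric.

```python
import itertools
import numpy as np
import dask
from dask import delayed

# Illustrative inputs: each "trajectory" is a frames x features array.
rng = np.random.default_rng(0)
trajectories = [rng.normal(size=(200, 30)) for _ in range(16)]

def path_distance(a, b):
    """Stand-in metric: mean frame-wise Euclidean distance between two trajectories."""
    n = min(len(a), len(b))
    return float(np.linalg.norm(a[:n] - b[:n], axis=1).mean())

# Build the all-pairs similarity computation as a graph of delayed tasks.
pairs = list(itertools.combinations(range(len(trajectories)), 2))
tasks = [delayed(path_distance)(trajectories[i], trajectories[j]) for i, j in pairs]
values = dask.compute(*tasks, scheduler="threads")

dist = np.zeros((len(trajectories), len(trajectories)))
for (i, j), v in zip(pairs, values):
    dist[i, j] = dist[j, i] = v
```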
Geoffrey Charles Fox
added 2 research items
Ground-penetrating radar on planes and satellites now makes it practical to collect 3D observations of the subsurface structure of the polar ice sheets, providing crucial data for understanding and tracking global climate change. But converting these noisy readings into useful observations is generally done by hand, which is impractical at a continental scale. In this paper, we propose a computer vision-based technique for extracting 3D ice-bottom surfaces by viewing the task as an inference problem on a probabilistic graphical model. We first generate a seed surface subject to a set of constraints, and then incorporate additional sources of evidence to refine it via discrete energy minimization. We evaluate the performance of the tracking algorithm on 7 topographic sequences (each with over 3000 radar images) collected from the Canadian Arctic Archipelago with respect to human-labeled ground truth.
Geoffrey Charles Fox
added 2 research items
We look again at Big Data Programming environments such as Hadoop, Spark, Flink, Heron, Pregel; HPC concepts such as MPI and Asynchronous Many-Task runtimes and Cloud/Grid/Edge ideas such as event-driven computing, serverless computing, workflow, and Services. These cross many research communities including distributed systems, databases, cyberphysical systems and parallel computing which sometimes have inconsistent worldviews. There are many common capabilities across these systems which are often implemented differently in each packaged environment. For example, communication can be bulk synchronous processing or data flow; scheduling can be dynamic or static; state and fault-tolerance can have different models; execution and data can be streaming or batch, distributed or local. We suggest that one can usefully build a toolkit (called Twister2 by us) that supports these different choices and allows fruitful customization for each application area. We illustrate the design of Twister2 by several point studies. We stress the many open questions in very traditional areas including scheduling, messaging and checkpointing.
Supun Kamburugamuve
added a research item
Worldwide data production is increasing both in volume and velocity, and with this acceleration, data needs to be processed in streaming settings as opposed to the traditional store and process model. Distributed streaming frameworks are designed to process such data in real time with reasonable time constraints. Apache Heron is a production-ready large-scale distributed stream processing framework. The network is of utmost importance for scaling streaming applications to large numbers of nodes with reasonable latency. High performance computing (HPC) clusters feature interconnects that can perform at higher levels than traditional Ethernet. In this paper the authors present their findings on integrating the Apache Heron distributed stream processing system with two high performance interconnects, InfiniBand and Intel Omni-Path, and show that they can be utilized to improve the performance of distributed streaming applications.
Geoffrey Charles Fox
added an update
Papers
  1. Supun Kamburugamuve, Karthik Ramasamy, Martin Swany, Geoffrey Fox, "Low latency stream processing: Apache Heron with Infiniband & Intel Omni-Path", technical report September 2017. To be presented at UCC conference at Austin Texas December 5-8, 2017. http://dsc.soic.indiana.edu/publications/Heron_Infiniband.pdf
  2. Kannan Govindarajan, Supun Kamburugamuve, Pulasthi Wickramasinghe, Vibhatha Abeykoon, Geoffrey Fox, "Task Scheduling in Big Data - Review, Research: Challenges, and Prospects" Technical Report October 31 2017 http://dsc.soic.indiana.edu/publications/IEEE_Conference_ICoAC_submitted.pdf
  3. Langshi Chen, Bo Peng, Zhao Zhao, Saliya Ekanayake, Anil Vullikanti, Madhav Marathe, Shaojuan Zhu, Emily Mccallum, Lisa Smith, Lei Jiang, Judy Qiu, "A New Pipelined Adaptive-Group Communication for Large-Scale Subgraph Counting", Technical Report October 22 2017 http://dsc.soic.indiana.edu/publications/Subgraph.pdf
Presentations
  1. Geoffrey Fox, Judy Qiu, Peng Bo, Supun Kamburugamuve, Kannan Govindarajan, Pulasthi Wickramasinghe, "HPC Cloud and Big Data Testbed", presentation at Fudan University Shanghai, November 21 2017 http://dsc.soic.indiana.edu/presentations/Fudan-BigDataTestbed-Nov21-17.pptx
  2. Geoffrey Fox, Judy Qiu, Martin Swany, Thomas Sterling, Gregor von Laszewski, "Engineering Cyberinfrastructure in Intelligent Systems Engineering at Indiana University", presentation at Fudan University Shanghai, November 20 2017 http://dsc.soic.indiana.edu/presentations/Fudan-Cyberinfrastructure-Nov20-17.pptx
  3. Judy Qiu, "Harp-DAAL: A Next Generation Platform for High Performance Machine Learning", Invited talk November 15, 2017 at  SC17  conference, Denver CO 2017 http://dsc.soic.indiana.edu/presentations/Qiu_SC_17_November_15.pptx
  4. Comet Virtualization Team: Trevor Cooper, Dmitry Mishin, Christopher Irving, Gregor von Laszewski (IU) Fugang Wang (IU), Geoffrey C. Fox (IU), Phil Papadopoulos, "Versatile HPC: Comet Virtual Clusters for the Long Tail of Science", Indiana University Booth Presentation by Gregor von Laszewski November 14 2017 at SC17  conference, Denver CO 2017 http://dsc.soic.indiana.edu/presentations/sc17comet-long.pptx
  5. Judy Qiu, "Welcome to HPCDC Tutorial at SC 2017" Booth Tutorial at  SC17  conference, Denver CO 2017 http://dsc.soic.indiana.edu/presentations/Demo%20for%20Harp-DAAL_Framework_and_Applications.pptx
  6. Supun Kamburugamuve, Kannan Govindarajan, Pulasthi Wickramasinghe, Vibhatha Abeykoon, Geoffrey Fox, "Twister2: Design of a Big Data Toolkit", at EXAMPI 2017 workshop, November 12, 2017 at SC17 conference, Denver CO 2017. http://dsc.soic.indiana.edu/presentations/Twister2-EXAMPI17-sc17_nov_12_2017.pptx
  7. Langshi Chen, Mihai Avram, Supun Kamburugamuve, Judy Qiu, "Tutorial: Harp-DAAL: A High Performance Machine Learning Framework for HPC-Cloud", presentation at Intel Developers Conference at SC17 conference, Denver CO 2017, Sunday, November 12, 9:45 AM - 12:00 PM. http://dsc.soic.indiana.edu/presentations/Harp-DAAL_Framework_and_Applications.pptx
  8. Geoffrey Fox, Judy Qiu, Shantenu Jha, Supun Kamburugamuve, Kannan Govindarajan, Pulasthi Wickramasinghe, "Designing a Big Data Toolkit spanning HPC, Grid, Edge and Cloud Computing", Colloquium at SUNY Binghamton October 13 2017 http://dsc.soic.indiana.edu/presentations/Binghamton-Oct13-17.pptx
  9. Judy Qiu, "A High Performance Model-Centric Approach to Machine Learning Across Emerging Architectures", Presentation at Oak Ridge National Laboratory, October 2 2017 http://dsc.soic.indiana.edu/presentations/Qiu_October_2_2017.pptx
 
Geoffrey Charles Fox
added a research item
Data-driven applications are essential to handle the ever-increasing volume, velocity, and veracity of data generated by sources such as the Web and Internet of Things devices. Simultaneously, an event-driven computational paradigm is emerging as the core of modern systems designed for database queries, data analytics, and on-demand applications. Modern big data processing runtimes and asynchronous many-task (AMT) systems from the high performance computing (HPC) community have adopted the dataflow event-driven model. Services are increasingly moving to an event-driven model in which applications are composed of small functions, in the form of Function as a Service (FaaS). An event-driven runtime designed for data processing consists of well-understood components such as communication, scheduling, and fault tolerance. Different design choices adopted by these components determine the type of applications a system can support efficiently. We find that modern systems are limited to specific sets of applications because they have been designed with fixed choices that cannot be changed easily. In this paper, we present a loosely coupled component-based design of a big data toolkit where each component can have different implementations to support various applications. Such a polymorphic design would allow services and data analytics to be integrated seamlessly and expand from edge to cloud to HPC environments.

1 INTRODUCTION
Big data has been characterized by the ever-increasing velocity, volume, and veracity of the data generated from various sources, ranging from web users to Internet of Things devices to large scientific equipment. The data have to be processed as individual streams and analyzed collectively, either in streaming or batch settings, for knowledge discovery with both database queries and sophisticated machine learning. These applications need to run as services in cloud environments as well as on traditional high performance clusters. With the proliferation of cloud-based systems and the Internet of Things, fog computing (1) is adding another dimension to these applications, where part of the processing has to occur near the devices. Parallel and distributed computing are essential to process big data owing to the data being naturally distributed and processing often requiring high performance in the compute, communication, and I/O arenas. Over the years, the High Performance Computing community has developed frameworks such as the Message Passing Interface (MPI) to execute computationally intensive parallel applications efficiently. HPC applications target high performance hardware, including low latency networks, due to the scale of the applications and the required tight synchronous parallel operations. Big data applications have been developed for commodity hardware with the Ethernet connections seen in the cloud. Because of this, they are more suitable for executing asynchronous parallel applications with high computation to communication ratios. Recently, we have observed that more capable hardware comparable to HPC clusters is being added to modern clouds due to increasing demand for cloud applications in deep learning.
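As a small illustration of the event-driven, function-composition style the paper discusses, the following Java fragment registers small functions as event handlers and dispatches events to them. This is a hand-rolled sketch with hypothetical names (EventDrivenSketch, register, dispatch), not the API of any particular FaaS platform or of the proposed toolkit.

    // Minimal event-driven dispatch sketch; names are hypothetical, not a real FaaS API.
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.function.Function;

    public class EventDrivenSketch {
        // Each "function" consumes an event payload and returns a result string.
        private final Map<String, List<Function<String, String>>> handlers = new HashMap<>();

        void register(String eventType, Function<String, String> fn) {
            handlers.computeIfAbsent(eventType, k -> new ArrayList<>()).add(fn);
        }

        void dispatch(String eventType, String payload) {
            for (Function<String, String> fn : handlers.getOrDefault(eventType, Collections.emptyList())) {
                System.out.println(fn.apply(payload));   // a real runtime would schedule this, not run it inline
            }
        }

        public static void main(String[] args) {
            EventDrivenSketch runtime = new EventDrivenSketch();
            // Compose an application from small functions driven by events.
            runtime.register("record.arrived", p -> "parsed: " + p.trim());
            runtime.register("record.arrived", p -> "length: " + p.length());
            runtime.dispatch("record.arrived", " 42,third-street ");
        }
    }

In the component-based design the paper argues for, the scheduling, communication, and fault-tolerance behavior behind such a dispatcher would itself be a replaceable component rather than a fixed choice.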
Geoffrey Charles Fox
added a research item
We review the High Performance Computing Enhanced Apache Big Data Stack (HPC-ABDS) and summarize the capabilities in 21 identified architecture layers. These cover Message and Data Protocols, Distributed Coordination, Security & Privacy, Monitoring, Infrastructure Management, DevOps, Interoperability, File Systems, Cluster & Resource management, Data Transport, File management, NoSQL, SQL (NewSQL), Extraction Tools, Object-relational mapping, In-memory caching and databases, Inter-process Communication, Batch Programming model and Runtime, Stream Processing, High-level Programming, Application Hosting and PaaS, Libraries and Applications, and Workflow and Orchestration. We summarize the status of these layers, focusing on issues of importance for data analytics. We highlight areas where HPC and ABDS have good opportunities for integration.
Geoffrey Charles Fox
added an update
Indiana University (Fox, Qiu, Crandall, von Laszewski), Rutgers (Jha), Virginia Tech (Marathe), Kansas (Paden), Stony Brook (Wang), Arizona State (Beckstein), Utah (Cheatham), "Summary of NSF 1443054: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science", September 15, 2017. http://dsc.soic.indiana.edu/presentations/Dibbs-NSF-Sept15-2017.pptx and http://dsc.soic.indiana.edu/publications/SPIDALSummary.pdf
 
Geoffrey Charles Fox
added an update
  1. Geoffrey Fox, with Judy Qiu, Shantenu Jha, Supun Kamburugamuve, Kannan Govindarajan, Pulasthi Wickramasinghe, "MPI, Dataflow, Streaming: Messaging for Diverse Requirements", 25 years of MPI Symposium at EuroMPI/USA 2017 conference, September 25, 2017 http://dsc.soic.indiana.edu/presentations/EuroMPI-Sept25-17.pptx
  2. Geoffrey Fox, with Judy Qiu, Shantenu Jha, Supun Kamburugamuve, Kannan Govindarajan, Pulasthi Wickramasinghe, "HPC-enhanced IoT and Data-based Grid", keynote, September 14, 2017, BASARIM 2017, 5th Turkish National High Performance Computing Conference, September 14-15, 2017, Yildiz Technical University, Istanbul, Turkey http://dsc.soic.indiana.edu/presentations/Istanbul-Sept14-17.pptx
 
Geoffrey Charles Fox
added an update
Describes the Twister2 project.
Geoffrey Fox at the 13th International Conference on Semantics, Knowledge and Grids (SKG2017), Beijing, China, August 15, 2017
 
Geoffrey Charles Fox
added 2 research items
First International Workshop on Serverless Computing (WoSC) 2017: Report from workshop and panel on the Status of Serverless Computing and Function-as-a-Service (FaaS) in Industry and Research. Geoffrey C. Fox (Indiana University), Vatche Ishakian (Bentley University), Vinod Muthusamy (IBM), Aleksander Slominski (IBM). This whitepaper summarizes issues raised during the First International Workshop on Serverless Computing (WoSC) 2017, held June 5th, 2017, and especially in the panel and associated discussion that concluded the workshop. We also include comments from the keynote and submitted papers. A glossary at the end (section 8) defines many technical terms used in this report. Panel participants: Geoffrey C. Fox (Indiana University), Rodric Rabbah (IBM), Garrett McGrath (University of Notre Dame), Edward Oakes (University of Wisconsin-Madison), Ryan Chard (Argonne National Laboratory), and Ali Kanso (IBM).
Data-driven applications are required to adapt to the ever-increasing volume, velocity and veracity of data generated by a variety of sources including the Web and Internet of Things devices. At the same time, an event-driven computational paradigm is emerging as the core of modern systems designed for database queries, data analytics and on-demand applications. MapReduce has been generalized to Map Collective and shown to be very effective in machine learning. However, one often uses a dataflow computing model, which has been adopted by most major big data processing runtimes. The HPC community has also developed several asynchronous many-task (AMT) systems according to the dataflow model. From a different point of view, the services community is moving to an increasingly event-driven model where (micro)services are composed of small functions driven by events, in the form of Function as a Service (FaaS) and serverless computing. Such designs allow applications to scale quickly as well as be cost effective in cloud environments. An event-driven runtime designed for data processing consists of well-understood components such as communication, scheduling, and fault tolerance. One can make different design decisions for these components that will determine the type of applications a system can support efficiently. We find that modern systems are designed with a monolithic approach and a fixed set of choices that cannot be changed easily afterwards. Because of these design choices, their functionality is limited to specific sets of applications. In this paper we study existing systems (candidate event-driven runtimes), the design choices they have made for each component, and how this affects the type of applications they can support. Further, we propose a loosely coupled component-based approach for designing a big data toolkit where each component can have different implementations to support various applications. We believe such a polymorphic design would allow services and data analytics to be integrated seamlessly and expand from edge to cloud to high performance computing environments.
Supun Kamburugamuve
added a research item
With the ever-increasing need to analyze large amounts of data to get useful insights, it is essential to develop complex parallel machine learning algorithms that can scale with data and the number of parallel processes. These algorithms need to run on large data sets and execute in minimal time in order to extract useful information in a time-constrained environment. Message Passing Interface (MPI) is a widely used model for developing such algorithms in the high-performance computing paradigm, while Apache Spark and Apache Flink are emerging as big data platforms for large-scale parallel machine learning. Even though these big data frameworks are designed differently, they follow the dataflow model for execution and user APIs. The dataflow model offers fundamentally different capabilities from the MPI execution model, but the same type of parallelism can be used in applications developed in both models. This article presents three distinct machine learning algorithms implemented in MPI, Spark, and Flink, compares their performance, and identifies strengths and weaknesses of each platform.
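To make the dataflow-versus-MPI comparison concrete, here is a sketch of one K-Means assignment-and-sum step expressed with Spark's Java API. This is our own illustrative fragment under assumed parameters (tiny in-memory data, local master, helper methods nearest and add), not the article's code. In an MPI formulation of the same step, each rank would compute local per-center sums and combine them with an allreduce; the dataflow version below expresses that combine as a reduceByKey over keyed partial sums.

    // Illustrative one-iteration K-Means step in Spark's Java API (not the article's code).
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.broadcast.Broadcast;
    import scala.Tuple2;

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;

    public class KMeansStepSketch {
        // Index of the nearest center to point p (squared Euclidean distance).
        static int nearest(double[] p, double[][] centers) {
            int best = 0; double bestD = Double.MAX_VALUE;
            for (int c = 0; c < centers.length; c++) {
                double d = 0;
                for (int i = 0; i < p.length; i++) { double diff = p[i] - centers[c][i]; d += diff * diff; }
                if (d < bestD) { bestD = d; best = c; }
            }
            return best;
        }

        // Element-wise sum; the last slot carries the point count.
        static double[] add(double[] a, double[] b) {
            double[] r = new double[a.length];
            for (int i = 0; i < a.length; i++) r[i] = a[i] + b[i];
            return r;
        }

        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("kmeans-step").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                List<double[]> points = Arrays.asList(
                    new double[]{0.0, 0.1}, new double[]{0.2, 0.0},
                    new double[]{5.0, 5.1}, new double[]{5.2, 4.9});
                double[][] centers = {{0.0, 0.0}, {5.0, 5.0}};
                Broadcast<double[][]> bc = sc.broadcast(centers);

                JavaRDD<double[]> rdd = sc.parallelize(points);
                // Key each point by its nearest center, carrying (coordinate sums, count).
                JavaPairRDD<Integer, double[]> partial = rdd.mapToPair(p -> {
                    double[] v = new double[]{p[0], p[1], 1.0};
                    return new Tuple2<>(nearest(p, bc.value()), v);
                });
                // The dataflow analogue of an MPI allreduce: combine partial sums per center.
                Map<Integer, double[]> sums = partial.reduceByKey(KMeansStepSketch::add).collectAsMap();
                sums.forEach((c, s) -> System.out.printf(
                    "center %d -> (%.3f, %.3f)%n", c, s[0] / s[2], s[1] / s[2]));
            }
        }
    }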
Geoffrey Charles Fox
added a research item
The basal topography of the Canadian Arctic Archipelago ice caps is unknown for a number of the glaciers that drain the ice caps. The basal topography is needed for calculating the present sea level contribution using the surface mass balance and discharge method and for understanding future sea level contributions using ice flow model studies. During the NASA Operation IceBridge (OIB) 2014 Arctic campaign, the Multichannel Coherent Radar Depth Sounder (MCoRDS) used a three transmit beam setting (left beam, nadir beam, right beam) to illuminate a wide swath across each glacier in a single pass during three flights over the archipelago. In post-processing we have used a combination of 3D imaging methods to produce images for each of the three beams, which are then merged to produce a single digitally formed wide swath beam. Because of the high volume of data produced by 3D imaging, manual tracking of the ice bottom is impractical on a large scale. To solve this problem, we propose an automated technique for extracting ice bottom surfaces by viewing the task as an inference problem on a probabilistic graphical model. We first estimate layer boundaries to generate a seed surface, and then incorporate additional sources of evidence, such as ice masks, surface digital elevation models, and feedback from human users, to refine the surface in a discrete energy minimization formulation. We investigate the performance of the imaging and tracking algorithms using flight crossovers, since crossing lines should produce consistent maps of the terrain beneath the ice surface, and compare manually tracked "ground truth" to the output of the automated tracking algorithms. We found the swath width at the nominal flight altitude of 1000 m to be approximately 3 km. Since many of the glaciers in the archipelago are narrower than this, the radar imaging in these instances was able to measure the full glacier cavity in a single pass.
Geoffrey Charles Fox
added a research item
Within the last few years, there have been significant contributions to Java-based big data frameworks and libraries such as Apache Hadoop, Spark, and Storm. While these systems are rich in interoperability and features, developing high performance big data analytic applications is challenging. Also, the study of performance characteristics and high performance optimizations is lacking in the literature for these applications. By contrast, these features are well documented in the High Performance Computing (HPC) domain and some of the techniques have potential performance benefits in the big data domain as well. This paper presents the implementation of a high performance big data analytics library - SPIDAL Java - with a comprehensive discussion on five performance challenges, solutions, and speedup results. SPIDAL Java captures a class of global machine learning applications with significant computation and communication that can serve as a yardstick in studying performance bottlenecks with Java big data analytics. The five challenges presented here are the cost of intra-node messaging, inefficient cache utilization, performance costs with threads, the overhead of garbage collection, and the costs of heap allocated objects. SPIDAL Java presents its solutions to these and demonstrates significant performance gains and scalability when running on up to 3072 cores in one of the latest Intel Haswell-based multicore clusters.
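One of the five challenges, the overhead of garbage collection and heap-allocated objects, can be illustrated with the small comparison below between storing points as boxed per-point objects and storing them in a single flat primitive array. This is our own illustration of the kind of issue the paper measures, not SPIDAL Java code; the flat layout allocates nothing in the inner loop and is friendlier to the cache.

    // Illustrative contrast between boxed per-point objects and a flat primitive layout.
    import java.util.ArrayList;
    import java.util.List;

    public class LayoutSketch {
        static final int N = 1_000_000, D = 3;

        // Heap-heavy layout: one Double[] per point, many small allocations for the GC to track.
        static double sumBoxed(List<Double[]> points) {
            double s = 0;
            for (Double[] p : points) for (Double x : p) s += x;   // unboxing on every access
            return s;
        }

        // Flat layout: one primitive array, contiguous in memory, no per-point objects.
        static double sumFlat(double[] coords) {
            double s = 0;
            for (double x : coords) s += x;
            return s;
        }

        public static void main(String[] args) {
            List<Double[]> boxed = new ArrayList<>(N);
            double[] flat = new double[N * D];
            for (int i = 0; i < N; i++) {
                Double[] p = new Double[D];
                for (int d = 0; d < D; d++) { p[d] = 1.0; flat[i * D + d] = 1.0; }
                boxed.add(p);
            }
            // Same result, very different allocation and memory-access behavior.
            System.out.println(sumBoxed(boxed) + " " + sumFlat(flat));
        }
    }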
Geoffrey Charles Fox
added a research item
Accelerated loss of ice from Greenland and Antarctica has been observed in recent decades. The melting of polar ice sheets and mountain glaciers has considerable influence on sea level rise in a changing climate. Ice thickness is a key factor in making predictions about the future of massive ice reservoirs. The ice thickness can be estimated by calculating the exact location of the ice surface and the subglacial topography beneath the ice in radar imagery. Identifying the locations of the ice surface and bottom is typically performed manually, which is a very time-consuming procedure. Here, we propose an approach that automatically detects ice surface and bottom boundaries using distance-regularized level-set evolution. In this approach, the complex topology of ice surface and bottom boundary layers can be detected simultaneously by evolving an initial curve in the radar imagery. Using a distance-regularized term, the regularity of the level-set function is intrinsically maintained, which solves the reinitialization issues arising from conventional level-set approaches. The results are evaluated on a large data set of airborne radar imagery collected during a NASA IceBridge mission over Antarctica and show promising results with respect to manually picked data.
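For readers unfamiliar with the technique, one standard form of a distance-regularized level-set energy (our summary of the widely used general formulation, not necessarily the exact functional in this paper) combines a regularization term that keeps the level-set function close to a signed distance function with edge-based length and area terms:

    E(\phi) = \mu \int_{\Omega} p\!\left(|\nabla \phi|\right) \, dx
            + \lambda \int_{\Omega} g \, \delta(\phi) \, |\nabla \phi| \, dx
            + \alpha \int_{\Omega} g \, H(-\phi) \, dx

Here p is a potential with a minimum at |∇φ| = 1, so the regularity of φ is maintained without reinitialization; g is an edge indicator derived from the radar image; δ and H are the Dirac delta and Heaviside functions; and μ, λ, α weight the three terms. The curve evolution follows the gradient flow of E(φ).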
Geoffrey Charles Fox
added 2 research items
With the advent of Docker containers, application deployment using container images has gained popularity across scientific communities and major cloud providers as a way to ease building reproducible environments. While a single base image can be imported multiple times by different containers to reduce storage consumption through a sharing technique, copy-on-write, duplicates of package dependencies are often observed across containers. In this paper, we propose new approaches to container image management for eliminating duplicated dependencies. We create Common Core Components (3C) to share package dependencies using version control system commands: submodules and merge. 3C with submodules provides a collection of required libraries and tools in a separate branch, while keeping the base image the same. 3C with merge offers a new base image including domain specific components, thereby reducing duplicates in similar base images. Container images built with 3C enable efficient and compact software defined systems and disclose security information for tracking Common Vulnerabilities and Exposures (CVE). As a result, building application environments with 3C-enabled container images consumes less storage compared to existing Docker images. Dependency information for vulnerability tracking is provided in detail for further development.
Building compute environments requires ensuring reproducibility and consistent deployment over time. Compute environments are usually supplied with several software packages, and DevOps conducts software deployment with comprehensive scripts that manage all dependencies, but the execution of applications still needs to be verified to confirm identical results. Software Defined Systems built through automated deployments can perform efficiently using Linux containers, and DevOps and template-based infrastructure provisioning help enable big data software stacks on the cloud and HPC with these approaches.
Supun Kamburugamuve
added a research item
Worldwide data production is increasing both in volume and velocity, and with this acceleration, data needs to be processed in streaming settings as opposed to the traditional store-and-process model. Distributed streaming frameworks are designed to process such data in real time within reasonable time constraints. Twitter Heron is a production-ready, large-scale distributed stream processing framework developed at Twitter. In order to scale streaming applications to large numbers of nodes, the network is of utmost importance. High performance computing (HPC) clusters feature interconnects that can perform at higher levels than traditional Ethernet. In this work the authors present their findings on integrating the Twitter Heron distributed stream processing system with two high performance interconnects: Infiniband and Intel Omni-Path.
Geoffrey Charles Fox
added 3 research items
Status of NSF 1443054 Project
  1. Big Data Application Analysis identifies features of data intensive applications that need to be supported in software and represented in benchmarks. This analysis was started for the proposal and has been extended to support HPC-Simulations-Big Data convergence. The project is a collaboration between computer and domain scientists in application areas in Biomolecular Simulations, Network Science, Epidemiology, Computer Vision, Spatial Geographical Information Systems, Remote Sensing for Polar Science, and Pathology Informatics.
  2. HPC-ABDS, Cloud-HPC interoperable software with the performance of HPC (High Performance Computing) and the rich functionality of the commodity Apache Big Data Stack, was a bold idea developed for the proposal. We have successfully delivered and extended this approach, which is one of the ideas described in the Exascale Big Data report.
  3. MIDAS, the integrating middleware that links HPC and ABDS, now has several components, including an architecture for Big Data analytics and an integration of HPC in communication and scheduling on ABDS; it also has rules for obtaining high performance Java scientific code.
  4. SPIDAL (Scalable Parallel Interoperable Data Analytics Library) now has 20 members with domain specific (general) and core algorithms.
  5. Benchmarks: we reached out to the database community with a keynote and paper at the WBDB2015 Benchmarking Workshop.
  6. Language: SPIDAL Java runs as fast as C++.
  7. We designed and proposed HPCCloud as a hardware-software infrastructure supporting Big Data and Big Simulation convergence, Big Data management via the Apache stack ABDS, and Big Data analytics using SPIDAL and other libraries.
Two major trends in computing systems are the growth in high performance computing (HPC) with an international exascale initiative, and the big data phenomenon with an accompanying cloud infrastructure of well publicized, dramatic and increasing size and sophistication. This tutorial weaves these trends together using some key building blocks. The first is HPC-ABDS, the High Performance Computing (HPC) enhanced Apache Big Data Stack (ABDS). Here we aim at using the major open source Big Data software environment but develop the principles allowing use of HPC software and hardware to achieve good performance. We give several examples of software (for example Hadoop and Heron) and algorithms implemented in this software. The second building block is the SPIDAL library (Scalable Parallel Interoperable Data Analytics Library) of scalable machine learning and data analysis software. We give examples including clustering, topic modeling and dimension reduction and their visualization. The third building block is an analysis of simulation and big data use cases in terms of 64 separate features (varying from data volume to "suitable for MapReduce" to kernel algorithm used). This allows an understanding of what type of hardware and software is needed for what type of exhibited features; it allows one to either unify or distinguish applications across the simulation and Big Data regimes. The final building block is DevOps and Software Defined Systems. These allow one to package software so it runs across a variety of hardware (albeit with varying performance) with just a mouse click. These building blocks are finally linked together as a proposed convergence of Big Data and Exascale Computing. This tutorial builds on the work of a collaboration funded as NSF 14-43054, started October 1, 2014. It contains descriptive material and several explicit hands-on tutorials. Much open source software is available. The tutorial plan is at http://dsc.soic.indiana.edu/publications/SPIDALTutorialProgram-Feb2017.pdf
Geoffrey Charles Fox
added a research item
Ground-penetrating radar systems are useful for a variety of scientific studies, including monitoring changes to the polar ice sheets that may give clues to climate change. A key step in analyzing radar echograms is to identify boundaries between layers of material (such as air, ice, rock, etc.). In this paper, we propose an automated technique for identifying these boundaries, posing this as an inference problem on a probabilistic graphical model. We show how to learn model parameters from labeled training data and how to perform inference efficiently, as well as how additional sources of evidence, such as feedback from a human operator, can be naturally incorporated. We evaluate the approach on over 800 echograms of the Antarctic ice sheets, measuring error with respect to hand-labeled ground truth.
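As a simplified stand-in for the kind of inference the paper performs, the sketch below runs a plain chain-model Viterbi pass over echogram columns with an assumed per-pixel unary cost and a smoothness penalty between neighboring columns. This is our own illustration, not the paper's full graphical model (which also incorporates learned parameters and operator feedback); the class and method names are hypothetical.

    // Simplified boundary tracking on an echogram as chain-structured inference (Viterbi).
    // unary[c][r] is an assumed per-pixel cost (low where the image suggests a layer boundary);
    // the pairwise term penalizes jumps between adjacent columns.
    public class LayerTrackerSketch {

        static int[] trackBoundary(double[][] unary, double smoothness) {
            int cols = unary.length, rows = unary[0].length;
            double[][] dp = new double[cols][rows];
            int[][] back = new int[cols][rows];

            dp[0] = unary[0].clone();
            for (int c = 1; c < cols; c++) {
                for (int r = 0; r < rows; r++) {
                    double best = Double.MAX_VALUE; int arg = 0;
                    for (int rp = 0; rp < rows; rp++) {          // best predecessor row in column c-1
                        double v = dp[c - 1][rp] + smoothness * Math.abs(r - rp);
                        if (v < best) { best = v; arg = rp; }
                    }
                    dp[c][r] = unary[c][r] + best;
                    back[c][r] = arg;
                }
            }
            // Backtrack from the cheapest row in the last column.
            int[] boundary = new int[cols];
            int bestR = 0;
            for (int r = 1; r < rows; r++) if (dp[cols - 1][r] < dp[cols - 1][bestR]) bestR = r;
            boundary[cols - 1] = bestR;
            for (int c = cols - 1; c > 0; c--) boundary[c - 1] = back[c][boundary[c]];
            return boundary;
        }

        public static void main(String[] args) {
            // Tiny synthetic echogram: the cheap ("dark") pixels drift downward across columns.
            double[][] unary = {
                {0.0, 1.0, 1.0, 1.0},
                {1.0, 0.0, 1.0, 1.0},
                {1.0, 1.0, 0.0, 1.0},
                {1.0, 1.0, 1.0, 0.0}};
            System.out.println(java.util.Arrays.toString(trackBoundary(unary, 0.1)));
        }
    }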
Geoffrey Charles Fox
added a research item
The Department of Energy (DOE) Office of Science (SC) facilities, including accelerators, light sources, neutron sources, and sensors that study the environment and the atmosphere, are producing streaming data that needs to be analyzed for next-generation scientific discoveries. There has been an explosion of new research and technologies for stream analytics arising from the academic and private sectors. However, there has been no corresponding effort in either documenting the critical research opportunities or building a community that can create and foster productive collaborations. The two-part workshop series STREAM: Streaming Requirements, Experience, Applications and Middleware Workshop (STREAM2015 and STREAM2016) was conducted to bring the community together and identify gaps and future efforts needed by both NSF and DOE. This report describes the discussions, outcomes and conclusions from STREAM2016: Streaming Requirements, Experience, Applications and Middleware Workshop, the second of these workshops, held on March 22-23, 2016 in Tysons, VA. STREAM2016 focused on Department of Energy (DOE) applications, computational and experimental facilities, and software systems. Thus, the role of "streaming and steering" as a critical mode of connecting the experimental and computing facilities was pervasive throughout the workshop. Given the overlap in interests and challenges with industry, the workshop had significant presence from several innovative companies and major contributors. The requirements that drive the proposed research directions, identified in this report, show an important opportunity for building a competitive research and development program around streaming data. These findings and recommendations are consistent with the vision outlined in the NRC Frontiers of Data report and the National Strategic Computing Initiative (NSCI) [1, 2]. The discussions from the workshop are captured as topic areas covered in this report's sections. The report discusses four research directions driven by current and future application requirements reflecting the areas identified as important by STREAM2016. These include (i) Algorithms, (ii) Programming Models, Languages and Runtime Systems, (iii) Human-in-the-Loop and Steering in Scientific Workflows, and (iv) Facilities.
Geoffrey Charles Fox
added a research item
Modern pyrosequencing techniques make it possible to study complex bacterial populations, such as 16S rRNA, directly from environmental or clinical samples without the need for laboratory purification. Alignment of sequences across the resultant large data sets (100,000+ sequences) is of particular interest for the purpose of identifying potential gene clusters and families, but such analysis represents a daunting computational task. The aim of this work is the development of an efficient pipeline for the clustering of large sequence read sets. Pairwise alignment techniques are used here to calculate genetic distances between sequence pairs. These methods are pleasingly parallel and have been shown to more accurately reflect genetic distances in highly variable regions of rRNA genes than do traditional multiple sequence alignment (MSA) approaches. By utilizing Needleman-Wunsch (NW) pairwise alignment in conjunction with novel implementations of interpolative multidimensional scaling (MDS), we have developed an effective method for visualizing massive biosequence data sets and quickly identifying potential gene clusters. This study demonstrates the use of interpolative MDS to obtain clustering results that are qualitatively similar to those obtained through full MDS, but with substantial cost savings. In particular, the wall clock time required to cluster a set of 100,000 sequences has been reduced from seven hours to less than one hour through the use of interpolative MDS. Although work remains to be done in selecting the optimal training set size for interpolative MDS, substantial computational cost savings will allow us to cluster much larger sequence sets in the future.
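The pairwise kernel at the heart of such a pipeline is global sequence alignment. The following sketch is our own minimal Needleman-Wunsch scoring routine with illustrative match/mismatch/gap values, not the project's tuned implementation or its exact distance definition; it shows the O(nm) dynamic program that is computed independently, and therefore pleasingly in parallel, for every sequence pair.

    // Minimal Needleman-Wunsch global alignment score; scoring parameters are illustrative only.
    public class NeedlemanWunschSketch {
        static final int MATCH = 2, MISMATCH = -1, GAP = -2;   // assumed scoring scheme

        static int align(String a, String b) {
            int n = a.length(), m = b.length();
            int[][] dp = new int[n + 1][m + 1];
            for (int i = 1; i <= n; i++) dp[i][0] = i * GAP;   // leading gaps in b
            for (int j = 1; j <= m; j++) dp[0][j] = j * GAP;   // leading gaps in a
            for (int i = 1; i <= n; i++) {
                for (int j = 1; j <= m; j++) {
                    int sub = dp[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? MATCH : MISMATCH);
                    dp[i][j] = Math.max(sub, Math.max(dp[i - 1][j] + GAP, dp[i][j - 1] + GAP));
                }
            }
            return dp[n][m];
        }

        public static void main(String[] args) {
            // Each pair (i, j) is scored independently, which is why this step parallelizes so well.
            String[] reads = {"ACGTGCA", "ACGTCCA", "TTGTGCA"};
            for (int i = 0; i < reads.length; i++)
                for (int j = i + 1; j < reads.length; j++)
                    System.out.printf("score(%d,%d) = %d%n", i, j, align(reads[i], reads[j]));
        }
    }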
Geoffrey Charles Fox
added 2 research items
The growing use of Big Data frameworks on large machines highlights the importance of performance issues and the value of High Performance Computing (HPC) technology. This paper looks carefully at three major frameworks, Spark, Flink and Message Passing Interface (MPI), both in scaling across nodes and internally over the many cores inside modern nodes. We focus on the special challenges of the Java Virtual Machine (JVM), using an Intel Haswell HPC cluster with 24 cores per node. Two parallel machine learning algorithms, K-Means clustering and Multidimensional Scaling (MDS), are used in our performance studies. We identify three major issues (thread models, affinity patterns, and communication mechanisms) as factors affecting performance by large factors, and show how to optimize them so that Java can match the performance of traditional HPC languages like C. Further, we suggest approaches that preserve the user interface and elegant dataflow approach of Flink and Spark but modify the runtime so that these Big Data frameworks can achieve excellent performance and realize the goals of HPC-Big Data convergence.
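The thread-model point can be illustrated with a small fragment (our own illustration, not the paper's benchmark code) in which each thread accumulates into its own local sum and the partial results are merged once at the end, avoiding synchronization and shared-state contention in the inner loop.

    // Illustrative per-thread partial reduction; not the paper's benchmark code.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class PartialSumSketch {
        public static void main(String[] args) throws InterruptedException, ExecutionException {
            int threads = Runtime.getRuntime().availableProcessors();
            double[] data = new double[10_000_000];
            java.util.Arrays.fill(data, 1.0);

            ExecutorService pool = Executors.newFixedThreadPool(threads);
            List<Future<Double>> partials = new ArrayList<>();
            int chunk = (data.length + threads - 1) / threads;
            for (int t = 0; t < threads; t++) {
                int lo = t * chunk, hi = Math.min(data.length, lo + chunk);
                partials.add(pool.submit(() -> {
                    double local = 0;                        // thread-private accumulator, no sharing
                    for (int i = lo; i < hi; i++) local += data[i];
                    return local;
                }));
            }
            double total = 0;
            for (Future<Double> f : partials) total += f.get();  // single merge step at the end
            pool.shutdown();
            System.out.println(total);
        }
    }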