Conference Paper

High productivity processing - Engaging in big data around distributed computing


Abstract

The steadily increasing amount of scientific data and the analysis of 'big data' are fundamental characteristics of computational simulations based on numerical methods or known physical laws. This represents both an opportunity and a challenge, at different levels, for traditional distributed computing approaches, architectures, and infrastructures. At the lowest level, data-intensive computing is a challenge because CPU speed has outpaced the I/O capabilities of HPC resources; at higher levels, complex cross-disciplinary data sharing is envisioned via data infrastructures in order to bring together the fragmented answers to societal challenges. This paper highlights how these levels share the demand for 'high productivity processing' of 'big data', including the sharing and analysis of large-scale science data sets. The paper describes approaches such as the high-level European data infrastructure EUDAT as well as low-level requirements arising from HPC simulations used in distributed computing. The paper aims to address the fact that big data analysis methods such as computational steering and visualization, map-reduce, R, and others are available, but much research and evaluation is still needed before they yield scientific insights in the context of traditional distributed computing infrastructures.
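The abstract names map-reduce as one of the candidate analysis methods. As a purely illustrative sketch (the data, keys, and reducer below are hypothetical and not taken from the paper), a map-reduce style aggregation can be expressed in plain Python as a map step, a shuffle/group step, and a reduce step:

```python
from collections import defaultdict

# Hypothetical simulation output: (region_id, measured_value) pairs.
records = [(1, 0.8), (2, 1.4), (1, 0.6), (3, 2.1), (2, 0.9)]

def map_phase(record):
    """Emit one (key, value) pair per input record."""
    region, value = record
    return region, value

def reduce_phase(grouped):
    """Aggregate all values that share a key (here: the mean per region)."""
    return {key: sum(values) / len(values) for key, values in grouped.items()}

# Shuffle/group step: collect mapped values by key.
grouped = defaultdict(list)
for key, value in map(map_phase, records):
    grouped[key].append(value)

print(reduce_phase(grouped))  # e.g. {1: 0.7, 2: 1.15, 3: 2.1}
```

In a production setting this pattern would run on a distributed framework rather than a single process, but the key/value decomposition stays the same.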


Conference Paper
The term 'big data analytics' emerged in order to engage with the ever-increasing amount of scientific and engineering data using general analytics techniques that support the often more domain-specific data analysis process. It is recognized that the big data challenge can only be adequately addressed when knowledge from various fields such as data mining, machine learning algorithms, parallel processing, and data management practices is effectively combined. This paper therefore describes some of the 'smart data analytics methods' that enable high-productivity processing of large quantities of scientific data in order to improve data analysis efficiency. The paper aims to provide new insights into how these fields can be successfully combined. Contributions include the concretization of the cross-industry standard process for data mining (CRISP-DM) process model in scientific environments using concrete machine learning algorithms (e.g. support vector machines for data classification) and data mining mechanisms (e.g. outlier detection in measurements). Serial and parallel approaches to specific data analysis challenges are discussed in the context of concrete earth science application data sets. The solutions also include various data visualizations that provide better insight into the corresponding data analytics and analysis process.
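To make the CRISP-DM concretization more tangible, the following hedged sketch pairs the two techniques named in the abstract, SVM classification and outlier detection; scikit-learn and the synthetic measurement data are assumptions for illustration, not the authors' tooling or data sets:

```python
import numpy as np
from sklearn.svm import SVC, OneClassSVM
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical two-class measurement data (e.g. two land-cover types).
X = np.vstack([rng.normal(0.0, 1.0, (200, 4)), rng.normal(3.0, 1.0, (200, 4))])
y = np.array([0] * 200 + [1] * 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Modeling step: a support vector machine for data classification.
clf = SVC(kernel="rbf").fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))

# Data preparation step: flag unusual measurements with a one-class SVM.
detector = OneClassSVM(nu=0.05).fit(X_train)
outliers = (detector.predict(X_test) == -1).sum()
print("flagged outliers in test set:", outliers)
```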
Article
Full-text available
VisIt is a popular open source tool for visualizing and analyzing data. It owes its success to its focus on increasing data understanding, supporting large data, and providing a robust and usable product, as well as to an underlying design that fits today's supercomputing landscape. In this short paper, we describe the VisIt project and its accomplishments.
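For readers unfamiliar with VisIt, a typical batch workflow can be driven from its Python command-line interface; the sketch below uses standard VisIt scripting calls, but the data file and variable name are placeholders, and the exact attribute names should be checked against the manual for the installed version:

```python
# Run inside the VisIt Python CLI, e.g. "visit -cli -nowin -s script.py",
# where the scripting functions below are available without imports.
OpenDatabase("simulation_output.silo")   # hypothetical data file
AddPlot("Pseudocolor", "pressure")       # hypothetical scalar variable
DrawPlots()

# Save the rendered image to disk.
attrs = SaveWindowAttributes()
attrs.fileName = "pressure_view"
attrs.format = attrs.PNG
SetSaveWindowAttributes(attrs)
SaveWindow()
```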
Conference Paper
Full-text available
Today's large-scale scientific research often relies on the collaborative use of a Grid or e-Science infrastructure (e.g. DEISA, EGEE, TeraGrid, OSG) with computational, storage, or other types of physical resources. One of the goals of these emerging infrastructures is to support the work of scientists with advanced problem-solving tools. Many e-Science applications within these infrastructures aim at simulations of a scientific problem on powerful parallel computing resources. Typically, a researcher first performs a simulation for some fixed amount of time and then analyses the results in a separate post-processing step, for instance by viewing them in visualizations. In earlier work we described early prototypes of a Collaborative Online Visualization and Steering (COVS) Framework in Grids that performs both simulation and visualization at the same time (online) to increase the efficiency of e-Scientists. This paper evaluates the evolved, mature reference implementation of the COVS framework design, which is ready for production usage within Web service-based Grid and e-Science infrastructures.
Conference Paper
The interoperability of e-Science infrastructures like DEISA/PRACE and EGEE/EGI is an increasing demand for a wide variety of cross-Grid applications, but interoperability based on common open standards adopted by Grid middleware is only starting to emerge and is not broadly provided today. In earlier work, we have shown how refined open standards form a reference model based on careful academic analysis of lessons learned from production cross-Grid applications that require access to both High Throughput Computing (HTC) and High Performance Computing (HPC) resources. This paper provides insights into several concepts of this reference model, with a particular focus on findings from using HPC and HTC resources with the fusion application BIT1 and with a cross-infrastructure workflow based on the HELENA and ILSA fusion applications. Based on lessons learned over years of production interoperability setups and experimental interoperability work between production Grids like EGEE, DEISA, and NorduGrid, we illustrate how open Grid standards (e.g. OGSA-BES, JSDL, GLUE2) can be used to overcome several limitations of the production architecture of the EUFORIA framework, paving the way to a more standards-based and thus more maintainable and efficient solution.
Article
Streamline computation in a very large vector field data set represents a significant challenge due to the nonlocal and data-dependent nature of streamline integration. In this paper, we conduct a study of the performance characteristics of hybrid parallel programming and execution as applied to streamline integration on a large multicore platform. With multicore processors now prevalent in clusters and supercomputers, there is a need to understand the impact of these hybrid systems in order to make the best implementation choice. We use two MPI-based distribution approaches based on established parallelization paradigms, parallelization over seeds and parallelization over blocks, and present a novel MPI-hybrid algorithm for each approach to compute streamlines. Our findings indicate that the work sharing between cores in the proposed MPI-hybrid parallel implementation results in much improved performance and consumes less communication and I/O bandwidth than a traditional, non-hybrid distributed implementation.
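The "parallelize over seeds" approach can be sketched with mpi4py as follows; this is a simplified illustration that assumes the whole vector field is available on every rank and uses an analytic field, forward Euler integration, and hypothetical seed counts rather than the paper's hybrid MPI/thread implementation:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

def velocity(p):
    """Analytic stand-in for the vector field (a simple rotation)."""
    return np.array([-p[1], p[0]])

def integrate(seed, steps=100, h=0.01):
    """Advance one streamline from a seed point with forward Euler steps."""
    p = np.array(seed, dtype=float)
    path = [p.copy()]
    for _ in range(steps):
        p = p + h * velocity(p)
        path.append(p.copy())
    return np.array(path)

# Rank 0 generates all seed points, then scatters one chunk to each rank.
seeds = np.random.default_rng(0).uniform(-1, 1, (64, 2)) if rank == 0 else None
my_seeds = comm.scatter(np.array_split(seeds, size) if rank == 0 else None, root=0)

# Each rank integrates its own seeds independently; results are gathered at rank 0.
streamlines = [integrate(s) for s in my_seeds]
all_lines = comm.gather(streamlines, root=0)
if rank == 0:
    print("integrated", sum(len(group) for group in all_lines), "streamlines")
```

In the hybrid scheme studied in the paper, each MPI rank would additionally share its seed work among on-node cores, which is what reduces communication and I/O bandwidth relative to a pure MPI decomposition.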