Chapter

Towards High Performance Data Analytics for Climate Change


Abstract

The continuous increase in the data produced by simulations, experiments and edge components over the last few years has forced a shift in the scientific research process, leading to the definition of a fourth paradigm in Science concerning data-intensive computing. This data deluge introduces various challenges related to big data volumes, format heterogeneity and the speed of data production and gathering, all of which must be handled to effectively support scientific discovery. To this end, High Performance Computing (HPC) and data analytics are both considered fundamental and complementary aspects of the scientific process, and together they contribute to a new paradigm, called High Performance Data Analytics (HPDA), encompassing the efforts of the two fields. In this context, the Ophidia project provides an HPDA framework which joins the HPC paradigm with scientific data analytics. This contribution presents some aspects of the Ophidia HPDA framework, such as its multidimensional storage model and its distributed and hierarchical implementation, along with a benchmark of a parallel in-memory time series reduction operator.
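
To make the benchmarked operation more concrete, the following minimal Python sketch mimics a parallel in-memory time series reduction over fragments of a dataset. It is only an illustration of the idea: the actual Ophidia operator runs inside the framework's server-side runtime, and the function names used here (reduce_fragment, parallel_reduce) are made up for the example.

# Illustrative sketch of fragment-level parallel time-series reduction.
# This is NOT the Ophidia operator: it only mimics the idea of splitting a
# dataset into fragments and reducing each one in parallel, in memory.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def reduce_fragment(fragment: np.ndarray) -> np.ndarray:
    """Reduce every time series in the fragment along the time axis."""
    return fragment.mean(axis=1)

def parallel_reduce(data: np.ndarray, n_fragments: int = 4) -> np.ndarray:
    """Split the rows (time series) into fragments and reduce them concurrently."""
    fragments = np.array_split(data, n_fragments, axis=0)
    with ProcessPoolExecutor(max_workers=n_fragments) as pool:
        reduced = list(pool.map(reduce_fragment, fragments))
    return np.concatenate(reduced)

if __name__ == "__main__":
    data = np.random.rand(1000, 365)    # 1000 synthetic daily time series
    print(parallel_reduce(data).shape)  # (1000,)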


... Each fragment is composed of a set of multi-dimensional binary arrays following a data store implementation based on a NoSQL approach. A more detailed and rigorous description of the storage model is provided in [92]. ...
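
As a toy illustration of this fragment-based layout (the keys, serialization format and helper functions below are invented for the example and are not the actual Ophidia schema), a fragment holding multidimensional arrays can be stored as raw bytes under a key in a NoSQL-style key-value store:

# Toy key-value layout for a fragmented multidimensional dataset.
# Keys and serialization are illustrative only, not the Ophidia schema.
import numpy as np

store = {}  # stands in for a NoSQL key-value back-end

def put_fragment(dataset: str, frag_id: int, array: np.ndarray) -> None:
    # Each fragment holds a set of multidimensional arrays stored as raw bytes.
    store[f"{dataset}/fragment/{frag_id}"] = {
        "dtype": str(array.dtype),
        "shape": array.shape,
        "data": array.tobytes(),
    }

def get_fragment(dataset: str, frag_id: int) -> np.ndarray:
    rec = store[f"{dataset}/fragment/{frag_id}"]
    return np.frombuffer(rec["data"], dtype=rec["dtype"]).reshape(rec["shape"])

put_fragment("tas_day", 0, np.random.rand(16, 365))  # 16 time series x 365 steps
print(get_fragment("tas_day", 0).shape)              # (16, 365)
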
... The tests undertaken focus on the execution of single operators in order to get a better understanding of the runtime system behaviour at the level of intra-task execution (i.e., fragment-level parallelism). The benchmark moves beyond the scalability limits of a few hundred cores already assessed in previous work [90], [92] with former versions of the framework. Moreover, initial experiments targeting inter-task behaviour at the level of the workflow (already performed on former versions of the framework, as reported in [93]) will be carried out in future work, after the full characterization of the framework's scalability at the level of a single operator. ...
... Consequently, given the proposed runtime system, it is best to rely on a higher number of threads rather than on MPI processes, which should only be exploited to scale over multiple nodes in larger-scale scenarios. Overall, the proposed HPDA runtime system and deployment mechanisms have proven to scale effectively over a large number of threads and nodes in a supercomputing environment, overcoming by one order of magnitude the scalability limits that affected previous releases [92]. Moreover, they gave us better insight into the framework behaviour alongside its new runtime system, and helped us identify aspects that need to be further improved and optimized in the future, thus providing important feedback to the software roadmap. ...
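
The thread-versus-process trade-off discussed in the excerpt can be pictured with a small hybrid sketch, assuming mpi4py is installed. It only illustrates the general pattern (MPI processes to scale across nodes, a thread pool over fragments within each process) and is not the Ophidia runtime; all sizes and names are made up.

# Hybrid parallelism sketch: MPI processes across nodes, threads within a process.
# Illustrative only; run e.g. with: mpirun -n 2 python hybrid_reduce.py
import numpy as np
from mpi4py import MPI
from concurrent.futures import ThreadPoolExecutor

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# 64 synthetic fragments of 100 time series x 365 steps, split among MPI ranks
all_fragments = [np.random.rand(100, 365) for _ in range(64)]
my_fragments = all_fragments[rank::size]

# Fragment-level parallelism inside the process via a thread pool
with ThreadPoolExecutor(max_workers=8) as pool:
    local = list(pool.map(lambda f: f.mean(axis=1), my_fragments))

# Gather partial results on rank 0
results = comm.gather(local, root=0)
if rank == 0:
    total = sum(len(r) for r in results)
    print(f"reduced {total} fragments across {size} processes")
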
Article
Full-text available
Over the last two decades, scientific discovery has increasingly been driven by the large availability of data from a multitude of sources, including high-resolution simulations, observations and instruments, as well as an enormous network of sensors and edge components. In such a dynamic and growing landscape, where data continue to expand, advances in Science have become intertwined with the capacity of analysis tools to effectively handle and extract valuable information from this ocean of data. In view of the rapidly approaching exascale era of supercomputers, it is of the utmost importance to design novel solutions that can take full advantage of the upcoming computing infrastructures. The convergence of High Performance Computing (HPC) and data-intensive analytics is key to delivering scalable High Performance Data Analytics (HPDA) solutions for scientific and engineering applications. The aim of this paper is threefold: reviewing some of the most relevant challenges towards HPDA at scale, presenting an HPDA-enabled version of the Ophidia framework and validating the scalability of the proposed framework through an experimental performance evaluation carried out in the context of the Centre of Excellence in Simulation of Weather and Climate in Europe (ESiWACE). The experimental results show that the proposed solution is capable of scaling over several thousand cores and hundreds of cluster nodes. The proposed work is a contribution in support of scientific large-scale applications along the wider convergence path of HPC and Big Data followed by the scientific research community.
... HPC technology is becoming more and more readily available for academic and industrial applications [12]. Recent years have witnessed skyrocketing interest in HPC in many areas, such as healthcare [13], climate studies [14], or space exploration [15], and in general in applications dealing with complex, parallelizable computations as well as the processing of large quantities of data. Comparatively, however, there has been much less research on the use of HPC in audio contexts, and this is especially true for audio in networked settings. ...
Article
The intensification of extreme events, storm surges and coastal flooding in a climate change scenario increasingly influences human processes, especially in coastal areas where sea-based activities are concentrated. Predicting sea level near the coasts, with high accuracy and in a reasonable amount of time, becomes a strategic task. Despite the development of complex numerical codes for high-resolution ocean modelling, the task of making forecasts in areas at the intersection between land and sea remains challenging. In this respect, the use of machine learning techniques can represent an interesting alternative to be investigated and evaluated by numerical modelers. This article presents the application of the Long Short-Term Memory (LSTM) neural network to the problem of short-term sea level forecasting in the Southern Adriatic Northern Ionian (SANI) domain in the Mediterranean Sea. The proposed multi-model architecture based on LSTM networks has been trained to predict mean sea levels three days ahead for different coastal locations. Predictions were compared with observational data collected by tide-gauge devices as well as with the forecasts produced by the Southern Adriatic Northern Ionian Forecasting System (SANIFS) developed at the Euro-Mediterranean Center on Climate Change (CMCC), which provides short-term, daily updated forecasts in the Mediterranean basin. Experimental results demonstrate that the multi-model architecture is able to bridge information far apart in time and to produce predictions with a much higher accuracy than the SANIFS forecasts.
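
As a rough illustration of this kind of model (not the paper's actual multi-model architecture, data or hyperparameters; the window and horizon values below are invented), a minimal Keras LSTM mapping a window of past sea level observations to a value three days ahead could be sketched as follows:

# Minimal LSTM sketch for short-term sea level forecasting.
# Synthetic data and hyperparameters are illustrative, not those of the paper.
import numpy as np
from tensorflow import keras

WINDOW = 72      # hours of past observations fed to the network
HORIZON = 72     # forecast 3 days (72 hourly steps) ahead

# Build (samples, window, features) / (samples,) pairs from a synthetic series
series = np.sin(np.linspace(0, 100, 5000)) + 0.1 * np.random.randn(5000)
X = np.stack([series[i:i + WINDOW] for i in range(len(series) - WINDOW - HORIZON)])
y = series[WINDOW + HORIZON - 1: len(series) - 1]
X = X[..., np.newaxis]

model = keras.Sequential([
    keras.layers.LSTM(64, input_shape=(WINDOW, 1)),
    keras.layers.Dense(1),          # mean sea level HORIZON steps ahead
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2, batch_size=64, verbose=0)
print(model.predict(X[:1], verbose=0).shape)  # (1, 1)
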
Article
Full-text available
Over the past four years, the Big Data and Exascale Computing (BDEC) project organized a series of five international workshops that aimed to explore the ways in which the new forms of data-centric discovery introduced by the ongoing revolution in high-end data analysis (HDA) might be integrated with the established, simulation-centric paradigm of the high-performance computing (HPC) community. Based on those meetings, we argue that the rapid proliferation of digital data generators, the unprecedented growth in the volume and diversity of the data they generate, and the intense evolution of the methods for analyzing and using that data are radically reshaping the landscape of scientific computing. The most critical problems involve the logistics of wide-area, multistage workflows that will move back and forth across the computing continuum, between the multitude of distributed sensors, instruments and other devices at the network's edge, and the centralized resources of commercial clouds and HPC centers. We suggest that the prospects for the future integration of technological infrastructures and research ecosystems need to be considered at three different levels. First, we discuss the convergence of research applications and workflows that establish a research paradigm that combines both HPC and HDA, where ongoing progress is already motivating efforts at the other two levels. Second, we offer an account of some of the problems involved with creating a converged infrastructure for peripheral environments, that is, a shared infrastructure that can be deployed throughout the network in a scalable manner to meet the highly diverse requirements for processing, communication, and buffering/storage of massive data workflows of many different scientific domains. Third, we focus on some opportunities for software ecosystem convergence in big, logically centralized facilities that execute large-scale simulations and models and/or perform large-scale data analytics. We close by offering some conclusions and recommendations for future investment and policy review.
Article
Full-text available
Daniel A. Reed and Jack Dongarra state that scientific discovery and engineering innovation require unifying the traditionally separated fields of high-performance computing and big data analytics. Big data machine learning and predictive data analytics have been considered the fourth paradigm of science, allowing researchers to extract insights from both scientific instruments and computational simulations. A rich ecosystem of hardware and software has emerged for big-data analytics, similar to that of high-performance computing.
Conference Paper
Full-text available
In this paper, we give an overview of the HDF5 technology suite and some of its applications. We discuss the HDF5 data model, the HDF5 software architecture and some of its performance-enhancing capabilities.
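
For readers unfamiliar with the HDF5 data model, a small h5py example shows its hierarchy of groups, chunked and compressed datasets, and attributes; the file and variable names are invented for the illustration.

# Minimal illustration of the HDF5 data model with h5py:
# a hierarchy of groups, chunked/compressed datasets and attributes.
import numpy as np
import h5py

with h5py.File("example.h5", "w") as f:
    grp = f.create_group("climate").create_group("monthly")    # hierarchical groups
    dset = grp.create_dataset(
        "temperature",
        data=np.random.rand(12, 180, 360).astype("f4"),
        chunks=(1, 180, 360),                                  # chunked storage
        compression="gzip",                                    # built-in filters
    )
    dset.attrs["units"] = "K"                                  # self-describing metadata
    f.attrs["source"] = "synthetic example"

with h5py.File("example.h5", "r") as f:
    t = f["climate/monthly/temperature"]
    print(t.shape, t.attrs["units"])                           # (12, 180, 360) K
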
Conference Paper
Full-text available
Multidimensional discrete data (MDD), i.e., arrays of arbitrary size, dimension, and base type, are receiving growing attention in the database community. MDD occur in a variety of application fields, e.g., technical/scientific areas such as medical imaging, geographic information systems, climate research, scientific simulations, and business-oriented applications like OLAP and data mining. In all these application fields the data managed can be modeled as MDD. RasDaMan (Raster Data Management in Databases) is a basic research project sponsored by the European Community in which industrial and research partners collaborate to develop comprehensive MDD database technology. In the approach adopted, the logical and physical levels are strictly separated. A data definition language for multidimensional arrays together with a declarative, optimizable query language allow for powerful associative retrieval. A streamlined storage manager for huge arrays enables fast, efficient access to MDD...
Article
Full-text available
“Exascale eScience infrastructures” will face important and critical challenges, from both computational and data perspectives. Increasingly complex and parallel scientific codes will lead to the production of a huge amount of data. The large volume of data and the time needed to locate, access, analyze and visualize data will greatly impact the scientific productivity of scientists and researchers in several domains. Significant improvements in the data management field will increase research productivity in solving complex scientific problems. Next-generation eScience infrastructures will start from the assumption that exascale high-performance computing (HPC) applications (running on millions of cores) will generate data at a very high rate (terabytes/s). Hundreds of exabytes of data (distributed across several centers) are expected, by 2020, to be available through heterogeneous storage resources for access, analysis, post-processing and other scientific activities.
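
A quick back-of-the-envelope calculation (with purely illustrative figures) makes the quoted rates tangible: a sustained 1 TB/s output stream accumulates roughly 86 PB per day and about 31 EB per year.

# Back-of-the-envelope data volume for a sustained 1 TB/s output rate
# (decimal units: 1 PB = 1000 TB, 1 EB = 1000 PB); figures are illustrative.
TB_PER_S = 1.0
SECONDS_PER_DAY = 86_400
pb_per_day = TB_PER_S * SECONDS_PER_DAY / 1_000   # 86.4 PB/day
eb_per_year = pb_per_day * 365 / 1_000            # ~31.5 EB/year
print(f"{pb_per_day:.1f} PB/day, {eb_per_year:.1f} EB/year")
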
Article
Full-text available
Over the last 20 years, the open-source community has provided more and more software on which the world’s high-performance computing systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. However, although the investments in these separate software elements have been tremendously valuable, a great deal of productivity has also been lost because of the lack of planning, coordination, and key integration of technologies necessary to make them work together smoothly and efficiently, both within individual petascale systems and between different systems. It seems clear that this completely uncoordinated development model will not provide the software needed to support the unprecedented parallelism required for peta/exascale computation on millions of cores, or the flexibility required to exploit new hardware models and features, such as transactional memory, speculative execution, and graphics processing units. This report describes the work of the community to prepare for the challenges of exascale computing, ultimately combining their efforts in a coordinated International Exascale Software Project.
Article
Full-text available
The demands of data-intensive science represent a challenge for diverse scientific communities.
Article
Full-text available
This is a thought piece on data-intensive science requirements for databases and science centers. It argues that peta-scale datasets will be housed by science centers that provide substantial storage and processing for scientists who access the data via smart notebooks. Next-generation science instruments and simulations will generate these peta-scale datasets. The need to publish and share data and the need for generic analysis and visualization tools will finally create a convergence on common metadata standards. Database systems will be judged by their support of these metadata standards and by their ability to manage and access peta-scale datasets. The procedural stream-of-bytes-file-centric approach to data analysis is both too cumbersome and too serial for such large datasets. Non-procedural query and analysis of schematized self-describing data is both easier to use and allows much more parallelism.
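
As a present-day analogy to the non-procedural, schema-aware style advocated here (only an analogy, not what the article itself proposes), a label-based query over a self-describing dataset with xarray replaces the byte-stream, loop-based approach; the variable and coordinate names below are made up.

# Declarative, metadata-driven access to self-describing data with xarray,
# as an analogy to the non-procedural style advocated in the article.
import numpy as np
import xarray as xr

# Build a small self-describing dataset (normally read with xr.open_dataset)
ds = xr.Dataset(
    {"tas": (("time", "lat", "lon"), np.random.rand(12, 4, 8))},
    coords={"time": np.arange(12), "lat": np.linspace(-45, 45, 4),
            "lon": np.linspace(0, 315, 8)},
)

# Query by labels instead of looping over a stream of bytes
subset = ds["tas"].sel(lat=slice(-45, 0)).mean(dim="time")
print(subset.shape)   # (2, 8)
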
Conference Paper
In the context of the EU H2020 INDIGO-DataCloud project, several use cases on large-scale scientific data analysis involving different research communities have been implemented. All of them require the availability of large amounts of data, related either to the output of simulations or to data observed by sensors, and need scientific (big) data solutions to run data analysis experiments. More specifically, the paper presents the case studies related to the following research communities: (i) the European Multidisciplinary Seafloor and water column Observatory (INGV-EMSO), (ii) the Large Binocular Telescope, (iii) LifeWatch, and (iv) the European Network for Earth System Modelling (ENES).
Conference Paper
We present further work on SciSpark, a Big Data framework that extends Apache Spark's in-memory parallel computing to scale scientific computations. SciSpark's current architecture and design includes: time and space partitioning of high-resolution geo-grids from NetCDF3/4; a sciDataset class providing N-dimensional array operations in Scala/Java and CF-style variable attributes (an update of our prior sciTensor class); parallel computation of time-series statistical metrics; and an interactive front-end using science (code & visualization) Notebooks. We demonstrate how SciSpark achieves parallel ingest and time/space partitioning of Earth science satellite and model datasets. We illustrate the usability, extensibility, and early performance of SciSpark using several Earth science use cases, here presenting benchmarks for sciDataset Readers and parallel time-series analytics. A three-hour SciSpark tutorial was taught at an ESIP Federation meeting using a dozen “live” Notebooks.
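
The partition-then-aggregate pattern underlying SciSpark can be sketched with generic PySpark code: this is plain Spark over synthetic arrays, not the sciDataset/sciTensor API, and the grid sizes and grouping are invented for the example.

# Generic PySpark sketch of time partitioning and parallel time-series statistics,
# illustrating the pattern SciSpark applies to NetCDF geo-grids (not its actual API).
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("timeseries-stats").getOrCreate()
sc = spark.sparkContext

# One synthetic 2D grid (lat x lon) per time step
time_steps = [(t, np.random.rand(90, 180)) for t in range(120)]

rdd = sc.parallelize(time_steps, numSlices=12)             # partition along time
stats = rdd.map(lambda kv: (kv[0], float(kv[1].mean())))   # per-time-step spatial mean
monthly = (stats.map(lambda kv: (kv[0] // 30, (kv[1], 1)))
                .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                .mapValues(lambda s: s[0] / s[1]))         # mean of means per 30-step block
print(sorted(monthly.collect()))
spark.stop()
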
Conference Paper
A case study on climate model intercomparison data analysis addressing several classes of multi-model experiments is being implemented in the context of the EU H2020 INDIGO-DataCloud project. Such experiments require the availability of large amounts of data (of multi-terabyte order) related to the output of several climate model simulations, as well as the exploitation of scientific data management tools for large-scale data analytics. More specifically, the paper discusses in detail a use case on precipitation trend analysis in terms of requirements, architectural design solution, and infrastructural implementation. The experiment has been tested and validated on CMIP5 datasets in the context of a large-scale distributed testbed across the EU and US involving three ESGF sites (LLNL, ORNL, and CMCC) and one central orchestrator site (PSNC).
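
A much simplified picture of what a precipitation trend analysis computes (synthetic arrays stand in for CMIP5 output; this is not the actual INDIGO/Ophidia workflow) is a per-grid-point least-squares slope over time:

# Simplified per-grid-point precipitation trend (least-squares slope over time).
# Synthetic data stands in for CMIP5 model output; not the actual workflow.
import numpy as np

years = np.arange(1950, 2006)                       # 56 annual means
pr = np.random.rand(years.size, 45, 90)             # time x lat x lon precipitation

# Fit pr(t) = a*t + b independently at every grid point
t = years - years.mean()
flat = pr.reshape(years.size, -1)                   # (time, points)
slope = np.linalg.lstsq(
    np.column_stack([t, np.ones_like(t)]), flat, rcond=None
)[0][0].reshape(45, 90)                             # trend per grid point

print(slope.shape, float(slope.mean()))
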
Conference Paper
This work presents the I/O in-memory server implemented in the context of the Ophidia framework, a big data analytics stack addressing scientific data analysis of n-dimensional datasets. The provided I/O server represents a key component in the Ophidia 2.0 architecture proposed in this paper. It exploits (i) a NoSQL approach to manage scientific data at the storage level, (ii) user-defined functions to perform array-based analytics, (iii) the Ophidia Storage API to manage heterogeneous back-ends through a plugin-based approach, and (iv) an in-memory and parallel analytics engine to address high scalability and performance. Preliminary performance results of a statistical analytics kernel benchmark performed on an HPC cluster running at the CMCC SuperComputing Centre are provided in this paper.
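
The plugin-based back-end idea in point (iii) can be mirrored by a small Python sketch with an abstract storage interface and an in-memory implementation; the interface, registry and names are invented and do not reflect the actual Ophidia Storage API.

# Conceptual sketch of a plugin-based storage API with an in-memory back-end.
# Interfaces and names are illustrative, not the actual Ophidia Storage API.
from abc import ABC, abstractmethod
import numpy as np

class StorageBackend(ABC):
    @abstractmethod
    def write(self, key: str, array: np.ndarray) -> None: ...
    @abstractmethod
    def read(self, key: str) -> np.ndarray: ...

BACKENDS = {}

def register(name):
    def wrap(cls):
        BACKENDS[name] = cls   # back-ends register themselves by name
        return cls
    return wrap

@register("memory")
class InMemoryBackend(StorageBackend):
    def __init__(self):
        self._data = {}
    def write(self, key, array):
        self._data[key] = np.array(array, copy=True)
    def read(self, key):
        return self._data[key]

backend = BACKENDS["memory"]()                    # plugin selected by name
backend.write("frag0", np.random.rand(8, 365))
print(backend.read("frag0").mean(axis=1).shape)   # (8,)
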
Article
A description and discussion of the SciDB database management system focuses on lessons learned, application areas, performance comparisons against other solutions, and additional approaches to managing data and complex analytics.
Article
This work introduces Ophidia, a big data analytics research effort aimed at supporting the access, analysis and mining of scientific (n-dimensional array based) data. The Ophidia platform extends, in terms of both primitives and data types, current relational database system implementations (in particular MySQL) to enable efficient data analysis tasks on scientific array-based data. To enable big data analytics it exploits well-known scientific numerical libraries, a distributed and hierarchical storage model and a parallel software framework based on the Message Passing Interface to run from single tasks to more complex dataflows. The current version of the Ophidia platform is being tested on NetCDF data produced by CMCC climate scientists in the context of the international Coupled Model Intercomparison Project Phase 5 (CMIP5).
Article
The netCDF Operator (NCO) software facilitates manipulation and analysis of gridded geoscience data stored in the self-describing netCDF format. NCO is optimized to efficiently analyze large multi-dimensional data sets spanning many files. Researchers and data centers often use NCO to analyze and serve observed and modeled geoscience data including satellite observations and weather, air quality, and climate forecasts. NCO's functionality includes shared memory threading, a message-passing interface, network transparency, and an interpreted language parser. NCO treats data files as a high level data type whose contents may be simultaneously manipulated by a single command. Institutions and data portals often use NCO for middleware to hyperslab and aggregate data set requests, while scientific researchers use NCO to perform three general functions: arithmetic operations, data permutation and compression, and metadata editing. We describe NCO's design philosophy and primary features, illustrate techniques to solve common geoscience and environmental data analysis problems, and suggest ways to design gridded data sets that can ease their subsequent analysis.
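
Two typical NCO invocations for hyperslabbing and record averaging, wrapped in Python only to stay consistent with the other sketches; the file and variable names are placeholders, and the NCO manual remains the authoritative reference for options.

# Typical NCO command lines for hyperslabbing and record averaging, wrapped in
# Python for consistency with the other sketches. File names are placeholders;
# consult the NCO documentation for the full option list.
import subprocess

# Extract variable 'pr' and the first 12 time records (hyperslab) with ncks
subprocess.run(
    ["ncks", "-v", "pr", "-d", "time,0,11", "in.nc", "subset.nc"], check=True
)

# Average over the record (time) dimension with ncra
subprocess.run(["ncra", "subset.nc", "mean.nc"], check=True)
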
Conference Paper
SciDB [4, 3] is a new open-source data management system intended primarily for use in application domains that involve very large (petabyte) scale array data; for example, scientific applications such as astronomy, remote sensing and climate modeling, bio-science information management, risk management systems in financial applications, and the analysis of web log data. In this talk we will describe our set of motivating examples and use them to explain the features of SciDB. We then briefly give an overview of the project 'in flight', explaining our novel storage manager, array data model, query language, and extensibility frameworks.
Conference Paper
SciDB is an open-source analytical database oriented toward the data management needs of scientists. As such it mixes statistical and linear algebra operations with data management ones, using a natural nested multidimensional array data model. We have been working on the code for two years, most recently with the help of venture capital backing. Release 11.06 (June 2011) is downloadable from our website (SciDB.org). This paper presents the main design decisions of SciDB. It focuses on our decisions concerning a high-level, SQL-like query language, the issues facing our query optimizer and executor and efficient storage management for arrays. The paper also discusses implementation of features not usually present in DBMSs, including version control, uncertainty and provenance.
Scientific big data analytics challenges at large scale
  • G Aloisio
  • S Fiore
  • I Foster
  • D Williams
Data Warehouse Design: Modern Principles and Methodologies, 1st edn
  • M Golfarelli
  • S Rizzi
CDO user guide - version 1
  • U Schulzweida
The multidimensional database system RasDaMan
  • P Baumann
  • A Dehmel
  • P Furtado
  • R Ritsch
  • N Widmann