Chapter

Using Hadoop Ecosystem and Python to Explore Climate Change

Abstract

This article demonstrates the data storage and processing capabilities of the Apache Hadoop ecosystem and its components Hive, Impala, and Sqoop, as well as the capabilities of the Python programming language for data analysis, plotting, and statistical computing. It does so by exploring climate change, one of today's most relevant and detrimental problems. Apache Sqoop was used to migrate data from a relational database management system (RDBMS) into a Hive database, where Hive and Impala were used for data processing and ELT. Finally, the data were analyzed in Python, showing strong evidence of global warming and exploring the relationship between carbon dioxide (CO2) emissions and climate change.
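The final analysis step described above can be illustrated with a minimal sketch. The chapter's own code is not available here, so everything below is an assumption: the helper names `linear_trend` and `pearson_r` are hypothetical, and the numbers are synthetic placeholders, not measurements from the datasets used in the chapter. The sketch fits a least-squares warming trend over time and computes the Pearson correlation between emissions and temperature anomaly.

```python
from math import sqrt


def linear_trend(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


# Synthetic placeholder values for illustration only (NOT real measurements).
years = [2000, 2001, 2002, 2003, 2004]
temperature_anomaly_c = [0.10, 0.25, 0.20, 0.35, 0.40]  # hypothetical anomalies, deg C
co2_emissions = [10.0, 12.0, 14.0, 16.0, 18.0]          # hypothetical emissions, arbitrary units

slope, intercept = linear_trend(years, temperature_anomaly_c)
r = pearson_r(co2_emissions, temperature_anomaly_c)

print(f"Trend: {slope:.3f} deg C per year")
print(f"Correlation between emissions and anomaly: {r:.3f}")
```

A positive slope indicates warming over the period, and a correlation close to 1 indicates that the two series rise together. Correlation alone does not establish causation, so a real analysis would pair it with domain knowledge and longer records.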
