TRIBHUVAN UNIVERSITY
INSTITUTE OF ENGINEERING
PULCHOWK CAMPUS
NEAR REALTIME ASSESSMENT OF FINE SCALE SPATIOTEMPORAL
WEATHER ANOMALIES AND POLLUTION PROXIES AROUND KATHMANDU
VALLEY
By:
Alina Devkota (072/BCT/504)
Saloni Shikha (072/BCT/531)
Spandan Pyakurel (072/BCT/539)
Sushant Gautam (072/BCT/544)
A PROJECT SUBMITTED TO THE DEPARTMENT OF ELECTRONICS AND
COMPUTER ENGINEERING IN PARTIAL FULFILLMENT OF THE REQUIREMENT
FOR THE BACHELOR’S DEGREE IN COMPUTER ENGINEERING
DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING
LALITPUR, NEPAL
AUGUST 6, 2019
The undersigned certify that they have read, and recommended to the Institute of Engi-
neering for acceptance, a project report entitled “Near Realtime Assessment of Fine Scale
Spatiotemporal Weather Anomalies and Pollution Proxies Around Kathmandu Valley”
submitted by Alina Devkota, Saloni Shikha, Spandan Pyakurel and Sushant Gautam in
partial fulfillment of the requirements for the Bachelor’s Degree in Computer Engineering.
Supervisor: Dr. Nanda Bikram Adhikari, Associate Professor
Department of Electronics and Computer Engineering
Institute of Engineering, Pulchowk Campus
Internal Examiner: Dr. Arun Kumar Timalsina, Associate Professor
Institute of Engineering, Pulchowk Campus
External Examiner: Mr. Min Prasad Aryal, Director
Nepal Telecommunications Authority
Coordinator: Mrs. Bibha Sthapit, Deputy Head
Department of Electronics and Computer Engineering
Institute of Engineering, Pulchowk Campus
DATE OF APPROVAL:
COPYRIGHT
The authors have agreed that the Library, Department of Electronics and Computer Engi-
neering, Institute of Engineering, Pulchowk Campus may make this report freely available
for inspection. Moreover, the authors have agreed that permission for extensive copying of
this project report for scholarly purpose may be granted by the supervisors who supervised
the project work recorded herein or in their absence, by the Head of the Department wherein
the project report was done. It is understood that the recognition will be given to the authors
of this project and to the Department of Electronics and Computer Engineering, Pulchowk
Campus, Institute of Engineering in any use of the material of this report. Copying or publication or any other use of this report for financial gain without the approval of the Department
of Electronics and Computer Engineering, Institute of Engineering, Pulchowk Campus and
authors’ written permission is strictly prohibited.
Request for permission to copy or to make any other use of the material in this report in
whole or in part should be addressed to:
Head
Department of Electronics and Computer Engineering,
Institute of Engineering, Pulchowk Campus,
Lalitpur, Nepal
ACKNOWLEDGEMENT
This project is prepared in partial fulfilment of the requirement for the bachelor's
degree in Computer Engineering. First and foremost, we would like to express our sincere
gratitude towards Dr. Nanda Bikram Adhikari, our project supervisor for his constant
guidance, inspiring lectures and precious encouragement. Without his invaluable supervision
and suggestions, it would have been a difficult journey for us. His useful suggestions for this
whole work and cooperative behaviour are sincerely acknowledged.
We would like to thank the Department of Electronics and Computer Engineering at the Institute of Engineering, Pulchowk Campus for providing us the opportunity of a collaborative undertaking. This fourth-year major project has helped us apply the knowledge gained over these years, has greatly enhanced our understanding and has given us a new experience of teamwork.
We would also like to thank all of our friends who have directly and indirectly helped us in
doing this project. Last but not the least, we place a deep sense of appreciation for our family members, who have been a constant source of inspiration for us.
Any kind of suggestion or criticism will be highly appreciated and acknowledged.
Authors:
Alina Devkota
Saloni Shikha
Spandan Pyakurel
Sushant Gautam
ABSTRACT
Nepal, with rugged elevations ranging from less than 100 metres to over 8,848 metres and climates varying from tropical to alpine and perpetual snow, has great potential for the study of highly varying environmental and weather proxies. Fine spatio-temporal scale measurements of such data using sufficiently distributed automatic weather stations are essential for such study. This report presents the methodology and algorithms implemented for studying and modelling weather and environment. In this project, the advantages of current information technology have been explored for profiling very fine spatio-temporal scale weather and environmental pollution data along some major road lines of the country using mobile sensor instrumentation. To generate insights from varying environmental and pollution data, big data technology along with predictive analysis is used to process enormous volumes of complex data, establish correlations when required and provide near real-time mapping analytics.
The collected data have high dimension and large volume, and thus require a robust data logging system, a proper data warehouse for storage and a distributed system for data handling. Encryption algorithms have also been explored for data security and integrity during transmission over public networks. For the analysis of these spatio-temporal data, preliminary data mining steps such as anomaly detection, removal of duplicate data and outliers, and handling of missing data are followed by various supervised and unsupervised machine learning algorithms. The preliminary data-set collected from the area around Kathmandu valley maps some interesting features and environmental proxies, which are visualised, and the patterns and variations in them are explored using models such as ARIMA and RNN.
Keywords: Spatio-Temporal Resolution, Mobile Data Logger, Big Data, Time Series Forecasting Model, ARIMA, ANN, LSTM, Pollution Proxies, Environmental Anomalies and Classifiers
TABLE OF CONTENTS
TITLE PAGE
LETTER OF APPROVAL
COPYRIGHT
ACKNOWLEDGEMENT
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
1 INTRODUCTION
1.1 Background
1.2 Motivation
1.3 Objectives
1.4 Problem statement
1.5 Scope of Project
1.6 Understanding of Requirement
1.7 Organisation of the Report
2 LITERATURE REVIEW
3 THEORETICAL BACKGROUND
3.1 General Synopsis
3.2 D'Agostino's K² Test
3.3 Omnibus K² statistic
3.4 Dickey-Fuller Test
3.5 Cluster Analysis
3.5.1 K-means clustering
3.5.2 X-means clustering
3.6 Outlier Detection
3.7 Predictive Models
3.7.1 ARIMA and seasonal ARIMA models
3.7.2 Autocorrelations
3.7.3 Time series modelling using regression
4 METHODOLOGY
4.1 Software Development Approach
4.2 System Block Diagram
4.3 Data Collection and Visualisation
4.3.1 Data retrieval
4.3.2 Data cleaning
4.3.3 Data aggregation
4.3.4 Data security
4.3.5 Data endpoints
4.3.6 Data visualisation
4.4 Data Analysis
4.4.1 Classification
4.4.2 Regression analysis
4.4.3 Clustering
4.4.4 Artificial neural network
4.4.5 Time series forecast modelling
5 SYSTEM DESIGN
5.1 Requirement Specification
5.1.1 Functional requirements
5.1.2 Non-functional requirements
5.2 Feasibility Assessment
5.2.1 Operational feasibility
5.2.2 Technical feasibility
5.2.3 Economic feasibility
5.2.4 Legal feasibility
5.2.5 Scheduling feasibility
5.3 Use Case Diagram
5.4 Activity Diagram
5.5 Class Diagram for System
5.6 Class Diagram for Data
5.7 Database Schema
5.8 Sequence Diagram
5.9 Communication Diagram
5.10 Data Flow Diagram
5.11 Deployment Diagram
6 TOOLS AND TECHNOLOGIES
6.1 Python
6.2 Django
6.3 NumPy
6.4 Pandas
6.5 HTML/CSS
6.6 JavaScript
6.7 PostgreSQL
6.8 Git
6.9 Leaflet
6.10 RapidMiner
6.11 Hadoop
6.12 WebSocket
6.13 REST Framework
7 RESULTS AND DISCUSSIONS
7.1 Statistical Analysis
7.1.1 Environmental anomalies
7.1.2 Pollution proxies
7.2 Modelling Using RNN
7.3 Modelling Using ARIMA
8 CONCLUSION
9 LIMITATIONS AND FUTURE ENHANCEMENTS
9.1 Limitations
9.2 Future Enhancements
REFERENCES
A APPENDIX
List of Figures
3.1 An example of outlier detection
3.2 A typical RNN cell
3.3 Overview of feature extraction model and forecast model using LSTM architecture
3.4 LSTM architecture that takes in two inputs, output from the last hidden state and observation
4.1 Design thinking with scrum software development cycle
4.2 Scrum software development cycle
4.3 System block diagram
4.4 Project methodology
4.5 Visualisation work-flow
4.6 Overall methodology from remote data collection to visualisation
4.7 Overview of data collected from our sensors in logarithmic scale
4.8 Output from our API endpoint
4.9 Integrated tool used for data visualisation and analysis
4.10 Classification for mapping an input attribute set x into its class label y
4.11 Clustering of data-points
4.12 Basic structure of artificial neural network
4.13 An example of time series forecasting
5.1 Use-case diagram for web visualisation
5.2 Activity diagram for web visualisation
5.3 Class diagram for system
5.4 Class diagram for data
5.5 Database schema
5.6 Sequence diagram
5.7 Communication diagram
5.8 Data flow diagram for web visualisation
5.9 Deployment diagram for web visualisation
7.1 Data distribution and skewness test for temperature
7.2 Time series plot of temperature
7.3 Box plots of temperature yearly and quarterly
7.4 Normal probability plot for distribution of temperature
7.5 Average temperature re-sampled over one day and week
7.6 Average temperature re-sampled over one month, quarter and year
7.7 Mean temperature grouped by year, quarter, month and day
7.8 Temperature by years
7.9 Rolling mean and standard deviation
7.10 Data distribution and skewness test for CO2
7.11 Time series plot of CO2
7.12 Normal probability plot for distribution of CO2
7.13 Mean CO2 level grouped by year, quarter, month and day
7.14 CO2 emission in weekdays and weekends
7.15 Prediction of temperature variation using RNN
7.16 LSTM tensor graph used in training
7.17 Statistics of the air temperature data studied to observe its skewness, noise and correlations
7.18 Decomposition of the time-series air temperature data into trend, seasonality and noise
7.19 One-step ahead forecast using continuous training and prediction of temperature variation
7.20 Plot of auto-correlation function obtained from the data
7.21 Continuous training and prediction of temperature variation using ARIMA
7.22 Using ARIMA model to predict temperature variations
A.1 Project schedule
A.2 Home page
A.3 Login page
A.4 Screenshot of dashboard
A.5 Screenshot of website
A.6 Data visualisation tool
A.7 About us page
A.8 Sensor mounted on the top of vehicle for collecting weather data
A.9 Bus with sensor mounted on top that is used for collection of data from the route of Kathmandu to Pashupatinagar
A.10 Data collection route path from Kathmandu to Pashupatinagar
A.11 Epoch loss graph while LSTM model training
A.12 Hyper parameter tuning output for various architectural variables
A.13 Inside a LSTM layer of the LSTM model
List of Tables
7.1 Raw data for weather (Kathmandu from 2010 to 2019)
7.2 Data for weather after removing missing values and aggregation
7.3 Temperature distribution
7.4 Results of Dickey-Fuller test for temperature
7.5 Raw data for pollution logged by our data logger (Kathmandu from 2018 to 2019)
7.6 Data for pollution after removing missing values and aggregation
7.7 CO2 distribution
7.8 Results of Dickey-Fuller test for temperature
7.9 Comparison of various structures of neural network
LIST OF ABBREVIATIONS
ACF Auto-correlation Function
AI Artificial Intelligence
ANN Artificial Neural Network
API Application Programming Interface
AR Auto-Regressive
ARIMA Auto-Regressive Integrated Moving Average
BIC Bayesian Information Criterion
CNN Convolutional Neural Network
CSS Cascading Style Sheet
DHM Department of Hydrology and Meteorology
ECC Elliptic Curve Cryptography
GIS Geographic Information Systems
GPS Global Positioning System
GSM Global System for Mobile
HTML Hypertext Markup Language
HTTP Hypertext Transfer Protocol
JS JavaScript
KDD Knowledge Discovery in Databases
KDE Kernel Density Estimation
LSTM Long Short Term Memory
MA Moving Average
MAE Mean Absolute Error
ML Machine Learning
MLP Multilayer Perceptron
MSE Mean Square Error
MVT Model View Template
ORDBMS Object-Relational Database Management System
OS Operating System
REST Representational State Transfer
RMSE Root Mean Square Error
RNN Recurrent Neural Network
SQL Structured Query Language
UI User Interface
VCS Version Control System
WHO World Health Organization
WMO World Meteorological Organisation
1. INTRODUCTION
Environmental pollution refers to the contamination of ecosystems and the surrounding atmosphere by different forms of pollutants. Climate change refers to the variation in normal weather patterns caused by pollution. The issue of environmental pollution and climate change has become an international concern due to their unfavourable effects on the physical and biological entities of the environment. Air pollution and climate change influence each other through complex interactions in the atmosphere: air quality can impact climate change and, conversely, climate change can impact air quality. Pollution problems and climate change impacts have progressively worsened around the globe. The World Health Organisation estimates around 1.7 million child deaths due to the polluted environment[1]. Around 90% of the world's population live in areas where air quality exceeds WHO guideline limits. In addition, the latest urban air quality database shows that 98% of cities in low- and middle-income nations like Nepal do not meet WHO air quality guidelines. However, in high-income countries, that percentage decreases to 56%.
The world has seen a lot of changes over the last century. Industrialisation and urbanisation have brought a huge change in how people live and consume natural resources. A recent report has found that over the past two decades we have destroyed a tenth of the world's wilderness. Land conversion for settlement, logging for wood, agriculture, and mineral exploration are the key reasons behind this. Any farmer can tell from experience that the natural patterns of climate have changed, and this is no longer surprising. The planet is getting hotter. Research and evidence point to humans as the driver of global climate change. Whether the cause is human activity or natural variability, it is evident that thermometer readings have increased steadily since the beginning of the industrial era.
The climate and environment form a complicated, interactive system consisting of numerous components including the atmosphere, land surface, water bodies, land-forms and living ecosystems. The atmospheric component of this system is what we generally refer to as climate: the climate is often defined as 'average weather'. A combination of observations from various equipment and sensors can be used to construct models which can further be used to understand climate science.
Scientists have contributed a lot to the understanding of atmospheric processes in the last few decades. An accelerating rate of progress has been seen, particularly in field research, and notably through the evolution of advanced methodologies and tools for measuring climate variations, including the climate prediction models and observations that encourage and enable such analysis.
1.1. Background
The project focuses on the analysis and visualisation of the weather conditions and pollution
in different places around Kathmandu valley with the help of information gathered through
existing weather stations and explores the possibilities of deploying advanced data loggers
for effective continuous data collection in the context of Nepal. Currently, in Nepal, various organisations have deployed on-site/remote sensors and equipment for meteorological data collection. The Department of Hydrology and Meteorology (DHM), whose equipment meets WMO (World Meteorological Organisation) standards, has three regional climate offices at Dharan, Pokhara and Surkhet and three basin offices at Biratnagar, Bharatpur and Nepalgunj, basically for hydrological observation. With a total of sixteen synoptic stations and numerous hydrological and meteorological stations throughout the country, DHM observes the climatic conditions and variations. Nepal's capital city Kathmandu, with a pollution index of 96.05, was ranked the second most polluted city in the world and the most polluted city in Asia according to the 2018 Mid-Year Pollution Index. The situation is no longer just a general topic of conversation, as it has started posing serious threats to the health of the general public. According to the DoE (Department of Environment), the particulate matter (PM2.5) level at Kathmandu's heart, Ratnapark, is beyond 105 µg/m³, listing the capital city of this 'naturally rich and beautiful country' as one of the most harmful cities to reside in[2].
Meteorological data collection is not a new topic, but recent developments in the data collection process, from remote manual stations to automatic weather and hydrological stations reporting to a real-time central database server using GPRS-based internet and SMS facilities, have shown a way towards the development of multipurpose, highly advanced, mobile and portable remote data collection equipment. Public transportation equipped with remotely operated weather and pollution sensors can be utilised for data collection across public road networks. Ever since the evolution of probe vehicles and sensor-equipped transportation, such vehicles have been used for road planning and driving behavioural studies[3].
On the other hand, as sophisticated Advanced Driver Assistance Systems and Autonomous Vehicles advance, the number of sensor-equipped vehicles will rise drastically. The data collected in real time from these vehicles, which also includes their GPS traces, speed, the wear and tear on components and even road conditions, can be used for numerous purposes, such as analysing telematics and driver behaviour in real time to maintain the vehicle's performance, efficiency and safety. It can also provide cities with important information about real-time traffic volume and roadway architectures[4]. However, there can be serious privacy concerns around such data collection and possible cases of privacy violations; it is still a challenge to collect those data securely.
1.2. Motivation
Although the least urbanised country in Asia[5], Nepal is still a fast-growing country. Urbanisation and migration towards cities with more sophisticated facilities have influenced a major share of Nepal's population, making the capital the country's most populated and polluted city. The city is holding people far beyond what the current infrastructure can support. Unplanned drainage, a congested vehicle network, the use of land and rivers for disposing of heavy amounts of pollutants, and sheer overcrowding have left the city barely liveable and a heavy burden on the environment.
At the same time, the Government of Nepal is concerned about the situation of the Kathmandu valley and has installed pollution meters in various places. As students in the field of computer science and technology, we took the monitoring and visualisation of pollution and weather levels as a great responsibility. We also thought that the concerned authorities might find these resources useful in planning and in mitigating the conditions.
1.3. Objectives
There is no single method or instrument for measuring weather and pollution proxies. In-
stead, there are numerous methods in use, instruments installed and organisations actively
involved around the globe for the same purpose. In studying and modelling climate and
environment, it is essential to combine many diverse disciplines, including meteorology,
geomorphology, geology, oceanography and paleoclimatology. Apart from combining inter-
disciplinary studies, observations and measurements should be assembled continuously over
a significant period of time, using various measuring techniques. This project aims at the broader objective of devising a concrete framework for related data collection and warehousing, maintaining data integrity, analysing collected data using mathematical models and developing interfaces to present the outcomes. The main objectives of this project can be summed up as:
1. To study existing remote sensors and data loggers currently deployed for measuring weather and pollution related proxies and to explore possible enhancements in them for effective and centralised real-time data collection.
2. To develop a proof-of-concept advanced sensor for remote data sensing and logging and to study the possibilities of deploying various advanced data loggers with relevant sensors.
3. To remotely collect continuous data about the weather and pollution conditions in different places around Kathmandu valley and to analyse it.
4. To explore and collect various publicly available data sources for weather and pollution related proxies that have been published in various journals and stored in various warehouses.
5. To design an efficient data warehouse for centralised access to distributed data.
6. To visualise the data obtained from various sources and to explore the patterns and variations in them using various mathematical models.
7. To study the development of mathematical models capable of predicting weather behaviour using Artificial Neural Networks (ANN), ARIMA, etc. and to verify whether the models have the potential for successful application to weather forecasting.
1.4. Problem statement
Nepal, being a developing country, has been facing various difficulties in adapting to changing climatic conditions. Further, the increasing level of pollution has adverse effects on the health of people and animals, on cultural heritage and on natural resources. But the study of climatic and pollution conditions in Nepal has not been done very effectively, since the process depends on various data from a variety of data sources and equipment. So massive data collection, its visualisation and the prediction of future conditions are a must.
Pollution level is a quantity that varies from place to place, and measuring its variation at ground level is a challenge in itself. Measuring the level of PM2.5 concentration in the air we breathe and exploring its key sources is the door to controlling the presence of harmful compositions in the air. Although on-ground monitoring of pollution proxies, including PM2.5, requires advanced, complex and costly hardware, high funding, and technical and physical effort, there are no better or more accurate alternatives to ground-level monitoring. Powerful satellite-driven technologies are also used as a supplementary tool for estimating large-scale exposure to pollutants. With the present progress in measuring pollution in Kathmandu, comprising limited fixed on-ground equipment, it is really difficult to build up a pollution profile of the whole city.
In this context, public vehicles equipped with advanced sensors can be a good option for data collection that can measure both quantitative and spatial variations. However, obtaining and storing public vehicle GPS traces can raise serious privacy issues. The size of the data pooled continuously from numerous sensors, with public GPS traces and various parameters, is also very large, and the task of maintaining such a huge database is challenging. Likewise, maintaining the integrity of the database is also important. If the number of mobile sensors is high, a firm architecture is required to sustain the remote collection of data from mobile sensors and its proper storage.
The amount of research in this field has not been as ample as it needs to be, because studying the environment and its components is a complex topic in itself. There is a variety of parameters involved in the ecosystem which contribute equally as environmental constituents. Our science and current human progress are far from being able to measure all those components and simulate our environment. However, with advancements in technology and science, researchers and scientists have achieved better ways to study the key components that are of most interest.
On the other hand, in a developing country like Nepal, whose capital ranks second among the most polluted cities in the world, progress in this field is poor due to ignorance of the topic by stakeholders, limitations in financial funding, lack of sensors, hardware and equipment, lower priority owing to underdeveloped infrastructure, and the small number of researchers involved in the field.
1.5. Scope of Project
The project focuses on data retrieval, collection, visualisation and looks forward to the pre-
diction of weather and pollution proxies using mathematical modelling. The collection of
data required for processing is also done through the sensors in the vehicles around the
Kathmandu valley. The methods used for the deployment of various advanced data loggers to collect continuous data about weather and pollution conditions can be extended to other projects as well, where data collection is an important step, and can be used for other applications. Databases and architectures can be designed to provide users access to the data through a public API interface.
In this project, we will be dealing with spatial data of GPS traces of public vehicles, which
is an opportunity for us. Such data can not only be used for locating vehicles but also
for traffic patterns, road traffic density predictions, road planning and driving behavioural
studies. Collecting traces of public vehicles in real-time is another opportunity for us which
can further be used for public surveillance and data analysis. However, in this project, we
are deeply focused on data visualisations and processing for pollution and weather proxies.
On the other hand, this project covers the warehousing of public vehicle traces, which is very important as well as sensitive from a privacy point of view. Maintaining data integrity for such a large and possibly distributed data resource, to provide a reliable and trustworthy way of storing sensitive data, is another valuable part of this project. Advanced technologies for data warehousing such as the big data stack can be explored to provide solutions for data storage. Such a study can also help in formulating a concrete distributed data framework at the national level for warehousing public vehicle traces, which can open the door to a wider area of research including optimising vehicle traffic and road planning.
Likewise, by implementing data integrity checks, we can address security risks by proactively detecting threats caused by data tampering or data corruption. Using such implementations, we can design a publicly usable service that ensures data has not been altered since it was signed. Such a robust system can also be implemented as a national project under the government to provide evidence and proof of legal and regulatory compliance.
In addition, another important aspect of this project is the analysis and mathematical modelling of the collected data. Through prediction of the behaviour and rate of climate change, knowledge of the most polluted, moderately polluted and least polluted areas can be retrieved. Also, knowledge related to different weather parameters can be retrieved. Such information can be very useful to stakeholders for strategic planning and the decision-making process. Insights based on those data might also help to identify pollution sources around the city and support the authorities in taking actions for mitigation.
The visualisation of weather proxies and pollution level depends on the data collected from
the mobile sensors as well as stationary stations and the derived data. Sufficient and correct
data is the most important requirement in order to model the proxies accurately and to predict weather and pollution conditions around the Kathmandu valley correctly.
1.6. Understanding of Requirement
The system designed as a result of the project is required to visualise spatio-temporal weather and pollution data across various places in Nepal. The available sensors should be studied and made more sophisticated for the purpose of measuring accurate and precise data. The nature of the data parameters and their volume must be examined carefully to design the database architecture; for huge volumes of data, a big data architecture should be used. Raw data obtained from sensors should be carefully observed to check their quality for further processing. Data cleaning should be done to remove errors, fix missing values and remove highly correlated data attributes and outliers in the data-set.
The visualisation methods used should be clear enough that anyone viewing the visualisations is able to understand them. To make visualisations clear, a combination of the available tools needs to be applied, as the data-set consists of different types of data attributes. For the security of data, encryption algorithms should be implemented while transmitting over a public network. Also, for access to important features of the system, an authentication method should be applied, where only privileged users are able to access the important features of the system while general users are able to access some of the features using sampled data only.
1.7. Organisation of the Report
The organisation of the report is done in the following ways:
1. Chapter 1: It includes the introduction to the problem and the method we are trying to employ to solve it.
2. Chapter 2: It includes the literature review, which covers works related to the project and the notable works prevailing prior to this project's development, with their results.
3. Chapter 3: It includes the theoretical background for the development of the project.
4. Chapter 4: It includes methodology used for the development of the project.
5. Chapter 5: It includes system design techniques along with the use case, activity dia-
gram used for the development of the system.
6. Chapter 6: It includes tools and technologies used for the development of the system.
7. Chapter 7: It includes the analysis and the result of the experiment we tried in the
project.
8. Chapter 8: It describes the conclusion of the project.
9. Chapter 9: It includes the limitations and future enhancements of the project.
2. LITERATURE REVIEW
Several works have been done previously in the field of data collection and air quality measurements. OpenSense[6], a Nano-Tera project, exploits the crowd-sourcing technique, where users are incentivised to make data based on physical measurements such as location and pollution available through their monitoring assets or personal mobile devices. A dispersion model has also been used to compute an air pollution map, which helps to assess the quality of the sensor data and to check how suitable they are for measuring the pollution level [7].
As Geographic Information Systems (GIS) and data modelling techniques are becoming popular and powerful tools in the analysis of environmental proxies, spatial data on environmental variables are highly valued and in demand[8]. For a statistical approach, estimation models are constructed based on measurements. In the literature, data-sets are collections of data resulting from continuous and simultaneous measurements from a number of sources. Spatial interpolation has been practised using various techniques including Gaussian process regression, also known as Kriging[9]. With this advanced geostatistical procedure, an estimated surface can be generated from a scattered set of points with z-values. Likewise, land-use information can also be one of the parameters used to build up the model[10]. GPS traces and statistical analysis of in-vehicle GPS trace data have been used for various purposes including incremental map generation[11]. Instead of using stationary sensors, GPS-enabled mobile sensors are in use as an economical way to carry out road mapping and spatial data collection, and have proven to have high potential for real-time, GPS-trace-embedded remote data collection[12].
Storing publicly collected data, including GPS traces, with the vision of a data-driven society has already become common. Leaving the benefits of a data-driven society to one side, there is growing concern worldwide about privacy and the misuse of such data[13]. There has been research and various attempts to address sensitive and privacy-related issues[14], both from a stakeholder and from a technical viewpoint. A centralised data house stores a massive amount of information over which the data-contributing party has little or no control. Likewise, the validity and integrity of the stored data is also an important topic to be considered[15]. The data that is stored, and how it is used, should always be secure and correct.
There are a number of ways that can be used for the visualisation of the data obtained.
Various approaches have been used for the clustering of existing weather data which may
further be used for weather forecasting. S. Chakraborty et al.[16] have used the K-means clustering algorithm in weather forecasting, clustering the data with the use of a dynamic database.
3. THEORETICAL BACKGROUND
3.1. General Synopsis
The project requires knowledge of interdisciplinary sub-fields of computer science such as
networking, security, database, web development, visualisation techniques, distributed stor-
age, artificial intelligence, remote sensing and probability and statistics with an overall goal
to extract information from a data set and transform the information into a comprehensible
structure for further use.
Randomness and uncertainty are inherent in the world, and understanding them can prove immensely helpful for knowing the chances of various events. Probability helps in making informed decisions about the likelihood of events based on patterns in collected data.
Statistics is the study of the collection, organisation, analysis and interpretation of data. Statistical inferences are often used to analyse or predict trends from data, and these inferences use probability distributions of the data. Descriptive statistics together with probability theory can help in making useful decisions.
Data visualisation is the presentation of data in a pictorial or graphical format. It enables data scientists to see analytics presented visually, so they can grasp difficult concepts or identify new patterns. With interactive visualisation, one can take the concept a step further by using technology to drill down into charts and graphs for more detail, interactively changing what data is seen and how it is processed.
Web development is the work involved in developing a web site for the Internet (World
Wide Web) or an intranet (a private network). Web development can range from developing
a simple single static page of plain text to complex web-based internet applications (web
apps), electronic businesses, and social network services. Many front-end development tools
accelerate web development. Front-end development is a foremost part of the web and has matured into multiple roles in recent years. These tools help improve user engagement, site efficiency and website appearance; all these factors help enhance visibility on the digital platform.
But the technology and programming that power a site, what the end user does not see but what makes the site run, is called the back end. Consisting of the server, the database, and the server-side applications, it is the behind-the-scenes functionality of a site.
With recent advances in user interface (UI) technologies, expectations about the UI have increased considerably. Common frameworks for data scientists to create rich web applications require basic skills in web development. Front-end analytics require analysis performed on all available data to gain insight into patterns and interesting information in the data.
There are a number of different approaches available for facilitating rapid data access, the
major choices being flat files, traditional databases, and the emergent NoSQL paradigm.
Each of these designs offers different strengths and weaknesses based on the structure of the
data stored and the skills of the analysts involved. Flat files, relational databases, Hadoop, MongoDB, Redis, etc. are some such systems.
Choosing the right data system is a function of the volume of data stored, the type of data
stored, and the population that’s going to analyse it.
Distributed storage is an attempt to offer the advantages of centralized storage with the scala-
bility and cost base of local storage. A distributed object store is made up of many individual object stores, normally consisting of one or a small number of physical disks. These object stores run on commodity server hardware,
which might be the compute nodes or might be separate servers configured solely for pro-
viding storage services.
Data mining is the process of discovering patterns in large data sets involving methods at the
intersection of machine learning, statistics, and database systems. Data mining is the analysis
step of the ”knowledge discovery in databases” process, or KDD. Aside from the raw analy-
sis step, it also involves database and data management aspects, data pre-processing, model
and inference considerations, interestingness metrics, complexity considerations, post-processing
of discovered structures, visualisation, and online updating.
Remote sensing is the process of detecting and monitoring the physical characteristics of an
area by measuring its reflected and emitted radiation at a distance from the targeted area.
It may be split into ”active” remote sensing (such as when a signal is emitted by a satellite
or aircraft and its reflection by the object is detected by the sensor) and ”passive” remote
sensing (such as when the reflection of sunlight is detected by the sensor). We mostly deal
with passive remote sensing in our project.
Collection of data to a central server from multiple sensors requires some communication medium to transfer the collected data through a public network. During such transmissions, data security and integrity are of primary concern. Network security is the practice of preventing and protecting against unauthorised intrusion into corporate networks. It complements endpoint security, which focuses on individual devices; network security instead focuses on how those devices interact, and on the connective tissue between them. Security can be accomplished at various layers of the communication protocols, such as the application layer, TCP layer, IP layer, SSL layer, etc.
AI or artificial intelligence is the simulation of human intelligence processes by machines, es-
pecially computer systems. These processes include learning, reasoning and self-correction.
Some of the applications of AI include expert systems, speech recognition and machine
vision. Machine learning, a powerful technology of AI, provides algorithms, APIs (Application Programming Interfaces), development and training tool-kits, data, as well as computing power to design, train, and deploy models into applications, processes, and other machines.
3.2. D’Agostino’s K² Test
D'Agostino's K² test, named for Ralph D'Agostino, is a goodness-of-fit measure of departure from normality; that is, the test aims to establish whether or not the given sample comes from a normally distributed population. The test is based on transformations of the sample kurtosis and skewness, and has power only against the alternatives that the distribution is skewed and/or kurtic.
Skewness and kurtosis

In the following, $\{x_i\}$ denotes a sample of $n$ observations, $g_1$ and $g_2$ are the sample skewness and kurtosis, $m_j$ is the $j$-th sample central moment, and $\bar{x}$ is the sample mean. Frequently in the literature related to normality testing, the skewness and kurtosis are denoted as $\sqrt{\beta_1}$ and $\beta_2$ respectively.

The sample skewness and kurtosis are defined as:

$$g_1 = \frac{m_3}{m_2^{3/2}} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^3}{\left[\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2\right]^{3/2}} \qquad (3.1)$$

$$g_2 = \frac{m_4}{m_2^{2}} - 3 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^4}{\left[\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2\right]^{2}} - 3 \qquad (3.2)$$
These quantities consistently estimate the theoretical skewness and kurtosis of the distribution, respectively. Moreover, if the sample indeed comes from a normal population, then the exact finite sample distributions of the skewness and kurtosis can themselves be analysed in terms of their means $\mu_1$, variances $\mu_2$, skewnesses $\gamma_1$, and kurtoses $\gamma_2$. This has been done by Pearson [17], who derived the following expressions:

$$\mu_1(g_1) = 0 \qquad (3.3)$$

$$\mu_2(g_1) = \frac{6(n-2)}{(n+1)(n+3)} \qquad (3.4)$$

$$\gamma_1(g_1) \equiv \frac{\mu_3(g_1)}{\mu_2(g_1)^{3/2}} = 0 \qquad (3.5)$$

$$\gamma_2(g_1) \equiv \frac{\mu_4(g_1)}{\mu_2(g_1)^{2}} - 3 = \frac{36(n-7)(n^2+2n-5)}{(n-2)(n+5)(n+7)(n+9)} \qquad (3.6)$$

and

$$\mu_1(g_2) = -\frac{6}{n+1} \qquad (3.7)$$

$$\mu_2(g_2) = \frac{24n(n-2)(n-3)}{(n+1)^2(n+3)(n+5)} \qquad (3.8)$$

$$\gamma_1(g_2) \equiv \frac{\mu_3(g_2)}{\mu_2(g_2)^{3/2}} = \frac{6(n^2-5n+2)}{(n+7)(n+9)}\sqrt{\frac{6(n+3)(n+5)}{n(n-2)(n-3)}} \qquad (3.9)$$

$$\gamma_2(g_2) \equiv \frac{\mu_4(g_2)}{\mu_2(g_2)^{2}} - 3 = \frac{36\,(15n^6 - 36n^5 - 628n^4 + 982n^3 + 5777n^2 - 6402n + 900)}{n(n-3)(n-2)(n+7)(n+9)(n+11)(n+13)} \qquad (3.10)$$
Transformed sample skewness and kurtosis

The sample skewness $g_1$ and kurtosis $g_2$ are both asymptotically normal. However, the rate of their convergence to the distribution limit is frustratingly slow, especially for $g_2$. In order to remedy this situation, it has been suggested to transform the quantities $g_1$ and $g_2$ in a way that makes their distribution as close to standard normal as possible.

In particular, D'Agostino [18] suggested the following transformation for sample skewness:

$$Z_1(g_1) = \delta \, \operatorname{asinh}\!\left(\frac{g_1}{\alpha\sqrt{\mu_2}}\right) \qquad (3.11)$$

where the constants $\alpha$ and $\delta$ are computed as:

$$W^2 = \sqrt{2\gamma_2 + 4} - 1, \qquad (3.12)$$

$$\delta = 1/\sqrt{\ln W}, \qquad (3.13)$$

$$\alpha^2 = 2/(W^2 - 1), \qquad (3.14)$$

and where $\mu_2 = \mu_2(g_1)$ is the variance of $g_1$, and $\gamma_2 = \gamma_2(g_1)$ is the kurtosis. Similarly, Anscombe & Glynn [19] suggested a transformation for $g_2$, which works reasonably well for sample sizes of 20 or greater:

$$Z_2(g_2) = \sqrt{\frac{9A}{2}}\left\{1 - \frac{2}{9A} - \left(\frac{1 - 2/A}{1 + \frac{g_2 - \mu_1}{\sqrt{\mu_2}}\sqrt{2/(A-4)}}\right)^{1/3}\right\}, \qquad (3.15)$$

where

$$A = 6 + \frac{8}{\gamma_1}\left(\frac{2}{\gamma_1} + \sqrt{1 + 4/\gamma_1^{2}}\right), \qquad (3.16)$$

and $\mu_1 = \mu_1(g_2)$, $\mu_2 = \mu_2(g_2)$, $\gamma_1 = \gamma_1(g_2)$ are the quantities computed by Pearson.
3.3. Omnibus K² statistic

The statistics $Z_1$ and $Z_2$ can be combined to produce an omnibus test, able to detect deviations from normality due to either skewness or kurtosis:

$$K^2 = Z_1(g_1)^2 + Z_2(g_2)^2 \qquad (3.17)$$

If the null hypothesis of normality is true, then $K^2$ is approximately chi-squared distributed with 2 degrees of freedom.
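For illustration only (SciPy is not among the tools listed later in this report, and the temperature readings below are hypothetical), the sample skewness, kurtosis and the omnibus K² statistic can be computed as follows; scipy.stats.normaltest implements this D'Agostino–Pearson omnibus test and compares K² against a chi-squared distribution with 2 degrees of freedom.

import numpy as np
from scipy import stats

# Hypothetical temperature readings (degrees Celsius)
rng = np.random.default_rng(0)
temperature = 20 + 5 * rng.standard_normal(500)

g1 = stats.skew(temperature)                 # sample skewness g1, Eq. (3.1)
g2 = stats.kurtosis(temperature)             # sample excess kurtosis g2, Eq. (3.2)
k2, p_value = stats.normaltest(temperature)  # omnibus K^2 = Z1^2 + Z2^2, Eq. (3.17)

print(f"g1 = {g1:.3f}, g2 = {g2:.3f}, K2 = {k2:.3f}, p-value = {p_value:.3f}")
# A small p-value (e.g. below 0.05) would reject the null hypothesis of normality.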
3.4. Dickey-Fuller Test

The Dickey–Fuller test's null hypothesis is that a unit root is present in an autoregressive model. The alternative hypothesis is different depending on which version of the test is used, but is usually stationarity or trend-stationarity. It is named after the statisticians David Dickey and Wayne Fuller, who developed the test in 1979[20]. A simple AR(1) model is

$$y_t = \rho y_{t-1} + u_t \qquad (3.18)$$

where $y_t$ is the variable of interest, $t$ is the time index, $\rho$ is a coefficient, and $u_t$ is the error term. A unit root is present if $\rho = 1$; the model would be non-stationary in this case. The regression model can be written as

$$\Delta y_t = (\rho - 1)\,y_{t-1} + u_t = \delta y_{t-1} + u_t \qquad (3.19)$$

where $\Delta$ is the first difference operator. This model can be estimated, and testing for a unit root is equivalent to testing $\delta = 0$ (where $\delta \equiv \rho - 1$).
Since the test is done over the residual term rather than the raw data, it is not possible to use the standard t-distribution to provide critical values. Therefore, this statistic $t$ has a specific distribution simply known as the Dickey–Fuller table.

There are three main versions of the test:

1. Test for a unit root:
$$\Delta y_t = \delta y_{t-1} + u_t \qquad (3.20)$$

2. Test for a unit root with drift:
$$\Delta y_t = a_0 + \delta y_{t-1} + u_t \qquad (3.21)$$

3. Test for a unit root with drift and deterministic time trend:
$$\Delta y_t = a_0 + a_1 t + \delta y_{t-1} + u_t \qquad (3.22)$$

Each version of the test has its own critical value, which depends on the size of the sample. In each case, the null hypothesis is that there is a unit root, $\delta = 0$. The tests have low statistical power in that they often cannot distinguish between true unit-root processes ($\delta = 0$) and near unit-root processes ($\delta$ close to zero). This is called the "near observation equivalence" problem.

The intuition behind the test is as follows. If the series $y$ is stationary (or trend-stationary), then it has a tendency to return to a constant (or deterministically trending) mean. Therefore, large values will tend to be followed by smaller values (negative changes), and small values by larger values (positive changes). Accordingly, the level of the series will be a significant predictor of the next period's change, and will have a negative coefficient.

If, on the other hand, the series is integrated, then positive changes and negative changes will occur with probabilities that do not depend on the current level of the series; in a random walk, where you are now does not affect which way you will go next.
It is notable that

$$y_t = a_0 + y_{t-1} + u_t \qquad (3.23)$$

may be rewritten as

$$y_t = y_0 + \sum_{i=1}^{t} u_i + a_0 t \qquad (3.24)$$

with a deterministic trend coming from $a_0 t$ and a stochastic intercept term coming from $y_0 + \sum_{i=1}^{t} u_i$, resulting in what is referred to as a stochastic trend [21].
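As a minimal sketch (statsmodels is not listed among the project's tools, and the random-walk series below is hypothetical), the closely related augmented Dickey–Fuller test can be run as follows, with regression="c" corresponding to the drift version of the test above.

import numpy as np
from statsmodels.tsa.stattools import adfuller

# Hypothetical non-stationary series: a pure random walk (unit root present)
rng = np.random.default_rng(1)
series = np.cumsum(rng.standard_normal(300))

stat, p_value, used_lags, n_obs, critical_values, _ = adfuller(series, regression="c")

print(f"ADF statistic = {stat:.3f}, p-value = {p_value:.3f}")
for level, threshold in critical_values.items():
    print(f"  critical value at {level}: {threshold:.3f}")
# A p-value above 0.05 means the unit-root null hypothesis cannot be rejected,
# i.e. the series is treated as non-stationary.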
3.5. Cluster Analysis
Cluster analysis, also known as unsupervised classification, groups data objects based only
on information found in the data that describes the objects and their relationships. The goal
is that the objects within a group be similar (or related) to one another and different from
(or unrelated to) the objects in other groups. The greater the similarity (or homogeneity)
within a group and the greater the difference between groups, the better or more distinct the
clustering.
Cluster analysis provides an abstraction from individual data objects to the clusters in which
those data objects reside. Additionally, some clustering techniques characterize each cluster
in terms of a cluster prototype; i.e., a data object that is representative of the other objects in
the cluster. These cluster prototypes can be used as the basis for a number of data analysis
or data processing techniques.
3.5.1. K-means clustering
This is a prototype-based, partitional clustering technique that attempts to find a user-specified
number of clusters (K), which are represented by their centroids. K-means defines a proto-
type in terms of a centroid, which is usually the mean of a group of points, and is typically
applied to objects in a continuous n-dimensional space.
We first choose K initial centroids, where K is a user specified parameter, namely, the number
of clusters desired. Each point is then assigned to the closest centroid, and each collection of
points assigned to a centroid is a cluster. The centroid of each cluster is then updated based
on the points assigned to the cluster.
We repeat the assignment and update steps until no point changes clusters, or equivalently,
until the centroids remain the same.
Basic K-means algorithm:
Select K points as initial centroids.
repeat
    Form K clusters by assigning each point to its closest centroid.
    Recompute the centroid of each cluster.
until centroids do not change.
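As a sketch of the same assign-then-update loop (scikit-learn is not among the project's listed tools, and the two-dimensional points below are hypothetical):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical two-dimensional points drawn around three centres
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc, 1.0, size=(100, 2)) for loc in (0.0, 5.0, 10.0)])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("Centroids:", kmeans.cluster_centers_.round(2))
print("First ten cluster labels:", kmeans.labels_[:10])
# inertia_ is the within-cluster sum of squared distances to the closest centroid
print("Within-cluster sum of squares:", round(kmeans.inertia_, 2))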
3.5.2. X-means clustering
In statistics and data mining, X-means clustering is a variation of K-means clustering that refines cluster assignments by repeatedly attempting subdivision, and keeping the best resulting splits, until some criterion is reached. The Bayesian information criterion (BIC) is used to make the splitting decision[22]. X-means clustering is thus an extension of K-means clustering that calculates the optimum value of K and performs the clustering. The outline of the algorithm is:
Perform 2-means. This gives us a clustering C.
Evaluate the relevance of the classification C with a BIC criterion.
Iterate steps one and two in each cell of C. Keep going until there is no more relevant discrimination.
Bayesian information criterion
In statistics, the Bayesian information criterion (BIC) or Schwarz information criterion (also SIC, SBC, SBIC) is a criterion for model selection among a finite set of models. Comparing models with the Bayesian information criterion simply involves calculating the BIC for each model; the model with the lowest BIC is considered the best. The BIC is defined as:

$$\mathrm{BIC} = -2\ln(L) + k\ln(N) \qquad (3.25)$$

where $L$ is the maximised likelihood of the model, $k$ is the number of model parameters and $N$ is the number of data points.
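X-means itself is not implemented in the libraries used in this project, but the BIC-driven choice between competing cluster counts that it relies on can be sketched with a Gaussian mixture, whose bic() method evaluates the criterion above; the data and the range of candidate K below are hypothetical.

import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical data generated around three centres
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc, 1.0, size=(150, 2)) for loc in (0.0, 6.0, 12.0)])

# Fit mixtures with an increasing number of components and keep the lowest BIC
bic_scores = {}
for k in range(1, 7):
    model = GaussianMixture(n_components=k, random_state=0).fit(X)
    bic_scores[k] = model.bic(X)

best_k = min(bic_scores, key=bic_scores.get)
print("BIC per candidate k:", {k: round(v, 1) for k, v in bic_scores.items()})
print("Number of clusters with the lowest BIC:", best_k)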
3.6. Outlier Detection
In outlier detection, the goal is to find objects that are different from most other objects. Often, anomalous objects are known as outliers since, on a scatter plot of the data, they lie far away from other data points (see Figure 3.1). Outlier detection is also known as deviation detection, because outliers have attribute values that deviate significantly from the expected or typical attribute values, or as exception mining, because outliers are exceptional in some sense.
Figure 3.1: An example of outlier detection
An outlier can be defined as a data point in an observed dataset which is so different from the rest of the data samples that it arouses suspicion[23]. Some common causes of outliers are: data from different classes, natural variation, and data measurement or collection errors. Various methods can be used to detect outliers:
Model-Based Techniques:
Many anomaly detection techniques first build a model of the data. Anomalies are ob-
jects that do not fit the model very well.Because anomalous and normal objects can be
viewed as defining two distinct classes, classification techniques can be used for build-
ing models of these two classes. However, classification techniques can only be used
if class labels are available for some of the objects so that a training set can be con-
structed. Also, outliers are relatively rare, and this needs to be taken into account when
choosing both a classification technique and the measures to be used for evaluation.
Proximity-Based Techniques:
It is often possible to define a proximity measure between objects, and a number of
outlier detection approaches are based on proximities. Outliers are those that are dis-
tant from most of the other objects. Many of the techniques in this area are based on
distances and are referred to as distance-based outlier detection techniques. When
the data can be displayed as a two- or three-dimensional scatter plot, distance based
outliers can be detected visually, by looking for points that are separated from most
other points.
Density-Based Techniques:
Estimates of the density of objects are relatively straightforward to compute, especially
if a proximity measure between objects is available. Objects that are in regions of low
density are relatively distant from their neighbours, and can be considered outliers. A
more sophisticated approach accommodates the fact that data sets can have regions
of widely differing densities, and classifies a point as an outlier only if it has a local
density significantly less than that of most of its neighbours.
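As an illustrative sketch of this density-based idea (not the exact method used in the project), scikit-learn's LocalOutlierFactor can flag points whose local density is much lower than that of their neighbours; the synthetic readings and parameter choices below are assumptions.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
normal_readings = rng.normal(20, 2, size=(200, 1))   # plausible temperature-like values
suspects = np.array([[55.0], [-10.0]])               # far from the dense region
X = np.vstack([normal_readings, suspects])

lof = LocalOutlierFactor(n_neighbors=20)
flags = lof.fit_predict(X)                           # -1 marks points in low-density regions
print(X[flags == -1].ravel())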
3.7. Predictive Models
Predictive analytics refers to using historical data, machine learning, and artificial intelligence
to predict what will happen in the future. This historical data is fed into a mathematical model
that considers key trends and patterns in the data. The model is then applied to current data to
predict what will happen next. In mathematical modelling, a real-world problem is described
using mathematics. The real-world problem is often vague. The first step is to recognise real
world problem then collect and plot data. Statistics and curve fitting tools can be used to
explore relationships among data. Linear and nonlinear regression models, classification,
clustering, and surface fitting tools can also be used. Dynamic models, which express the
effect of a system's past experiences on its current and future behaviour, can be built
using neural networks and system identification techniques.
The mathematical model achieved can be validated using collected data. If proper output is
not observed, mathematical models should be revisited. The final model is used to predict
future climate patterns. Finally, data achieved from different sources such as vehicles, drones
and stationary stations are compared.
3.7.1. ARIMA and seasonal ARIMA models
Auto-regressive integrated moving average (ARIMA) model has been used for predicting the
future values of the weather data for forecasting. ARIMA models are defined by three pa-
rameters namely p, d and q thus denoting the model as ARIMA(p,d,q) where the parameters
p, d and q are non-negative integers with respective descriptions as below:
p : Lag order, i.e. the number of lagged observations included in the model.
d : Degree of differencing, i.e. the number of times the raw observations are differ-
enced
q : Order of moving average, i.e. the number of lagged forecast errors in the prediction
equation
Likewise, seasonal ARIMA models have additional parameters denoting the model as SARIMA
(p,d,q)x(P,D,Q)m where m refers to the number of periods in each season, and uppercase P,
D, Q denote the autoregressive, differencing, and moving average terms for the seasonal part
of the ARIMA model. The best model for forecasting can be found based on the value of
goodness of fit criteria like AIC(Akaike information criterion), BIC(Bayesian information
criterion), etc., which are widely used measures of statistical models. The Akaike information
criterion (AIC) is a technique based on in-sample fit that estimates the relative likelihood of
a model to predict future values. The following equation is used to estimate the AIC of a
model:
AIC = -2 ln(L) + 2k    (3.26)
where L is the value of the likelihood, N is the number of recorded measurements, and k is
the number of estimated parameters.
The general mathematical formula for an ARIMA(p,d,q) model is:
y_t = µ + φ_1 y_{t-1} + ... + φ_p y_{t-p} - θ_1 e_{t-1} - ... - θ_q e_{t-q}    (3.27)
where y denotes the d-th difference of Y, which means:
if d = 0 :  y_t = Y_t    (3.28)
if d = 1 :  y_t = Y_t - Y_{t-1}    (3.29)
if d = 2 :  y_t = (Y_t - Y_{t-1}) - (Y_{t-1} - Y_{t-2}) = Y_t - 2Y_{t-1} + Y_{t-2}    (3.30)
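A hedged sketch of fitting such a model with statsmodels' SARIMAX follows; the synthetic series, the chosen orders (1,1,1)x(1,1,1,7) and the forecast horizon are assumptions for illustration, not the report's final model.

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic daily temperature-like series (placeholder data).
idx = pd.date_range("2018-01-01", periods=730, freq="D")
values = 18 + 8 * np.sin(2 * np.pi * idx.dayofyear / 365) + np.random.normal(0, 1, len(idx))
y = pd.Series(values, index=idx)

model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 7))
result = model.fit(disp=False)
print(result.aic, result.bic)          # goodness-of-fit criteria used to compare candidate models
print(result.forecast(steps=30))       # 30-step-ahead prediction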
3.7.2. Auto correlations
Auto-correlation Function (ACF) refers to how much correlation exists between a time series
and its past values. So ACF is the plot that is used to see the correlation that exists between
the points in the data, up to and including the lag unit. The correlation coefficient is plotted on
the y-axis against the number of lags on the x-axis. It can be used to determine the AR
coefficient required for building an auto-regressive model for a data. In an auto-regressive
model, for determining the value of p i.e. the AR coefficient suitable for the data, the auto-
correlation plot of the data is required. The value of lag at which the auto-correlation is
found to be significant is used as the AR coefficient of the model(see fig 7.20).
When computing auto-correlation (see equation 3.31), the resulting output can range from
+1 to -1, in line with the traditional correlation statistic. An auto-correlation of +1
represents a perfect positive correlation (an increase seen in one time series leads to a pro-
portionate increase in the other time series). An auto-correlation of negative 1, on the other
hand, represents perfect negative correlation (an increase seen in one time series results in a
proportionate decrease in the other time series). Auto-correlation measures linear relation-
ships; even if the auto-correlation is minuscule, there may still be a nonlinear relationship
between a time series and a lagged version of itself.
ρ_k = γ_k / γ_0 = (covariance at lag k) / variance = Σ(Y_t - Ȳ)(Y_{t+k} - Ȳ) / Σ(Y_t - Ȳ)²    (3.31)
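As a hedged illustration, the sample ACF in equation 3.31 can be computed with statsmodels; the sinusoidal series and the approximate significance band below are assumptions for demonstration.

import numpy as np
from statsmodels.tsa.stattools import acf

y = np.sin(np.linspace(0, 20 * np.pi, 500)) + np.random.normal(0, 0.2, 500)
rho = acf(y, nlags=40)                               # rho[k] is equation 3.31 evaluated at lag k
band = 2 / np.sqrt(len(y))                           # rough significance band
significant_lags = np.where(np.abs(rho[1:]) > band)[0] + 1
print(rho[:5], significant_lags[:10])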
3.7.3. Time series modelling using regression
Another approach in forecasting the meteorological time series involves fitting regression
models (RM) to time series including trend and seasonality components. The RM models
are originally based on linear modelling, but they also allow parameters such as trend and
season to be added to the data. The recurrent neural network (RNN) architecture is a natural
generalisation of feed-forward neural networks to sequences; RNNs are networks with loops
in them, which results in information persistence. Neural networks like recurrent neural
networks are able to almost seamlessly model problems with multiple input variables.
This is a great benefit in time series forecasting, where classical linear methods can be difficult
to adapt to multivariate or multiple input forecasting problems. Recurrent neural networks can
remember the state of an input from previous time-steps, which helps them take a decision for
the future time-step. RNNs are good at extracting patterns in the input feature space when the
input data spans long sequences[24]. Given a sequence of inputs (x1, x2, ..., xN), a standard
RNN computes a sequence of outputs (y1, y2, ..., yN) by iterating over the following equations:
h_t = sigm(W_hx x_t + W_hh h_{t-1})    (3.32)
y_t = W_yh h_t    (3.33)
Figure 3.2: A typical RNN cell. The images on the right are the same layer unrolled in time
where the outputs are fed back into the hidden layer
Figure 3.3: Overview of feature extraction model and forecast model using LSTM
architecture
LSTM
Long short-term memory is a gated memory unit for neural networks. It has 3 gates that
manage the contents of the memory. These gates are simple logistic functions of weighted
sums, where the weights might be learnt by back propagation. The input gate and the forget
gate manage the cell state, which is the long-term memory. The output gate produces the
output vector or hidden state, which is the memory focused for use (see fig 3.4). This memory
system enables the network to remember for a long time, which was badly missing from
vanilla recurrent neural networks.
i_t = sigm(W_i x_t + U_i h_{t-1} + b_i)    (3.34)
f_t = sigm(W_f x_t + U_f h_{t-1} + b_f)    (3.35)
o_t = sigm(W_o x_t + U_o h_{t-1} + b_o)    (3.36)
c_t = f_t ∘ c_{t-1} + i_t ∘ tanh(W_c x_t + U_c h_{t-1} + b_c)    (3.37)
h_t = o_t ∘ tanh(c_t)    (3.38)
where ∘ denotes element-wise multiplication.
Figure 3.4: LSTM architecture that takes in two inputs, output from the last hidden state
and observation at time = t. Besides the hidden state, there is no information about the past
to remember.
The RNN can map sequences to sequences whenever the alignment between the inputs and
the outputs is known ahead of time. When making a forecast, time series data is first provided
to the auto-encoders, which is compressed to multiple feature vectors that are averaged and
concatenated. The feature vectors are then provided as input to the forecast model in order
to make a prediction. A multivariate time series as input to the auto-encoder will result in
multiple encoded vectors (one for each series) that could be concatenated. The new input
can be concatenated to resulting vectors of encoders for forecasting (see fig 3.3).
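A minimal, hedged sketch of LSTM-based one-step forecasting with Keras is shown below; the window length, layer sizes and the synthetic sine series are assumptions for illustration, not the feature-extraction architecture of figure 3.3.

import numpy as np
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.models import Sequential

series = np.sin(np.linspace(0, 50, 1000))            # placeholder univariate series
window = 24

# Build (samples, time-steps, features) windows and next-step targets.
X = np.stack([series[i:i + window] for i in range(len(series) - window)])[..., None]
y = series[window:]

model = Sequential([LSTM(32, input_shape=(window, 1)), Dense(1)])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

print(float(model.predict(X[-1:], verbose=0)[0, 0])) # one-step-ahead forecast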
4. METHODOLOGY
4.1. Software Development Approach
The developed system, being huge and dynamic in nature, required efficient and timely
development, which could not be achieved with traditional development approaches like
waterfall. Thus, to meet the requirements of the system while ensuring timely delivery and
adaptability to changing requirements, design thinking combined with Scrum methodology
under the agile development method has been chosen for the development of the system. Since
design thinking and scrum are complementary agile approaches, their combination can prove
to be fruitful for projects where ideation is done during the design thinking and iterative
building through scrum.
Figure 4.1: Design thinking with scrum software development cycle
In Design Thinking, the phase of planning what is to be designed is broken down into three
stages: Empathise, Define and Ideate.
In Design Thinking,
The first empathy stage is spent to gain deeper insights on the users’ wants and needs
and thus the system’s requirements.
In the second define stage, the findings from the empathy phase are externalised and
then the problem set is re-examined. Thus the problem is re-framed with a problem
statement.
The third phase is an ideation stage, where brainstorming is done after understanding
the requirements and defining the problem statement in earlier stages.
Scrum is an agile way to manage a project. Agile software development with Scrum is a
framework for managing processes. In the Scrum methodology, instead of providing complete,
detailed descriptions of how everything is to be done on a project, much of it is left up to the
Scrum software development team, since the team will know best how the problem they are
presented with can be solved. It relies on a self-organizing, cross-functional team.
The scrum team is self-organizing in that there is no overall team leader who decides which
person will do which task or how a problem will be solved. This issue is decided by the team
as a whole. Besides, since the team is cross-functional, everyone is needed to take a feature
from idea to implementation.
This model suggests that projects progress via a series of sprints. In keeping with an agile
methodology, sprints are time boxed to no more than a month long, most commonly two
weeks. A planning meeting is held at the start of the sprint, where the team members figure
out how many items they can commit to and thus create a sprint backlog, which is a list of
tasks to be performed during the sprint. During a Scrum sprint, the Scrum team selects the
set of features and transforms it from an idea into coded form and then into tested
functionality. Thus, at the end of the sprint, the features are coded, tested and integrated into
an evolving system.
On each day within the sprint, Scrum meetings are held where the members share the work
performed on the prior day, the work they will perform on that day, and any hindrances or
impediments that may come in the way of progress. These routine scrums are a way to
synchronize and update the work the team performs.
At the end of a sprint, sprint review is held, during which new functionalities are demon-
strated to the stakeholders who then provide feedback and it is implemented in the next
sprint. The feedback loop within Scrum software development may result in changes to the
freshly delivered functionality, but it may just as likely result in revising or adding items to
the product backlog.
The design thinking stage in this project involved the empathy phase where the system’s re-
quirement was understood properly. The next phase was the define phase where the problem
was reframed and the right problem was defined and finally in the ideation phase, brain-
storming was done where several ideas were generated with different radical ways to meet
the need of the system. During the design thinking, the team explored the possibilities of
various technologies that could be implemented for the solving of the problem statement.
Figure 4.2: Scrum software development cycle
The design thinking was followed by scrum approach for iterative building of the project.
Small repeating cycles with short-term planning and constant feedback, inspection and adap-
tion have been used for the development of the system. The Scrum meeting was conducted
regularly on a weekly basis, and hence the progress in the project was identified. The progress
of the work was discussed in the meetings and further work to be performed in the sprint
was also discussed. Tasks were assigned to be completed in the sprint and the features to be
implemented in each sprint were prioritised. Whenever some bug was found relating to a
feature, it was dealt with immediately before marking the feature complete.
Each meeting looked for answers such as what the members had done since the last meeting,
what the problems were and what the members would do until the next meeting. This cycle
was repeated until the system was completely developed.
Periodic presentations were held by the team for periodic progress review and further plan-
ning and discussion of the project’s development and implementation. Occasionally meet-
ings were held between the team and concerned researchers and experts to discuss the further
possibilities of enhancements in the system.
4.2. System Block Diagram
The system methodology can be divided into two parts. First part consists of data retrieval
and data visualisation and the second part consists of classification, clustering and modelling
past data and predicting future climate patterns.
Figure 4.3: System block diagram
Figure 4.4: Project methodology
Figure 4.5: Visualisation work-flow
4.3. Data Collection and Visualisation
The overall system can be broadly divided into two parts. First part consists of data retrieval
and data visualisation whereas the second part consists of modelling past data and predicting
future climate patterns (see Fig. 4.6). Data collection consists of a series of steps which are
briefly described below:
Figure 4.6: Overall methodology from remote data collection to visualisation
4.3.1. Data retrieval
Data is collected with the help of mobile sensors fitted on vehicles and high-altitude drones.
Sensor instruments such as air temperature sensor, air pressure sensor, humidity sensor, solar
radiation sensor, rain gauge, air particulate sensor, air pollutant sensor along with other hard-
ware components will be used for data collection. The sensor network in vehicles gathers these
data at periodic intervals of time from various places by traversing its route. Data loggers can
be used to record data over time or in relation to location. Data are then sent to the GSM
base stations and further forwarded to the remote database server. Devices’ geographical
position is obtained using a GPS module. Commercial drones also can be used in similar
fashion for attaching external sensors and relaying data. In addition to vehicular networks,
environmental data can also be collected from stationary weather stations.
Figure 4.7: Overview of data collected from our sensors in logarithmic scale
4.3.2. Data cleaning
Obtained data is checked for quality. Data is checked for duplicates, errors, highly correlated
attributes and missing fields, and cleansing operations are performed. Other operations on
data are also performed, like sampling to reduce the complexity of rendering data in the front-end
of the web application. Also, basic visualisations can be performed during the pre-processing
steps to get familiarised with the nature of data.
Auto data cleaning methods are implemented that cover steps from discarding missing, null and
duplicate data to checking for auto-correlation of various attributes. Use of physical remote sensors
adds the probability of failure in accurate data collection. Simple cases like battery failure in
a sensor hinder the data collection process, and technical failures like being unable to connect to
the GSM network for a long time might result in a full buffer memory in the sensors.
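A minimal pandas sketch of such automatic cleaning is shown below; the file name and column names (sensor_log.csv, date_time, temperature) are hypothetical, and the plausibility bounds are assumptions.

import pandas as pd

# Hypothetical file and column names.
df = pd.read_csv("sensor_log.csv", parse_dates=["date_time"])

df = df.drop_duplicates()                        # discard duplicated records
df = df.dropna(subset=["temperature"])           # drop rows missing the key reading
df = df[df["temperature"].between(-30, 50)]      # discard physically implausible values
df = df.sort_values("date_time").reset_index(drop=True)

print(df.corr(numeric_only=True))                # inspect correlations between attributes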
4.3.3. Data aggregation
To aggregate structured data from one or more sources so that it can be compared and anal-
ysed for greater data intelligence, Big Data technology has been used to store collected data
to give a long-range view of data over time and to analyse data from multiple sources. Big
data is the 21st-century phenomenon of exponential growth of business data, and the challenges
that come with it, characterised by at least one, but usually all, of the following characteristics:
massive volume, high velocity (rate of change), and widely varied type[25].
4.3.4. Data security
Data security refers to protective digital privacy measures that are applied to prevent unau-
thorised access to computers, databases and websites. To prevent unauthorised access of
sensor data over public network passed in the form of web services, encryption of data can
be done.
The use of encryption for remote data collection through the GSM-based network added a
security layer over the Big Data analytic process, making it more secure, since the data cannot
be forged due to the network architecture, and more valuable, since it remains structured,
abundant and complete, making it a good source for further analysis. ECC (Elliptic Curve
Cryptography) [26] has been implemented on the sensors to encrypt the data while sending it
to our remote server. ECC has been preferred for our system due to its smaller keys, ciphertexts
and signatures: for the same level of security, especially at high security levels, smaller keys can
be used. ECC also offers other advantages such as very fast key generation and fast signatures.
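As a hedged sketch of how such ECC-based protection can work, the snippet below performs an ECDH key agreement and then encrypts a payload with AES-GCM using the Python cryptography package; the curve, key-derivation parameters and payload are illustrative assumptions, not the exact scheme running on the loggers.

import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

server_key = ec.generate_private_key(ec.SECP256R1())   # long-term server key pair
sensor_key = ec.generate_private_key(ec.SECP256R1())   # ephemeral key pair on the sensor

# Both sides can derive the same symmetric key from the ECDH shared secret.
shared = sensor_key.exchange(ec.ECDH(), server_key.public_key())
aes_key = HKDF(algorithm=hashes.SHA256(), length=32, salt=None, info=b"sensor-data").derive(shared)

nonce = os.urandom(12)
payload = b'{"temp": 18.2, "co2": 439}'
ciphertext = AESGCM(aes_key).encrypt(nonce, payload, None)    # sent over the GSM link
assert AESGCM(aes_key).decrypt(nonce, ciphertext, None) == payload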
4.3.5. Data endpoints
An adapter, a layer of software that converts the data from an application into a common form
acceptable for integration with another application, is used to clone data from the remote database
server to a local database server. The database server contains a database designed for proper
storage of data in its best-suited form. An application programming interface (API) is a set
of subroutine definitions, protocols, and tools for building application software. Data API
endpoints are made available from the servers (see fig 4.8). Use of an API for extracting data
from the local server boosts the flow of data.
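A minimal Django REST framework sketch of such an endpoint follows; the model and field names (Reading, date_time, pm25, ...) are hypothetical placeholders rather than the project's actual schema.

from rest_framework import routers, serializers, viewsets
from myapp.models import Reading                 # hypothetical model holding sensor readings

class ReadingSerializer(serializers.ModelSerializer):
    class Meta:
        model = Reading
        fields = ["id", "date_time", "latitude", "longitude", "temperature", "pm25"]

class ReadingViewSet(viewsets.ReadOnlyModelViewSet):
    queryset = Reading.objects.order_by("-date_time")
    serializer_class = ReadingSerializer

router = routers.DefaultRouter()
router.register(r"readings", ReadingViewSet)      # exposes /readings/ and /readings/<id>/
# In urls.py: urlpatterns = [path("api/", include(router.urls))]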
Figure 4.8: Output from our API endpoint
4.3.6. Data visualisation
The obtained data needs to be first analysed by systematically applying statistical and/or
logical techniques to describe and illustrate, condense and recap, and evaluate data. Data
visualisation is the graphical representation of information and data. By using visual ele-
ments like charts, graphs, and maps, data visualisation tools provide an accessible way to see
and understand trends, outliers, and patterns in data. In the world of Big Data, data visual-
isation tools and technologies are essential to analyse massive amounts of information and
make data-driven decisions. Thus, data are represented in graphical forms which are to be
rendered using web application (see fig 4.9).
Since the nature of data in the project is spatio-temporal, visualisation is mainly based on
map-based visualisation. The parameters involved in the data are shown on the map. Various
visualisation methods like heat-maps, poly-lines, markers and circles are used on top of the map
to enhance the visualisation. An animation feature is provided on the map to give a short overview
of the data.
4.4. Data Analysis
Data analysis is the process of evaluating data using analytical and statistical tools to discover
useful information and aid in business decision making. Data analytical techniques can reveal
trends and metrics that would otherwise be lost in the mass of information. This information
can then be used to optimise processes to increase the overall efficiency of a system.
Figure 4.9: Integrated tool used for data visualisation and analysis.
To generate insights on varying environmental and pollution data, analysis is performed
to process enormous volumes of complex data, establish correlations when required and
provide near real-time mapping analytic. The data being collected in the near-real time on
the central database after automatic cleaning and aggregation can be accessed through API
endpoints. Various techniques can be applied to the data and the output can be accessed in a
similar fashion.
Data analysis provides general understanding of pattern associated with data parameters.
Some of the techniques used for the analysis of data used in the project are: Classification,
Clustering, Artificial Neural Network and Time Series Forecast Modelling.
4.4.1. Classification
Classification, which is the task of assigning objects to one of several predefined categories,
is a pervasive problem that encompasses many diverse applications. A classification model
can serve as an explanatory tool to distinguish between objects of different classes.
Figure 4.10: Classification for mapping an input attribute set x into its class label y.
Classification is the task of learning a target function f that maps each attribute set x to one of
the predefined class labels y (see Figure 4.10). The target function is also known informally
as a classification model. As classification algorithms, we have used the Decision Tree
Classifier, Random Forest Classifier, Gradient Boosted Trees Classifier and Support Vector
Machine Classifier.
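A hedged sketch of trying these families of classifiers with scikit-learn is given below; the synthetic data, default hyper-parameters and train/test split are assumptions for illustration only.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for clf in (DecisionTreeClassifier(), RandomForestClassifier(),
            GradientBoostingClassifier(), SVC()):
    score = clf.fit(X_train, y_train).score(X_test, y_test)
    print(type(clf).__name__, round(score, 3))    # accuracy on the held-out split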
4.4.2. Regression analysis
Regression analysis is a set of statistical methods used for the estimation of relationships
between a dependent variable and one or more independent variables. It can be utilised
to assess the strength of the relationship between variables and for modelling the future
relationship between them. Regression analysis includes several variations, such as linear,
multiple linear, and nonlinear. The most common models are simple linear and multiple
linear. Nonlinear regression analysis is commonly used for complicated data sets, as in
this project, in which the dependent and independent variables show a nonlinear relationship.
Simple linear regression is a model that assesses the relationship between a dependent vari-
able and one independent variable. The simple linear model is expressed using the equation
4.1. Multiple linear regression analysis is essentially similar to the simple linear model, with
the exception that multiple independent variables are used in the model. The mathematical
representation of multiple linear regression is given by equation 4.2
Y = a + bX + ε    (4.1)
Y = a + bX_1 + cX_2 + dX_3 + ε    (4.2)
Nonlinear regression is a form of regression analysis in which observational data are mod-
elled by a function which is a nonlinear combination of the model parameters and depends
on one or more independent variables. The data are fitted by a method of successive approx-
imations.
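As an illustration of fitting a nonlinear model by successive approximations, the sketch below uses SciPy's curve_fit on synthetic data; the exponential model form, starting values and noise level are assumptions for demonstration.

import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b, c):
    return a * np.exp(-b * x) + c                 # assumed nonlinear relationship

x = np.linspace(0, 4, 50)
y = 2.5 * np.exp(-1.3 * x) + 0.5 + np.random.normal(0, 0.05, x.size)

params, covariance = curve_fit(model, x, y, p0=(1.0, 1.0, 0.0))
print(params)                                     # fitted values of a, b and c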
4.4.3. Clustering
Cluster analysis is referred to as unsupervised Classification. Cluster analysis divides data
into groups (clusters) that are meaningful, useful, or both(see Figure 4.11). If meaningful
groups are the goal, then the clusters should capture the natural structure of the data. In some
cases, however, cluster analysis is only a useful starting point for other purposes, such as
data summarising. Cluster analysis is related to other techniques that are used to divide data
objects into groups. For instance, clustering can be regarded as a form of classification in that
it creates a labelling of objects with class (cluster) labels. However, it derives these labels
only from the data. Clustering algorithms like k-means clustering and x-means clustering
are used for data analysis purpose.
4.4.4. Artificial neural network
The study of artificial neural networks (ANN) was inspired by attempts to simulate biolog-
ical neural systems. Analogous to human brain structure, an ANN is composed of an in-
terconnected assembly of nodes and directed links (see Figure 4.12). ANNs are considered
nonlinear statistical data modelling tools where the complex relationships between inputs
and outputs are modelled or patterns are found.
Figure 4.11: Clustering of data-points
ANN can be of several types: Feed-forward Neural Network, Radial Basis Function Neural
Network, Multilayer Perceptron (MLP), Convolutional Neural Network (CNN), Recurrent
Neural Network (RNN), Long Short-Term Memory (LSTM), Modular Neural Network,
Deep Learning and Sequence-To-Sequence Models.
Figure 4.12: Basic structure of artificial neural network
4.4.5. Time series forecast modelling
A time series is a series of data points indexed in time order. Most commonly, a time series is
a sequence taken at successive equally spaced points in time. Time series analysis comprises
methods for analysing time series data in order to extract meaningful statistics and other
characteristics of the data. Time series forecasting is the use of a model to predict future
values based on previously observed values.
Figure 4.13: An example of time series forecasting
In time series forecasting, we try to predict a particular variable as a function of time(see
Figure 4.13). We assume a time series is made up of 3 components: Trend, seasonality and
Randomness. Trend can be said to give you an insight of whether your data has a pronounced
increasing or decreasing trend. Seasonality of the time series is a pattern in the data that
is repeating over a period of time. Randomness is the variation in the data that Trend and
Seasonality cannot explain. For the purpose of time series forecasting, ARIMA and Seasonal
ARIMA modelling have been used for the project.
5. SYSTEM DESIGN
5.1. Requirement Specification
The functional and non-functional requirements are:
5.1.1. Functional requirements
Functional requirements specify a function that a system or system component must be
able to perform. The functional requirement specifications of the project are mainly
categorised as user requirements, security requirements, and device requirements, each of which
is explained in detail below:
1. System Requirements:
Sensors Requirements: Detail Technical Specification
WIND SPEED SENSOR:
3 levels of sensors (primary and redundant at each of the 30 m and 50 m levels, and a
primary sensor at 20 m height)
Sensor: 3 cup rotor polycarbonate
Range: up to 75 m/sec
Accuracy: ±0.3m/sec (= 10m/sec)
Resolution: 0.8 m/s or better
Distance constant: 0.3 m/sec
Cup diameter (approx.): 60 mm or less
Power supply: 1.5-5V DC
Sensor Type: Hall Effect sensor (A3141) with 3 cup rotor
AIR TEMPERATURE SENSOR :
Range: -20 C to + 60 C
Accuracy: ±0.2 C
Radiation Shield: Non Aspirated Radiation
Resolution in degree: 0.1 C
Power supply: 1.5-5V DC
Material: Conducting epoxy casing
Sensor Type: DHT 11 Humidity &Temperature sensor
AIR PRESSURE SENSOR:
Sensor: Absolute Pressure Sensor
Range: 15 kPa – 115 kPa
Output: Analog (or Digital with SCM)
Resolution: Absolute Pressure in kPa = (Voltage x 21.79) + 10.55 typical
Accuracy: 1.5 kPa (15 mb) max.
Uncorrected offset (+/- 0.443 inches Hg)
Power Supply: 3 V to 35 V
Enclosure: Weather Proof
Sensor type: absolute pressure sensor BP-20
RELATIVE HUMIDITY SENSOR:
Sensor: Relative/Absolute Humidity
Range: 0 to 100 %
Accuracy: ±2% (0–90%)
Resolution: 0.7%
Radiation Shield: Non Aspirated Radiation Shield
Output: Analog (or Digital with SCM)
Power supply: 3 35 5 V DC
Sensor Type: DHT 11 Humidity &Temperature sensor
SOLAR RADIATION :
Sensor: Solar Radiation
Spectral response: 0.3 - 3 microns
Operating temperature: -10 to 50 °C
Shield: Weatherproof
Sensitivity/output: 0.1 m/mw/cm2
Range: 0 - 2 kW/m2
Wavelength: 0.3 – 2.9 µm
Resolution: 0.1 W/m2
Sensor type: High-stability silicon photovoltaic detector (blue enhanced).
RAIN GAUGE / VOLUME DISPLAY:
Range: 0 – 9999 mm
Resolution: 0.3 mm (if rain volume is less than 1000 mm), 1 mm (if rain volume is greater
than 1000 mm)
Sensor Type: Tipping Bucket Rain Gauge with Bounce-Free Reed Contact
Database Requirements: The database used in the system is PostgreSQL. The
minimum production requirements for PostgreSQL to be used are as follows:
(a) 64 bit CPU is required.
(b) 64bit Operating System is required.
(c) A minimum of 2 Gigabytes of memory is required.
(d) Dual CPU/Core is recommended.
(e) RAID 1 is recommended at the least.
Server Requirements:
Since Hadoop is run on the server, the minimum prerequisites required to run
Hadoop are the prerequisites for the server. The server to be used requires Ubuntu
in it. The requirements of the server are as follows:
(a) Hardware requirement: The machine requires a minimum of 4 GB RAM and a
minimum 60 GB hard disk for better performance, and 16 GB or more of RAM
for systems over 100,000 tags. A minimum of 3 GHz dual or quad core processors
is recommended. A minimum of 100 MB/1 GB Ethernet is required for networking.
Fast Solid State Drives (SSD) are required when using RAID drives. 64-bit versions
of Windows OS are required for all systems.
(b) Software requirement:
The server requires Ubuntu OS installed in it.
Java needs to be installed. It is recommended to install Oracle Java 8.
APIs and drivers:
An API used in the system is Django Rest API. The driver used in the system
includes psycopg2.
Django REST framework requires the following:
(a) Python version of 3.5+(3.5, 3.6, 3.7) is recommended.
(b) Django version of 2.0+(2.0, 2.1, 2.2) is recommended.
The following packages are optional:
coreapi (1.32.0+) - schema generation support
Markdown (3.0.0+) - Markdown support for the browsable API
Pygments (2.4.0+) - syntax highlighting for Markdown processing
django-filter (1.0.1+) - filtering support
django-guardian (1.1.1+) - object level permissions support
The psycopg2 implementation is supported by:
(a) Python version 2.7
(b) Python 3 versions from 3.4 to 3.7
(c) PostgreSQL server versions from 7.4 to 11
(d) PostgreSQL client library version from 9.1
If psycopg2 is not compiled as a static library, it also requires the libpq library at
runtime that is usually distributed in a libpq.so or libpq.dll file. psycopg2 requires
the library whose location needs to be provided if installed in a non-standard
location.
The supported, tested versions of Django-MySQL’s requirements are as follows:
(a) Python: 3.5+
(b) Django: 2.0+
(c) MySQL: 5.6+ / MariaDB: 10.0+
(d) mysqlclient: 1.3
Web browser:
The web browser needs to support JavaScript, CSS and AJAX. All major browsers
support AJAX. ECMAScript 2015 is partially supported in all these browsers. The
browsers supporting AJAX in desktop with their versions are as follows:
(a) Internet Explorer 10+(2012)
(b) Chrome 23+(2012)
(c) Safari 6+(2012)
(d) Firefox 21+(2013)
(e) Opera 15+(2013)
The browsers supporting AJAX in mobile devices with their versions are as fol-
lows:
(a) Stock browser on Android 4.0+
(b) Safari on iOS 7+
2. User Requirements:
The system shall develop a visualisation tool with the help of maps, charts, tables
to visualise the spatio-temporal weather and pollution data.
The system shall provide the users with an overview of the spatio-temporal data
with an animation feature.
The system shall provide the users with heat maps in order to view the highly
affected regions with regards to high value of pollution and weather parameters.
The system shall allow the users to view the data in different region according to
different parameters and in different dates.
The system shall provide special privilege for viewing the data.
5.1.2. Non-functional requirements
The non-functional requirements of the system can be summarised as follows:
Performance: The system shall provide quick, accurate and reliable results. Since the
system is near real-time, it should meet its time constraints while providing responses
to the users.
Capacity and Scalability: The system shall be able to scale with increase in data.
The system requires a big data architecture so that increase in number of sensors and
incoming data does not hinder its performance.
Availability: The system shall be available to user anytime whenever there is an Inter-
net connection.
Recovery: In case of malfunctioning or unavailability of server, the system should be
able to recover and prevent any data loss or redundancy.
Flexibility and Portability: System shall be accessible anytime from any locations.
5.2. Feasibility Assessment
Feasibility assessment is done to analyse the viability of project idea. Any project starts with
the analysis of problem statement and to determine if the project can effectively solve the
problem, feasibility assessment is done. On the basis of the analysis, the decision is taken whether
to proceed, postpone or cancel the project. Our project consists of various components. The
feasibility analysis of the project is required to know whether the project is scalable or not. Also,
the decision whether to drop the project or redesign it can also be taken. There are five areas
of feasibility: technical, economic, legal, operational and scheduling.
5.2.1. Operational feasibility
The operational feasibility analysis gives a description of how the system operates and
what resources the system requires for performing its designated task. It is also a measure
of how well a system solves the problems, and takes advantage of the opportunities identified
during scope definition and how it satisfies the requirements identified in the requirements
analysis phase of system development. The system runs in a website accessing data from
Database server. Sensors used to collect the data are running continuously.
Hence, additional data can be appended to the previous data. The initial volume of data
is expected to be low compared to the later situation. As the volume of data in the system increases,
we will be ready to switch to the previously tested big data architecture for the system. The
resources selected for the system are optimal in terms of cost as well as performance.
The system is designed to be operated in the browser environment, and hence eliminates the
difficulty of installation on every computer and other devices from which the user is trying to use
the system. The system can be operated with the resources of a personal computer, i.e. a browser,
and even low-end mobile phones capable of running web browsers and rendering JavaScript.
The project is developed as a website, which allows easy access to multiple users. The system
is carefully designed to be operable in most environments, hence
the project can be considered operationally feasible.
5.2.2. Technical feasibility
Technical Feasibility Assessment examines whether the proposed system can actually be de-
signed to solve the desired problems and requirements using available technologies in the
given problem domain. The system is said to be feasible technically if it can be deploy-
able, operable and manageable under current technological context of global market. Vari-
ous factors are associated with assessment of technical feasibility such as right selection of
technology type, use of standard technology and familiarity of project team members with
technology used.
The sensors used for the project are easily available in the market. For the accurate and
precise measurement of data from the sensors, the sensors are calibrated as well. For storage of data,
an open-source database system that can handle a range of workloads has been selected. Also, considering
the huge amount of data collected from sensors, the big data architecture implemented is also
tested for the system.
Data visualisation helps transform data such as trends and insights in a visual manner, which
can completely change the impact of the same information when presented numerically.
Visualisation tools are implemented by the use of combination of open-source components.
Also, the map based visualisation is focused considering the spatio-temporal nature of data.
For the clarification of map based visualisation, charts and tables are also selected to form
a complete visualisation for the system. Visualisation components used in the system are
easily available in the internet. Hence, the project can be considered technically feasible.
5.2.3. Economic feasibility
Economic Feasibility checks whether the cost required for complete system development is
feasible using the available resources at hand. It should be noted that the cost of resources
and overall cost of deployment of system should be kept minimum while operational and
maintenance cost for the system should be within the capacity of organisation. Since the
system is hosted on server easily available in the market, the map system used for the visu-
alisation is open source and the sensors implemented are optimum in cost and performance,
the system can be considered economically feasible for the development.
5.2.4. Legal feasibility
Legal Feasibility assessment checks the system for any conflicts with legal requirement,
regulations that are to be followed as per the standard maintained by the governing body.
As such, the system that is being developed must comply with all the legal boundaries such
as copyright, authorised use of licenses and others. This prevents any future conflicts
for the system and also provides a legal basis for the system in future if any other party tries to use
part of or the full system without the necessary permission and documents. Since the data obtained from
the social media of the user is consented to by the user and does not violate any other
obligation of law or privacy, the project can be considered legally feasible.
5.2.5. Scheduling feasibility
Any project is considered a failure if it is not completed on time. So, scheduling feasibility
estimates the time required for the system to be fully developed and whether that time is feasible or
not according to the current trend in the market. If the project takes a longer time to complete, it may
become outdated, or some other party may launch a similar system before our system is complete.
So, it is required to fix a deadline for any project, and the system should be out and operational
before the specified deadline. As the scheduling of the project is consistent with the available
time for the project, the project can be considered feasible in terms of scheduling.
5.3. Use Case Diagram
Figure 5.1: Use-case diagram for web visualisation
5.4. Activity Diagram
Figure 5.2: Activity diagram for web visualisation
5.5. Class Diagram for System
Figure 5.3: Class diagram for system
5.6. Class Diagram for Data
Figure 5.4: Class diagram for data
5.7. Database Schema
Figure 5.5: Database schema
5.8. Sequence Diagram
Figure 5.6: Sequence diagram
5.9. Communication Diagram
Figure 5.7: Communication diagram
5.10. Data Flow Diagram
Figure 5.8: Data flow diagram for web visualisation
5.11. Deployment Diagram
Figure 5.9: Deployment diagram for web visualisation
6. TOOLS AND TECHNOLOGIES
6.1. Python
Python is a widely used high-level programming language for general-purpose program-
ming. Python features a dynamic type system and automatic memory management and
supports multiple programming paradigms, including object-oriented, imperative, functional
programming, and procedural styles. It has a large and comprehensive standard library.
Python is one of those rare languages which can claim to be both simple and powerful. In
Python, it is quite surprising to see how easy it is to concentrate on the solution to the problem
rather than the syntax and structure of the language you are programming in.
6.2. Django
Django is a free and open-source web framework, written in Python, which follows the
model-view-template (MVT) architectural pattern. Django’s primary goal is to ease the cre-
ation of complex, database-driven websites. Django emphasizes reusability and "pluggability"
of components, rapid development, and the principle of don't repeat yourself. Python is
used throughout, even for settings files and data models. Django also provides an optional
administrative create, read, update and delete interface that is generated dynamically through
introspection and configured via admin models.
6.3. NumPy
NumPy is a library for the Python programming language, adding support for large, multi-
dimensional arrays and matrices, along with a large collection of high-level mathematical
functions to operate on these arrays. Besides its obvious scientific uses, NumPy can also be
used as an efficient multi-dimensional container of generic data. Arbitrary data-types can
be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety
of databases. NumPy is the fundamental package for scientific computing with Python. It
contains among other things:
a powerful N-dimensional array object
sophisticated (broadcasting) functions
tools for integrating C/C++ and Fortran code
useful linear algebra, Fourier transform, and random number capabilities
6.4. Pandas
Pandas is a software library written for the Python programming language for data manip-
ulation and analysis. In particular, it offers data structures and operations for manipulating
numerical tables and time series. It offers wide range of features including DataFrame object
for data manipulation with integrated indexing, tools for reading and writing data between
in-memory data structures and different file formats, data alignment and integrated handling
of missing data and more.
6.5. HTML/CSS
Hypertext Markup Language (HTML) is the standard markup language for creating web
pages and web applications. Web browsers receive HTML documents from a web server
or from local storage and render them into multimedia web pages. HTML describes the
structure of a web page semantically and originally included cues for the appearance of the
document. Cascading Style Sheets (CSS) is a style sheet language used for describing the
presentation of a document written in a markup language. It is most often used to set the
visual style of web pages and user interfaces written in HTML.
6.6. Javascript
JavaScript (JS) is a high-level, dynamic, weakly typed, object-based, multi-paradigm, and
interpreted programming language. Alongside HTML and CSS, JavaScript is one of the
three core technologies of World Wide Web content production. It is used to make web
pages interactive.
6.7. PostgreSQL
PostgreSQL is an object-relational database management system (ORDBMS) with an em-
phasis on extensibility and standards compliance. As a database server, its primary functions
are to store data securely and return that data in response to requests from other software ap-
plications. It can handle workloads ranging from small single-machine applications to large
Internet-facing applications (or for data warehousing) with many concurrent users. Post-
greSQL is ACID-compliant and transactional. PostgreSQL has updatable views and mate-
rialized views, triggers, foreign keys; supports functions and stored procedures, and other
expandability.
6.8. Git
Git is a version control system (VCS) for tracking changes in computer files and coordinating
work on those files among multiple people. It is primarily used for source code management
in software development, but it can be used to keep track of changes in any set of files.
As a distributed revision control system it is aimed at speed, data integrity, and support for
distributed, non-linear workflows.
6.9. Leaflet
Leaflet is the leading open-source JavaScript library for mobile-friendly interactive maps.
Leaflet is designed with simplicity, performance and usability in mind. It works efficiently
across all major desktop and mobile platforms, can be extended with lots of plugins, has a
beautiful, easy to use and well-documented API and a simple, readable source code that is a
joy to contribute to.
6.10. Rapidminer
RapidMiner is a data science software platform developed by the company of the same name
that provides an integrated environment for data preparation, machine learning, deep learn-
ing, text mining, and predictive analytics. It is used for business and commercial applications
as well as for research, education, training, rapid prototyping, and application development
and supports all steps of the machine learning process including data preparation, results
visualization, model validation and optimization. RapidMiner is developed on an open core
model.
6.11. Hadoop
Apache Hadoop is a collection of open-source software utilities that facilitate using a net-
work of many computers to solve problems involving massive amounts of data and compu-
tation. It provides a software framework for distributed storage and processing of big data
using the MapReduce programming model.
Modules: The project includes these modules:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFSTM): A distributed file system that provides
high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Hadoop is used for storing and processing big data. In Hadoop, data is stored on inexpensive
commodity servers that run as clusters. Its distributed file system allows concurrent
processing and fault tolerance. The Hadoop MapReduce programming model is used for faster
storage and retrieval of data from its nodes.
Hadoop is an open-source, Java-based framework. It was developed by Doug Cutting and
Michael J. Cafarella. It is managed by the Apache Software Foundation and licensed under the
Apache License 2.0.
6.12. Web Socket
The WebSocket protocol is a core technology of modern, real-time web applications. It pro-
vides a bidirectional channel for delivering data between clients and servers. It gives you the
flexibility of a TCP connection with the additional security model and meta data built into the
HTTP protocol. Unlike HTTP, WebSocket provides full-duplex communication. Addition-
ally, WebSocket enables streams of messages on top of TCP. TCP alone deals with streams
of bytes with no inherent concept of a message. Before WebSocket, port 80 full-duplex
communication was attainable using Comet channels; however, Comet implementation is
nontrivial, and due to the TCP handshake and HTTP header overhead, it is inefficient for
small messages. The WebSocket protocol aims to solve these problems without compromis-
ing security assumptions of the web.
The WebSocket protocol specification defines ’ws’ (WebSocket) and ’wss’ (WebSocket Se-
cure) as two new uniform resource identifier (URI) schemes[4] that are used for unencrypted
and encrypted connections, respectively. Apart from the scheme name and fragment, the rest
of the URI components are defined to use URI generic syntax. Using browser developer
tools, developers can inspect the WebSocket handshake as well as the WebSocket frames.
6.13. Rest Framework
REST is a shorthand for REpresentational State Transfer. It describes an architecture which
is used for the purpose of web APIs for data communication.
It also supports some of the common HTTP methods to make interaction between the ma-
chines or applications.
Some of the HTTP methods that are commonly used in REST architecture are:
GET - It returns the requested records or data; GET is used to retrieve a resource.
PUT - It is used to change the state, or update a resource, which can be an object, file
or block.
POST - It is used to create that resource.
DELETE - It is used to remove a particular resource.
7. RESULTS AND DISCUSSIONS
7.1. Statistical Analysis
7.1.1. Environmental anomalies
The data (Kathmandu, 2010 to 2019) consists of measurements of weather parameters at a stationary
station with a sampling interval ranging from three to eight hours over a period of almost 9
years. However, we discuss only the Temperature variable in this section. The data-
set consists of 15701 rows and 5 columns, namely Date, Time, Temperature, Pressure and
Relative Humidity (see table 7.1).
The data-set consists of time-series measurement of weather parameters from 2010-01-02
11:45:00 to 2019-12-06 23:45:00. The separate attributes for date and time have been aggregated
to form a new attribute date time to make computations easier (see table 7.2).
Since the data had already been cleaned, no missing values or errors were found. New
columns were generated from date time (year, quarter, month, day, weekday) to enable the
RNN to learn variations properly.
    Date      Time      Temperature  Pressure (mm of Hg)  Rel. Humidity
0   19-01-10  5:45:00   4.0000       652.1000             98
1   19-01-10  11:45:00  16.6000      652.6000             58
2   19-01-10  23:45:00  6.4000       651.5000             87
3   20-01-10  5:45:00   3.3000       652.7000             99
4   20-01-10  23:45:00  6.8000       653.0000             100
Table 7.1: Raw data for weather (Kathmandu from 2010 to 2019).
date time Temp. year quarter month day weekday
15696 2019-12-06 11:45:00 28.4000 2019 4 12 6 1
15697 2019-12-06 14:45:00 29.8000 2019 4 12 6 1
15698 2019-12-06 17:45:00 26.6000 2019 4 12 6 1
15699 2019-12-06 20:45:00 22.4000 2019 4 12 6 1
15700 2019-12-06 23:45:00 20.3000 2019 4 12 6 1
Table 7.2: Data for weather (Kathmandu from 2010 to 2019) after removing missing values
and aggregation of date and time into single attribute and generation of new attributes.
Several statistical tests can be used to quantify whether the data looks as though it was
drawn from a Gaussian distribution. D'Agostino's K² test was used for this purpose. In the
SciPy implementation of the test, the p value was interpreted as follows:
p ≤ alpha: reject H0, not normal.
p > alpha: fail to reject H0, normal.
where value of alpha was taken to be 0.05.
The result of statistics test was found as follows:
Statistics=1000.359, p=0.000
From above result, we can say that data does not look Gaussian and reject H0.
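The check described above corresponds to scipy.stats.normaltest; a hedged sketch on placeholder data is shown below (the statistics reported in this section come from the real temperature column, not from this synthetic sample).

import numpy as np
from scipy.stats import normaltest

sample = np.random.normal(18.2, 6.9, 15701)       # placeholder sample, not the real data
stat, p = normaltest(sample)                      # D'Agostino's K^2 test
alpha = 0.05
print("Statistics=%.3f, p=%.3f" % (stat, p))
print("fail to reject H0 (looks Gaussian)" if p > alpha else "reject H0 (not Gaussian)")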
Kurtosis and skewness were also calculated to determine if the data distribution departs from
the normal distribution. Following result was obtained:
Kurtosis of normal distribution: -0.6712438399432172
Skewness of normal distribution: -0.3353602422999569
Kurtosis describes the heaviness of the tails of a distribution. A normal distribution has a kurtosis
close to 0. If the kurtosis is greater than zero, the distribution has heavier tails. If the
kurtosis is less than zero, the distribution has lighter tails. Our kurtosis is less than
zero, so our data has light tails. Similarly, if the skewness is between -0.5 and 0.5, the data
are fairly symmetrical. If the skewness is between -1 and -0.5 or between 0.5 and 1, the
data are moderately skewed. If the skewness is less than -1 or greater than 1, the data are
highly skewed. Since the skewness of our data lies between -0.5 and 0.5, our data is fairly symmetrical
(see figure 7.1).
Figure 7.1: Data distribution and skewness test for temperature
The figure 7.2 shows Time series plot of Temperature from 2010-01-02 11:45:00 to 2019-
12-06 23:45:00. Temperature is found to be increasing and decreasing throughout the year
for every year. When we compare box-plots side by side for each year (see figure 7.3), we
notice that the median Temperature in 2010 and 2016 is slightly higher than the other years’.
So these years might have been warmer than other years.
Figure 7.2: Time series plot of temperature
Figure 7.3: Box plots of temperature yearly and quarterly
Temperature is higher in second and third quarter which is summer time and lower in first
and fourth quarter which is winter time. Also the third quarter (July to September) is warmer
than second quarter (April to June) and first quarter (January to March) is colder than fourth
quarter (October to December).
count mean std min 25% 50% 75% max
Temp. 15701 18.2119 6.9298 0.0000 13.0000 19.5000 23.2000 39.0000
Table 7.3: Temperature distribution
Figure 7.4: Normal probability plot for distribution of temperature.
Various statistical parameters for the temperature data are shown in table 7.3, with the mean, standard
deviation, minimum value, first quartile, second quartile, third quartile and maximum temperature
being 18.21 C, 6.93 C, 0 C, 13.00 C, 19.50 C, 23.20 C and 39.00 C respectively.
The minimum temperature of Kathmandu has not decreased below 0 degrees Celsius and has
not increased beyond 39 degrees Celsius from 2010 to 2019, which is not quite accurate. So
there must have been some errors in the measurement of temperature; most likely the offset
in the sensor value measurement had increased. The normal probability plot also shows the data set is
not so far from normally distributed (see figure 7.4).
Figure 7.5: Average temperature re-sampled over one day and week
Figure 7.6: Average temperature re-sampled over one month, quarter and year.
Average Temperature re-sampled over day, week, month, quarter and year has been shown
in figure 7.5 and 7.6 respectively. In general, our time series has an upward and
downward trend in a cyclical manner. Higher temperatures were recorded in 2010 and lower
in 2019, which is not accurate.
In mean Temperature re-sampled over day, larger fluctuations can be seen with continuous
increasing and decreasing pattern throughout the time-series.
The mean temperature re-sampled over week does not have much significance for tempera-
ture attribute. Mean Temperature re-sampled over month as well as quarter also shows some
seasonal trends in temperature throughout the year.
Figure 7.7: Mean temperature grouped by year, quarter, month and day.
Figure 7.8: Temperature by years
Plot of mean Temperature grouped by year, quarter, month and day (see figure 7.7) confirmed
our previous discoveries. Temperature falls to its lowest value in Kathmandu in January and
reaches its maximum in June. Mean temperature by quarter also confirms the previous discovery
that lowest mean temperature occurs in first quarter and highest mean temperature in third
quarter.
Figure 7.8 shows temperature by years, where each year shows similar trend in temperature
except for 2019 where complete data has not been collected yet.
p-value 0.0418
#Lags Used 29.0000
Number of Observations Used 3022.0000
Critical Value (1%) -3.4325
Critical Value (5%) -2.8625
Critical Value (10%) -2.5673
Table 7.4: Results of Dickey-Fuller test for temperature
Figure 7.9: Rolling mean and standard deviation
The Dickey–Fuller test tests the null hypothesis that a unit root is present in an auto-regressive
model. The alternative hypothesis is different depending on which version of the test is used,
but is usually stationarity or trend-stationarity. Stationary series has constant mean and vari-
ance over time. Rolling average and the rolling standard deviation of time series do not
change over time.
Null Hypothesis (H0): It suggests the time series has a unit root, meaning it is non-stationary.
It has some time dependent structure.
Alternate Hypothesis (H1): It suggests the time series does not have a unit root, meaning it
is stationary. It does not have time-dependent structure.
p-value > 0.05: Fail to reject the null hypothesis (H0); the data has a unit root and is non-stationary.
p-value ≤ 0.05: Reject the null hypothesis (H0); the data does not have a unit root and is
stationary.
From the results obtained (see table 7.4), we reject the null hypothesis H0: the data does
not have a unit root and is stationary. Figure 7.9 shows the rolling mean and standard deviation of our
time-series data.
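The test reported in table 7.4 corresponds to the augmented Dickey-Fuller test in statsmodels; a hedged sketch on a placeholder series is shown below (the figures in the table come from the real temperature series, not from this example).

import numpy as np
from statsmodels.tsa.stattools import adfuller

series = np.random.normal(0, 1, 500).cumsum()     # placeholder random-walk-like series
adf_stat, p_value, used_lags, n_obs, critical_values, _ = adfuller(series, autolag="AIC")
print("ADF statistic:", adf_stat)
print("p-value:", p_value)
print("critical values:", critical_values)        # 1%, 5% and 10% thresholds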
7.1.2. Pollution proxies
    DateTime          Time      Date        PM2.5  CO2    Formal.  VOCs
0   11/29/18 2:01 PM  14:01:00  29/11/2018  47.0   0.0    0.0350   0.0000
1   11/29/18 2:02 PM  14:02:00  29/11/2018  50.0   0.0    0.0320   0.0000
2   11/29/18 2:03 PM  14:03:00  29/11/2018  47.0   721.0  0.0290   0.2580
3   11/29/18 2:04 PM  14:04:00  29/11/2018  49.0   672.0  0.0260   0.2850
4   11/29/18 2:05 PM  14:05:00  29/11/2018  46.0   645.0  0.0260   0.3560
Table 7.5: Raw data for pollution logged by our data logger. (Kathmandu from 2018 to
2019).
The pollution data (Kathmandu, 2018 to 2019) consists of measurements of pollution parameters at a stationary station with a sampling rate of one minute over a period of almost one year. However, only the CO2 variable is discussed in this section. The data-set consists of 87394 rows with attributes Date, Time, PM2.5, CO2, Formaldehyde and VOCs (see table 7.5).
date time CO2 year quarter month day weekday
87389 2019-12-05 23:55:00 424.0000 2019 4 12 5 1
87390 2019-12-05 23:56:00 427.0000 2019 4 12 5 1
87391 2019-12-05 23:57:00 424.0000 2019 4 12 5 1
87392 2019-12-05 23:58:00 427.0000 2019 4 12 5 1
87393 2019-12-05 23:59:00 426.0000 2019 4 12 5 1
Table 7.6: Data for pollution (Kathmandu from 2018 to 2019) after removing missing
values and aggregation of date and time into single attribute and generation of new
attributes.
The data-set consists of time-series measurements of pollution parameters from 2018-01-12 00:00:00 to 2019-12-05 23:59:00. The separate attributes for date and time have been aggregated into a new attribute, date time, to make computations easier (see table 7.6). Since the data had already been cleaned, no missing values or errors were found. New columns (year, quarter, month, day, weekday) were generated from date time to enable the RNN to learn the variations properly. The kurtosis of the distribution was found to be 106.09 and the skewness 6.44. The kurtosis is greater than zero, so the data is heavy-tailed. The skewness is greater than one, which implies that the data is highly skewed (see figure 7.10). Figure 7.11 shows the time series plot of CO2 from 2018-01-12 00:00:00 to 2019-12-05 23:59:00; CO2 is found to increase and decrease throughout the time series.
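A minimal sketch of this preprocessing is given below, assuming the raw logger export of table 7.5 has been read into a pandas DataFrame raw_df; the column names follow table 7.5 and everything else is an assumption for illustration.

import pandas as pd

def prepare_co2(raw_df: pd.DataFrame) -> pd.DataFrame:
    df = raw_df.copy()
    # Aggregate the separate Date and Time attributes into a single date_time column.
    df["date_time"] = pd.to_datetime(df["Date"] + " " + df["Time"], dayfirst=True)
    # Generate calendar features so the RNN can learn seasonal variation.
    df["year"] = df["date_time"].dt.year
    df["quarter"] = df["date_time"].dt.quarter
    df["month"] = df["date_time"].dt.month
    df["day"] = df["date_time"].dt.day
    df["weekday"] = df["date_time"].dt.weekday
    return df.set_index("date_time").sort_index()

def distribution_summary(series: pd.Series) -> None:
    # pandas reports excess kurtosis (0 for a normal distribution) and skewness.
    print("kurtosis: %.2f, skewness: %.2f" % (series.kurtosis(), series.skew()))

# Example usage (hypothetical):
# co2 = prepare_co2(raw_df)
# distribution_summary(co2["CO2"])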
Various statistical parameters for the CO2 data are shown in table 7.7: the mean, standard deviation, minimum value, first quartile, median, third quartile and maximum CO2 level are 508.20, 188.72, 0, 419, 439, 505 and 7096 respectively.
Figure 7.10: Data distribution and skewness test for CO2
Figure 7.11: Time series plot of CO2
count mean std min 25% 50% 75% max
CO2 87394 508.2007 188.7191 0.0 419.0 439.0 505.0 7096.0
Table 7.7: CO2 distribution
Figure 7.12: Normal probability plot for distribution of CO2.
The average CO2 level of Kathmandu was found to be 508.20, which is well above the level suggested by the WHO, with the highest reading being 7096 during the observation period. The normal probability plot also shows that the data set is far from normally distributed (see figure 7.12).
Figure 7.13: Mean CO2 level grouped by year, quarter, month and day.
Figure 7.14: CO2 emission in weekdays and weekends
Figure 7.14 shows that the CO2 level is very high during weekdays and attains its lowest value at weekends. This is consistent with the fact that a larger number of vehicles and industries operate on weekdays than at weekends. From the results obtained (see table 7.8), the p-value is observed to be 0.71, which is greater than 0.05. We therefore fail to reject the null hypothesis H0: the data has a unit root and is non-stationary.
p-value 0.7086
#Lags Used 10.0000
Number of Observations Used 65.0000
Critical Value (1%) -3.5352
Critical Value (5%) -2.9072
Critical Value (10%) -2.5911
Table 7.8: Results of Dickey-Fuller test for CO2.
7.2. Modelling Using RNN
Figure 7.15: Prediction of Temperature variation using RNN (Kathmandu from 2010 to
2019). The graph shows a part of the data set used for training where the solid curve is the
model predicted value after the training and dotted curve is the data points in the original
data set.
Prediction of the average temperature for the next time periods based on the weather conditions and pollution was done using a deep learning RNN model. The data is converted to an appropriate format and then transformed into a supervised learning problem, after which the dataset is split into train and test sets. A moving forward window of size 20 was used, meaning the first 20 data points are used as input X to predict y1, the 21st data point. The inputs (X) are reshaped into the format expected by the first layer of the model (LSTM). The model is then defined and fitted. The neural network architecture was defined with 15 neurons in the first hidden layer (LSTM), 14 neurons in the second hidden layer (LSTM) and 1 neuron in the output layer (Dense) for prediction, with a sigmoid activation at the final layer (see figure 7.16). Mean Squared Error (MSE) was used as the loss function and the efficient Adam version of stochastic gradient descent for optimisation. Dropout was applied to the second hidden layer to prevent over-fitting. The model was fitted for 10 training epochs with a batch size of 256.
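A minimal Keras sketch of the windowing and network just described is given below, assuming the temperature series has already been scaled to [0, 1]; names such as values and the dropout rate of 0.2 are assumptions, while the layer sizes, loss, optimiser, epochs and batch size follow the text.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

WINDOW = 20  # the first 20 points are used to predict the 21st

def make_windows(values: np.ndarray, window: int = WINDOW):
    # Transform the series into a supervised learning problem with a moving window.
    X, y = [], []
    for i in range(len(values) - window):
        X.append(values[i:i + window])
        y.append(values[i + window])
    # Reshape inputs to (samples, timesteps, features) as expected by the LSTM layer.
    return np.array(X).reshape(-1, window, 1), np.array(y)

def build_model(window: int = WINDOW) -> Sequential:
    model = Sequential([
        LSTM(15, return_sequences=True, input_shape=(window, 1)),  # first hidden layer
        LSTM(14),                                                   # second hidden layer
        Dropout(0.2),        # dropout on the second hidden layer (rate assumed)
        Dense(1, activation="sigmoid"),                             # output layer
    ])
    model.compile(loss="mse", optimizer="adam")
    return model

# Example usage (hypothetical):
# X_train, y_train = make_windows(train_values)
# model = build_model()
# model.fit(X_train, y_train, epochs=10, batch_size=256,
#           validation_data=(X_test, y_test))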
The training and test losses were tracked during training by setting the validation data argument in the fit() function. After the model was fitted, the entire test dataset was used for forecasting. With forecasts and actual values in their original scale, an error score for the model was calculated; in this case the Root Mean Squared Error (RMSE), which gives the error in the same units as the variable itself.
A prediction made on the test data set is shown in figure 7.15. The model is found to have a root mean square error of 5.760 +/- 0.000 degrees Fahrenheit. Observations made before selection of the final model are listed in table 7.9.
SN Neurons in Hidden Layer R2 Score (train) R2 Score (test) RMSE (C)
1 (5,10) 0.34 0.42 9.86
2 (6,14) 0.50 0.58 8.65
3 (8,12) 0.27 0.35 10.53
4 (7,9) 0.41 0.49 9.45
5 (9,16) 0.51 0.6 8.41
6 (14,11) 0.51 0.59 8.48
7 (16,5) 0.28 0.35 10.10
8 (19,19) 0.49 0.57 8.54
9 (15,14) 0.52 0.61 8.37
Table 7.9: Comparison of various structures of neural network
Figure 7.16: LSTM tensor graph used in training. Two LSTM layers are used with a
dropout and a dense layer.
7.3. Modelling Using ARIMA
First the statistical properties of the data is studied for observing its pattern and properties
as shown in figure 7.17. The air temperature data is visualised using a method called time-
series decomposition that allows us to decompose the time series data into three distinct
components namely: trend, seasonality and residuals( noise) as shown in figure 7.15. It can
be seen that the air temperature is minimum at the end of the year since December is the
peak winter season in Nepal. This decomposition helps in analysing the trend of the time
series data and thus modelling it.
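A minimal sketch of this decomposition with statsmodels is shown below, assuming the air temperature series temperature has a datetime index; re-sampling to monthly means is an assumption for illustration.

import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

def decompose_temperature(temperature: pd.Series):
    # Resample to a regular monthly frequency before decomposing.
    monthly = temperature.resample("M").mean()
    # Additive decomposition into trend, seasonal and residual (noise) components.
    result = seasonal_decompose(monthly, model="additive", period=12)
    result.plot()  # observed, trend, seasonal and residual panels as in figure 7.18
    return result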
Figure 7.17: Statistics of the air temperature data studied to observe its skewness, noise and
correlations. Statistics is observed prior to modelling of the data.
Figure 7.18: Decomposition of the time-series air temperature data into trend, seasonality
and noise(residual) with the actual data shown in the first row.
Figure 7.19: One-step ahead forecast using continuous training and prediction of
temperature variation using SARIMA(1, 1, 1)(0, 1, 1, 12) on 10 years air temperature data.
The model is selected based on the minimum value of AIC.
Figure 7.20: Plot of Auto-correlation Function obtained from the data. The graph shows the
most significant auto-correlation obtained at a lag of 8.
Figure 7.19 shows one-step-ahead forecasting of the 10-year air temperature data, where the data-set consists of time-series measurements of weather parameters from 2010-01-02 11:45:00 to 2019-12-06 23:45:00. This is done using a seasonal ARIMA model; the model used in this case is ARIMA(1, 1, 1)x(0, 1, 1, 12), selected based on the minimum value of AIC.
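A minimal sketch of fitting this seasonal model with statsmodels is shown below; the order (1, 1, 1)x(0, 1, 1, 12) is the one reported above, and the monthly series monthly is assumed to come from the decomposition step.

import statsmodels.api as sm

def fit_sarima(monthly):
    model = sm.tsa.statespace.SARIMAX(
        monthly,
        order=(1, 1, 1),                 # non-seasonal (p, d, q)
        seasonal_order=(0, 1, 1, 12),    # seasonal (P, D, Q, s) with a 12-month period
        enforce_stationarity=False,
        enforce_invertibility=False,
    )
    results = model.fit(disp=False)
    print("AIC: %.3f" % results.aic)     # the order was chosen by minimising AIC
    # One-step-ahead, in-sample prediction as plotted in figure 7.19.
    prediction = results.get_prediction(dynamic=False)
    return results, prediction.predicted_mean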
The AIC of the model whose predictions are shown in figure 7.19 is found to be 297.845. The Mean Squared Error (MSE) of the forecast is 2.72 and the Root Mean Squared Error (RMSE) is 1.65. The MSE of an estimator measures the average of the squares of the errors, that is, the average squared difference between the estimated values and the true values.
The MSE is a measure of the quality of an estimator: it is always non-negative, and the smaller the MSE, the closer the model is to the best fit. The air temperature in our data ranges from around 0 to 39. The RMSE of 1.65 indicates that the model was able to forecast the air temperature in the test set to within 1.65 of the real air temperature.
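Written out in the standard way, with y_i the observed and \hat{y}_i the forecast temperatures over n test points, these error measures are:

\[
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
\]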
Likewise, for forecasting from a shorter-duration time series, a rolling-forecast ARIMA model is used on a month of 3-hourly air temperature data from April 2019. The data-set used for this ARIMA model consists of a month-long time series of weather measurements from 2019-04-01 00:00:00 to 2019-04-29 21:00:00.
From the autocorrelation plot of this data, shown in figure 7.20, the most significant autocorrelation occurs at a lag of 8, so the AR order of the ARIMA model, p, has been set to 8. Since the data is not stationary, differencing is required to make it stationary; the degree of differencing, d, has been set to 1, based on the minimum standard deviation observed as d is varied. The moving average (MA) order, q, has been set to 1, so that each forecast is adjusted in the direction of the error made in the previous forecast.
In the rolling forecast of our data, the model used for training is ARIMA(8, 1, 1). A rolling-forecast model is used because the forecast depends on the observations used by the AR terms as well as on differencing from prior time steps. A simple way to perform a rolling forecast is to re-create the ARIMA model after a certain number of new observations are received, re-training the model with the new observations appended to the previous training set. Forecasting for the next 18 hours was done using this rolling-forecast ARIMA model.
First, the hourly air temperature data was converted to 3-hourly data, so that each day had 8 data points. An ARIMA model was then trained to forecast the next 6 points, after which the new observations were provided to re-create the ARIMA model. The forecast produced by the model in this case is shown in figure 7.21. The same model was also used to predict future values without continuous re-training, and the result is shown in figure 7.22.
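A minimal sketch of this rolling-forecast procedure is given below, assuming the hourly April 2019 temperatures are in a pandas Series hourly; the split point and window bookkeeping are illustrative assumptions, while the order (8, 1, 1), the 3-hourly resampling and the 6-point forecast window follow the text.

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def rolling_forecast(hourly: pd.Series, points_per_window: int = 6):
    # Convert the hourly data to 3-hourly values, i.e. 8 points per day.
    three_hourly = hourly.resample("3H").mean().dropna()
    n_train = len(three_hourly) // 2              # initial history (assumed split)
    history = list(three_hourly.iloc[:n_train])
    forecasts = []
    for start in range(n_train, len(three_hourly), points_per_window):
        # Re-create the ARIMA(8, 1, 1) model on the current history.
        fitted = ARIMA(history, order=(8, 1, 1)).fit()
        # Forecast the next 6 points (18 hours ahead).
        forecasts.extend(fitted.forecast(steps=points_per_window))
        # Append the newly observed points before the next re-training step.
        history.extend(three_hourly.iloc[start:start + points_per_window])
    return forecasts, three_hourly.iloc[n_train:]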
Figure 7.21: Continuous Training and Prediction of Temperature variation using ARIMA.
Each window contains data points of a single day. Solid curve is the predicted output from
the model trained using the real data until previous day and the dashed curve represents the
test data.
Figure 7.22: Using ARIMA model to predict Temperature variations. The solid curve is the
predicted output. The same model as in fig. 7.21 is used without continuous training.
8. CONCLUSION
In this project we collected, analysed and visualised data from multiple mobile physical sensors as well as stationary stations. Such raw data are erroneous and noisy, with high dimensionality and large volume. They therefore required a robust data-logging system, for which a proper architecture for distributed storage and data handling was studied and implemented. Encryption algorithms for GPS data have also been explored for transmission over public networks, for the purpose of data security and integrity. The concept can also be extended to other applications such as road traffic analysis and city mobility analysis, and these technologies and applications can serve as a foundation for several other projects.
For spatio-temporal data visualisation, map-based visualisation in combination with graphs and tables was found to be effective. For the analysis of these spatio-temporal data, preliminary data-mining steps such as anomaly detection, removal of duplicate data and outliers, and handling of missing data were followed by various supervised and unsupervised machine learning algorithms. To improve performance in further algorithmic analysis such as clustering, classification and regression, outliers were detected in the data-set and the anomalous data points were removed.
Various clustering algorithms such as k-means and x-means were applied, as were classification algorithms such as deep learning, decision trees, random forests, gradient boosted trees and support vector machines. For the spatio-temporal weather and pollution data, however, results from regression analysis were found to be more useful than those from clustering and classification. The preliminary data-set collected from the area around the Kathmandu valley is able to map some interesting features and environmental proxies that are visualised, and the patterns and variations in it are explored using models such as ARIMA and RNN. The ARIMA model and its architecture were studied and implemented based on the pattern and properties of the data. The root mean square error (RMSE) of the ARIMA model for forecasting the 10-year air temperature data was found to be 1.65, where the temperature ranged from 0 to 39, thus providing a fairly good prediction of the values.
For the weather data of Kathmandu (2010 to 2019), a neural network architecture was studied and implemented for temperature prediction. Among the various structures of the network, the best result was obtained for a 3-layer network (a first LSTM layer of 15 neurons, a second LSTM layer of 14 neurons and a final Dense layer of 1 neuron), for which the R2 score on the training data-set, the R2 score on the test data-set and the root mean square error were observed to be 0.52, 0.61 and 8.37 C respectively.
9. LIMITATIONS AND FUTURE ENHANCEMENTS
9.1. Limitations
This project was carefully planned and executed from the beginning. Nevertheless, it faced some limitations due to various factors. Different aspects of the project, such as the nature of the data, the visualisation methods and the data storage method, have their own limitations. Some of the limitations faced by the project are:
1. The sensors used for collecting weather and pollution data are highly expensive. Sometimes the money spent on the sensors does not justify the accuracy and precision of the data they provide, and increasing the number of sensors consumes the entire budget of the project.
2. Although the sensors used in the project were re-calibrated for data collection, the data still contained a number of errors and missing values.
3. Attempts have been made to make the visualisation effective for users with the help of map-based visualisation along with charts and tables. Still, different users confronted with the same data visualisations may not draw the same conclusions, depending on their previous experience and level of expertise.
4. Although the use of a big data architecture has made access to large amounts of spatio-temporal data easier and faster, the lack of resources for adding nodes to the architecture has prevented us from fully utilising the power of big data technology.
5. Creating the models requires powerful processing resources, which are expensive and not easily available.
6. Installing the sensors on vehicles has hampered the accuracy of the data, as vehicle-related factors such as heat produced by the engine and wind due to high speed affect the operation of the sensors.
7. The data-logging technology used in the sensors makes it difficult to examine their working status.
9.2. Future Enhancements
It is the nature of projects in the field of computer science and information technology to re-
quire changes and modifications as demand changes and technology advances. This project
will be enhanced in the future for better visualisation of spatio-temporal weather and pollu-
tion data.
1. Self-powered sensors, powered by perpetual natural resources such as solar energy, shall be used.
2. Quality sensors with better accuracy and precision, as well as the ability to resist heat and harsh environmental conditions, will be used in the future. Such sensors will measure the data with fewer missing values and errors and with greater stability.
3. The existing database used for data storage, PostgreSQL, can be replaced by a big data architecture to meet the requirements of operating on huge amounts of spatio-temporal data.
4. The system application is currently limited to the web browser. The system made for the visualisation of data shall be extended to a mobile application as well.
5. The near-real-time Pi-sensor data shall be used for modelling and finding patterns.
6. Sensors can be made to communicate with each other through IoT technology. This can save external network bandwidth as well as give rise to other interesting applications.
7. A distributed computing architecture can be built to utilise the processing power of the sensor devices when idle, for on-device data processing and anomaly detection.
References
[1] F. Perera, "Pollution from fossil-fuel combustion is the leading environmental threat to global pediatric health and equity: Solutions exist," International Journal of Environmental Research and Public Health, vol. 15, no. 1, p. 16, 2017.
[2] P. S. Mahapatra, S. P. Puppala, B. Adhikary, K. L. Shrestha, D. P. Dawadi, S. P. Paudel, and A. K. Panday, "Air quality trends of the Kathmandu Valley: A satellite, observation and modeling perspective," 2019.
[3] F. W. Cathey and D. J. Dailey, "Transit vehicles as traffic probe sensors," Transportation Research Record, vol. 1804, no. 1, pp. 23–30, 2002.
[4] P. Mirchandani and L. Head, "A real-time traffic signal control system: architecture, algorithms, and analysis," Transportation Research Part C: Emerging Technologies, vol. 9, no. 6, pp. 415–432, 2001.
[5] S. Bakrania, "Urbanisation and urban growth in Nepal," Birmingham, UK: GSDRC, University of Birmingham, 2015.
[6] K. Aberer, S. Sathe, D. Chakraborty, A. Martinoli, G. Barrenetxea, B. Faltings, and L. Thiele, "OpenSense: open community driven sensing of environment," in Proceedings of the ACM SIGSPATIAL International Workshop on GeoStreaming. ACM, 2010, pp. 39–42.
[7] D. G. Fox, "Judging air quality model performance: A summary of the AMS workshop on dispersion model performance, Woods Hole, Mass., 8–11 September 1980," Bulletin of the American Meteorological Society, vol. 62, no. 5, pp. 599–609, 1981.
[8] F. C. Collins, "A comparison of spatial interpolation techniques in temperature estimation," Ph.D. dissertation, Virginia Tech, 1995.
[9] N. Cressie, "The origins of kriging," Mathematical Geology, vol. 22, no. 3, pp. 239–252, 1990.
[10] T. Larson, S. B. Henderson, and M. Brauer, "Mobile monitoring of particle light absorption coefficient in an urban area as a basis for land use regression," Environmental Science & Technology, vol. 43, no. 13, pp. 4672–4678, 2009.
[11] R. Bruntrup, S. Edelkamp, S. Jabbar, and B. Scholz, "Incremental map generation with GPS traces," in Proceedings, 2005 IEEE Intelligent Transportation Systems. IEEE, 2005, pp. 574–579.
[12] T. Guo, K. Iwamura, and M. Koga, "Towards high accuracy road maps generation from massive GPS traces data," in 2007 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 2007, pp. 667–670.
[13] J. Krumm, "A survey of computational location privacy," Personal and Ubiquitous Computing, vol. 13, no. 6, pp. 391–399, 2009.
[14] K.-L. Hui, H. H. Teo, and S.-Y. T. Lee, "The value of privacy assurance: an exploratory field experiment," MIS Quarterly, pp. 19–33, 2007.
[15] G. Zyskind, O. Nathan et al., "Decentralizing privacy: Using blockchain to protect personal data," in 2015 IEEE Security and Privacy Workshops. IEEE, 2015, pp. 180–184.
[16] S. Chakraborty, N. Nagwani, and L. Dey, "Weather forecasting using incremental k-means clustering," arXiv preprint arXiv:1406.4756, 2014.
[17] E. S. Pearson, "I. Note on tests for normality," Biometrika, vol. 22, no. 3-4, pp. 423–424, 1931.
[18] R. B. D'Agostino, A. Belanger, and R. B. D'Agostino Jr, "A suggestion for using powerful and informative tests of normality," The American Statistician, vol. 44, no. 4, pp. 316–321, 1990.
[19] F. J. Anscombe and W. J. Glynn, "Distribution of the kurtosis statistic b2 for normal samples," Yale University, New Haven, Conn., Dept. of Statistics, Tech. Rep., 1975.
[20] D. A. Dickey and W. A. Fuller, "Distribution of the estimators for autoregressive time series with a unit root," Journal of the American Statistical Association, vol. 74, no. 366a, pp. 427–431, 1979.
[21] W. Enders, Applied Econometric Time Series. John Wiley & Sons, 2008.
[22] D. Pelleg, A. W. Moore et al., "X-means: Extending k-means with efficient estimation of the number of clusters," in ICML, vol. 1, 2000, pp. 727–734.
[23] D. M. Hawkins, Identification of Outliers. Springer, 1980, vol. 11.
[24] P. Coulibaly and N. Evora, "Comparison of neural network methods for infilling missing daily weather records," Journal of Hydrology, vol. 341, no. 1-2, pp. 27–41, 2007.
[25] I. Gorton, P. Greenfield, A. Szalay, and R. Williams, "Data-intensive computing in the 21st century," Computer, vol. 41, no. 4, pp. 30–32, 2008.
[26] D. Hankerson and A. Menezes, Elliptic Curve Cryptography. Springer, 2011.
A. APPENDIX
Figure A.1: Project schedule
Figure A.2: Home page
Figure A.3: Login page
Figure A.4: Screenshot of dashboard
Figure A.5: Screenshot of website
Figure A.6: Data visualisation tool
Figure A.7: About us page
Figure A.8: Sensor mounted on the top of vehicle for collecting weather data
Figure A.9: Bus with sensor mounted on top that is used for collection of data from the
route of Kathmandu to Pashupatinagar
Figure A.10: Data collection route path from Kathmandu to Pashupatinagar
Figure A.11: Epoch loss graph during LSTM model training. The graph shows the R2 value on the x-axis for 10 epochs.
Figure A.12: Hyper-parameter tuning output for various architectural variables in the LSTM model, coloured by R2 on the test set.
Figure A.13: Inside an LSTM layer of the LSTM model.