All content in this area was uploaded by Fatmasari Fatmasari on Dec 08, 2018
Web Scraping Techniques to Collect Weather Data in
South Sumatera
Fatmasari
Information System,
Computer Science Department
Universitas Bina Darma
Palembang, Indonesia
Fatmasasri@binadarma.ac.id
Yesi Novaria Kunang
Information System,
Computer Science Department
Universitas Bina Darma
Palembang, Indonesia
yesinovariakunang@binadarma.ac.id
Susan Dian Purnamasari
Information System,
Computer Science Department
Universitas Bina Darma
Palembang, Indonesia
susandian@binadarma.ac.id
Abstract—Several cities in South Sumatra have weather measurement
points owned by different institutions such as BMKG, Angkasa Pura,
LAPAN and others. However, obtaining the latest detailed weather data
for a given period is constrained by the bureaucratic process at each
institution. At the same time, weather datasets are needed for data
analytics research on weather prediction and for research on decision
support systems (DSS) that rely on weather patterns. On the other
hand, some websites provide real-time weather data for some cities.
In this research we therefore use web scraping to collect weather
data for several cities in South Sumatra from some of these websites.
Web scraping is a technique for retrieving specific content from a
web page. The data collected with this technique will form a database
or data warehouse that can be used in further research on data mining
for weather forecasting in South Sumatra, and that can later support
the development of weather-based decision support applications.
I. INTRODUCTION
Weather forecasts have an important impact on people,
especially in the economic and social spheres. Collecting
weather data allows us to analyze patterns in annual
weather data that affect temperature and precipitation.
Weather patterns have been studied for use in
agriculture [1], health [2],
transportation [3], [4] and city planning [5]. Forecasting the weather
is not easy because the weather is always changing and
is influenced by the nature and
dynamics of the atmosphere. Approaches to weather
forecasting depend heavily on the observed data and on the
procedures and methods of forecasting used. To
strengthen forecast analysis, data must be
measured not at a single point but at several points in order to
observe the movement of the atmosphere, clouds, wind
and so on, so that the forecast is more valid.
South Sumatra has several weather measurement
points owned by different agencies such as BMKG,
Angkasa Pura, LAPAN and others. However, obtaining the latest
detailed weather data for a given period is very difficult because of
the bureaucratic process at each agency. On the
other hand, sites such as https://weather.com,
https://www.accuweather.com, https://www.timeanddate.com and
https://www.worldweatheronline.com provide up-to-date
weather data online.
This research therefore uses web scraping
to collect data from sites that provide real-time data.
Web scraping makes it
possible to extract data from multiple websites into one
database or spreadsheet, which makes the collected data easier to
analyze and visualize. The data collected with this
technique will form a database or weather dataset.
This is preliminary research to prepare the weather
dataset that will be used in further research on
data analytics for South Sumatra weather forecasting, as well as on
weather patterns that can support decision making.
II. PREVIOUS WORK
Web scraping is a data collection technique in which a program
interacts with an API (application programming interface).
It is mostly done by writing
programs that automatically query a web server,
request data (usually HTML and other files that make up web
pages), and then parse the data to extract the necessary
information [6]. In practice web scraping draws on a variety of
programming techniques and technologies, such as data analysis,
natural language parsing and information security.
Several studies describe web scraping
techniques. Fernández et al. [7] discuss various
aspects of the semantics of web scraping and the
approach taken. Other studies discuss tools and
techniques that can be used for web scraping [8], [9], [10].
Many tools and techniques are available, most of them free
and easy to use. Johnson and Gupta study web content
mining techniques and relate them to existing scraping tools.
Their paper discusses data mining topics at length, setting out
a taxonomy of web mining and comparing it with web scraping.
Applications of web scraping are
widely discussed in several papers. Pereira and Vanitha [9]
discuss the tools and techniques used for scraping and its
impact on social networks. Web scraping has also been
used to collect rental listing data from the Craigslist website [11].
The collected data is used to analyze the housing market, urban
dynamics and human behavior. Polidoro et al. use web
scraping to conduct consumer price surveys
for consumer electronics products (goods) and
airfares. Their results show that obtaining statistical data with web
scraping saves time.
Web scraping has also been used to
search for literature, as in the study by Haddaway [12], which uses
import.io to search for grey literature. Similar research was
conducted by Josi and Abdillah [13], who searched for articles
with web scraping on the Garuda, ISJD and Google Scholar
portals. Landers et al. [14] use it to search for research data in
the field of psychology, while Marres and Weltevrede [15]
use scraping techniques to examine related
issues in live social research.
Web scraping in connection with weather
data is used by Novkovic et al. [4]. In their research, web
scraping is used to collect 15 years of traffic accident data and
relate it to meteorological data. Their data mining
results show that several weather variables are linked to
the level of traffic accidents.
III. WEB SCRAPING WEATHER DATA
Fig. 1. Web Scraping weather Data Process
The process of web scraping weather data can be seen in
Figure 1, which in detail consists of several stages as follows:
First, study the structure of the HTML
documents of all websites to be scraped.
This is done to identify the data and elements to be
retrieved or stored.
Second, write a crawling program as
a Python script using the Beautiful Soup and
Requests libraries. The scraped data are
stored in an Excel file.
Third, create a task scheduler to run the scraping script
every hour. The scheduler automatically
scrapes all the websites and saves the data
into the results file.
Fourth, extract the
crawled data. Extraction is done with
Pentaho Kettle. The data are
cleaned to remove unnecessary information
such as the units of the stored variables, and transformed to
adjust formats and data structures as needed (e.g. city
data, time format, date and more). The scraped files
are then merged into
one file to facilitate analysis.
Finally, create statistics from the weather data and analyze the data
according to the needs of application
development.
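The first two stages can be illustrated with a minimal sketch. The scripts in this study use Beautiful Soup and Requests against live pages; the sketch below instead uses Python's standard `html.parser` module on an in-memory string so that it runs without third-party packages or network access. The markup, class names and values are hypothetical and do not reproduce the structure of any of the scraped sites.

```python
from html.parser import HTMLParser

# Hypothetical markup: real weather pages use different tags and class names,
# which is why the first stage studies each site's HTML structure.
SAMPLE_HTML = """
<div class="current">
  <span class="temperature">31 C</span>
  <span class="humidity">66%</span>
  <span class="wind">9 km/h</span>
</div>
"""

# Classes of the elements we want to keep (assumed names).
WANTED = {"temperature", "humidity", "wind"}


class WeatherParser(HTMLParser):
    """Collect the text of <span> elements whose class is in WANTED."""

    def __init__(self):
        super().__init__()
        self.record = {}
        self._field = None  # name of the field currently being read

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in WANTED:
            self._field = css_class

    def handle_data(self, data):
        if self._field is not None:
            self.record[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def parse_weather(html):
    """Return a dict mapping field name to the raw scraped text."""
    parser = WeatherParser()
    parser.feed(html)
    return parser.record


print(parse_weather(SAMPLE_HTML))
```

The values are kept as raw text, units included, which matches the need for a later cleaning stage.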
IV. RESULTS
A. Web Scraping Process Weather Data
The weather web scraping application was developed using
HTML parsing in the Python programming language on the
Anaconda platform, running on the Windows 10 operating system.
The scripts use the Beautiful Soup 4 library
(https://www.crummy.com/software/BeautifulSoup/) and the
Requests library. The scraped sites are
https://www.worldweatheronline.com,
https://www.timeanddate.com and https://weather.com.
Two scripts were created to perform the scraping.
The first script scrapes the
worldweatheronline.com website for 8 cities in South Sumatra:
Palembang, Indralaya,
Prabumulih, Lahat, Baturaja, Pagar Alam, Tulung
Selapan and Lubuk Linggau. Thirteen variables
are stored (Gust, Perceipt, Humidity,
Time, Wind, Weather Forecast, Rain, Pressure, Feels,
Temperature, Cold, Date and Wind direction). The
Date variable is obtained by calling the date-time
function and converting the result to the Asia/Jakarta
time zone.
The second script collects data for the city of Palembang, taken
from the Sultan Mahmud Badaruddin II weather station,
from the timeanddate.com and weather.com websites. For
timeanddate.com, 11 variables are stored
(Dew_point, Fells_des, Humidity, Temperature,
Forecast, Wind, Weather Description, Hour, Visibility,
Pressure, Date). For weather.com,
8 variables are stored (Feels, Temperature,
Description, Hour, Date, Precept, Wind, Humidity).
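As a sketch of how a scraped record might be time-stamped in the Asia/Jakarta zone and appended to a results file, the example below uses only the standard library. It assumes a CSV target and a small hypothetical subset of the stored variables. The function names are illustrative and are not taken from the scripts used in this study. Because Western Indonesia Time is UTC+7 with no daylight saving, a fixed offset stands in for the named time zone.

```python
import csv
import io
from datetime import datetime, timedelta, timezone

# Western Indonesia Time (the Asia/Jakarta zone) is UTC+7 with no daylight
# saving, so a fixed offset is enough for this sketch.
WIB = timezone(timedelta(hours=7), "WIB")

# Hypothetical subset of the stored variables.
FIELDS = ["City", "Date", "Time", "Temperature", "Humidity"]


def stamp_record(city, values, now_utc=None):
    """Return a row dict with Date and Time set to local Jakarta time."""
    now_utc = now_utc or datetime.now(timezone.utc)
    local = now_utc.astimezone(WIB)
    row = {
        "City": city,
        "Date": local.strftime("%Y-%m-%d"),
        "Time": local.strftime("%H:%M"),
    }
    row.update(values)
    return row


def append_row(handle, row, write_header=False):
    """Append one row to an open CSV file-like object."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    if write_header:
        writer.writeheader()
    writer.writerow(row)


# Example with a fixed instant: 18:30 UTC is 01:30 on the next day in Jakarta.
buffer = io.StringIO()
row = stamp_record(
    "Palembang",
    {"Temperature": "27 C", "Humidity": "89%"},
    now_utc=datetime(2018, 5, 1, 18, 30, tzinfo=timezone.utc),
)
append_row(buffer, row, write_header=True)
print(buffer.getvalue())
```

In a real run the in-memory buffer would be replaced by a file opened in append mode, with the header written only when the file is first created.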
To run the crawling scripts automatically
every hour, the Python scripts are
hosted on the PythonAnywhere server
(https://www.pythonanywhere.com/), where a task
scheduler is created. This scheduler runs both scripts
automatically every hour.
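The study relies on the hosting service's task scheduler, which is set up on the service rather than in code. Where no such scheduler is available, an hourly run could be approximated in Python itself. The following is a hypothetical sketch of that alternative, not the mechanism used here; `run_hourly` and its parameters are illustrative names.

```python
from datetime import datetime, timedelta


def seconds_until_next_hour(now):
    """Seconds from `now` until the start of the next full hour."""
    next_hour = now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    return (next_hour - now).total_seconds()


def run_hourly(job, sleep, clock=datetime.now, max_runs=None):
    """Call `job` at the start of each hour.

    `sleep` and `clock` are injected so the loop can be tested without
    waiting; `max_runs=None` means run until the process is stopped.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        sleep(seconds_until_next_hour(clock()))
        job()
        runs += 1


# Typical use (scrape_all is a hypothetical function that runs both scripts):
#   import time
#   run_hourly(scrape_all, time.sleep)
```

Sleeping until the next full hour, rather than for a fixed 3600 seconds, keeps the runs aligned to the clock even if a scraping pass takes a variable amount of time.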
B. Data Extraction and Transformation
Fig. 2. Weather data extraction and transformation process for
worldweatheronline.com
The data files produced by the scraping
process cannot be used directly for analytics
because the scraped values include units. The format
of the data also needs to be adjusted. To facilitate
data extraction and transformation, the process is
carried out with Pentaho Data Integration (PDI; this study
uses version 5.0.1).
The ETL (Extract, Transform, Load) process performed for
the worldweatheronline.com weather data can be seen in Figure 2. The
steps of the ETL process are as follows:
The transformation for one city reads the
file produced by the scraping step. This
step reads the CSV file and changes the data
format as needed. The next step is data
cleaning (removing the units from the data variables).
A further step transforms the rain field data so that its
output is consistent with the chill field. The next step
adds the city description and latitude (if needed).
The end result is saved in the weathersumsel
file.
To run the entire transformation, a job
runs the transformation for each city in sequence.
The cities are processed sequentially rather than
simultaneously to avoid data conflicts caused by simultaneous
file access.
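In this study the steps above are built in Pentaho Data Integration. For readers without that tool, the same cleaning and merging logic can be sketched in plain Python. The column names, sample rows and function names below are hypothetical, and the regular expression is only one possible way to strip units.

```python
import csv
import io
import re

# Matches the first integer or decimal number in a string, e.g. "27" in "27 C".
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def strip_unit(value):
    """Return the first number in `value` as a float, or None if there is none."""
    match = NUMBER.search(value)
    return float(match.group()) if match else None


def transform_city(city, handle, numeric_fields):
    """Read one city's scraped CSV and return cleaned rows tagged with the city."""
    rows = []
    for raw in csv.DictReader(handle):
        row = dict(raw)
        for field in numeric_fields:
            row[field] = strip_unit(raw[field])
        row["City"] = city
        rows.append(row)
    return rows


def merge_cities(sources, numeric_fields):
    """Process cities one after another and concatenate their rows."""
    merged = []
    for city, handle in sources:
        merged.extend(transform_city(city, handle, numeric_fields))
    return merged


# Hypothetical scraped files, held in memory for the example.
palembang = io.StringIO("Date,Time,Temperature,Humidity\n2018-05-02,01:30,27 C,89%\n")
lahat = io.StringIO("Date,Time,Temperature,Humidity\n2018-05-02,01:30,24 C,93%\n")
rows = merge_cities(
    [("Palembang", palembang), ("Lahat", lahat)],
    numeric_fields=["Temperature", "Humidity"],
)
print(rows)
```

Processing the sources in a simple loop mirrors the sequential job described above: only one input file is read at a time.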
The ETL process produces the
weathersumsel.csv file, which holds the results for the eight
observed cities and is ready for use in data analytics. This
job can be combined with a job that downloads
the scraped files to the local computer on a schedule.
The job can be run automatically by a task scheduler; for
example, data can be added for
analytics every day at a set hour. The job
then runs automatically once the download
has finished.
The ETL process for the data scraped from weather.com
and timeanddate.com is similar to the process above
but simpler. Both files are only cleaned,
extracted and transformed, without merging.
C. Data Statistics and Analytic Process
The data collected by web scraping
can then be presented as statistics. The statistics
are produced with the Python programming language. Data can be
presented as needed by grouping. For
analytics, data can be presented as a
table, as in Figure 3, or as graphs and charts.
Fig. 3. Presentation Example of data in table form for analytics purposes
Statistical data in this study are also
presented as charts and graphs, as shown in Figure
4. Presentation as a histogram makes it easier
for users to see the weather statistics for each observed
city. In addition, the histogram can be presented in
daily, monthly and yearly form as required by grouping
on the Date variable.
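Grouping on the Date variable can be sketched as follows. The example uses only the standard library and assumes that dates are stored as `YYYY-MM-DD` strings, so that a prefix of the string identifies the day, month or year. The rows are made-up illustrative values, not measurements from the collected dataset, and `average_by_period` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean


def average_by_period(rows, field, period="day"):
    """Average `field` per city and per day, month, or year of the Date column."""
    # Prefix length of a YYYY-MM-DD string that identifies each period.
    width = {"day": 10, "month": 7, "year": 4}[period]
    groups = defaultdict(list)
    for row in rows:
        key = (row["City"], row["Date"][:width])
        groups[key].append(row[field])
    return {key: mean(values) for key, values in groups.items()}


# Made-up rows for illustration only.
rows = [
    {"City": "Palembang", "Date": "2018-05-01", "Temperature": 30.0},
    {"City": "Palembang", "Date": "2018-05-01", "Temperature": 32.0},
    {"City": "Palembang", "Date": "2018-05-02", "Temperature": 28.0},
    {"City": "Lahat", "Date": "2018-05-01", "Temperature": 25.0},
]
print(average_by_period(rows, "Temperature", period="day"))
print(average_by_period(rows, "Temperature", period="month"))
```

The resulting dictionary can be passed to any plotting library to draw the per-city charts.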
Fig. 4. Statistics Data presented in the form of Chart and Graph
This is a preliminary study that has only reached the stage of
collecting weather data in South Sumatra. The
analysis of the collected weather data is therefore still
limited to the presentation of statistics. Detailed analytics
such as weather prediction and weather
pattern mapping cannot yet be done, because the
data collection has run for only one month and the data
collected is still relatively small (5909 records). The data cannot
yet be used to make weather predictions and forecasts, because
prediction ideally requires a large amount of weather data to
produce accurate results. Data collection
with web scraping will
continue. Once enough data has been collected
(more than one year), the next stage of research
will build weather predictions using data
mining and machine learning approaches. The
data collected by web scraping will also be used to
study the relevance of weather patterns to decision making in
transportation and agriculture.
D. Legal Issue Aspect
The legality and fair use of web scraping
is often an issue. Two aspects must be considered
when scraping: copyright and entry
without permission [16].
The first aspect is copyright. In Craigslist Inc.
v. 3Taps Inc. (2013), a federal district court ruled that
scraping publicly available data is not copyright
infringement. In this study, the scraped
websites present their data publicly. In addition, this research
accesses the websites at regular intervals, does not burden them
(only 1 request per hour) and does not harm
the hosts of the accessed data.
The second aspect is entry without permission. The
scraped websites are publicly and freely accessible, and
the scraping does not act as a parasite on the data
hosts. The IP address used during the research was not a blocked
IP, access did not require bypassing any
proxy, and the data accessed was not encrypted.
From the standpoint of legality, the web scraping
process therefore does not violate anything. The
collected data are not used for commercial purposes but
for research. The research also does
not repackage or republish the data but analyzes it.
V. CONCLUSION AND FUTURE WORK
In this research web scraping was successfully
used to collect weather data for several cities in South
Sumatra. Data were collected
from several websites that present weather data. The
scraping scripts, written in Python,
collect data automatically and continuously every
hour. The resulting data is detailed and can be used
further for data analytics.
The current research is limited to
analytics in the form of statistical data because of the short
period of data collection so far (1 month).
Data collection will continue in order to produce a
weather dataset for cities in South Sumatra. Once a considerable
dataset has been collected, further research will be developed to
predict the weather and to study weather patterns for decision
making in agriculture and transportation.
ACKNOWLEDGMENT
The authors thank Ristekdikti, which funded this research.
REFERENCES
[1] C. Lesk, P. Rowhani, and N. Ramankutty, “Influence of extreme
weather disasters on global crop production,” Nature, vol. 529, no.
7584, pp. 84–87, Jan. 2016.
[2] J. H. Hashim and Z. Hashim, “Climate change, extreme weather events,
and human health implications in the Asia Pacific region,” Asia Pac. J.
Public Health, vol. 28, no. 2_suppl, pp. 8S–14S, 2016.
[3] G. J. Zheng et al., “Exploring the severe winter haze in Beijing: the
impact of synoptic weather, regional transport and heterogeneous
reactions,” Atmospheric Chem. Phys., vol. 15, no. 6, pp. 2969–2983,
Mar. 2015.
[4] M. Novkovic, M. Arsenovic, S. Sladojevic, A. Anderla, and D.
Stefanovic, “Data science applied to extract insights from data - weather
data influence on traffic accidents,” p. 7.
[5] M. Hebbert, “Climatology for city planning in historical perspective,”
Urban Clim., vol. 10, pp. 204–215, Dec. 2014.
[6] R. Mitchell, Web Scraping with Python: Collecting More Data from the
Modern Web. O’Reilly Media, Inc., 2018.
[7] J. I. Fernández Villamor, J. Blasco Garcia, C. A. Iglesias Fernandez, and
M. Garijo Ayestaran, “A semantic scraping model for web resources-
Applying linked data to web page screen scraping,” 2011.
[8] S. de S Sirisuriya, “A comparative study on web scraping,” 2015.
[9] R. C. Pereira and T. Vanitha, “Web Scraping of Social Networks,”
Int. J. Innov. Res. Comput. Commun. Eng., vol. 3, no. 7, pp. 237–240,
Oct. 2015.
[10] G. Barcaroli and D. Summa, “Using Internet as a Data Source for
Official Statistics: a Comparative Analysis of Web Scraping
Technologies,” p. 5.
[11] G. Boeing and P. Waddell, “New Insights into Rental Housing Markets
across the United States: Web Scraping and Analyzing Craigslist Rental
Listings,” J. Plan. Educ. Res., vol. 37, no. 4, pp. 457–476, Dec. 2017.
[12] N. R. Haddaway, “The Use of Web-scraping Software in Searching for
Grey Literature,” vol. 11, no. 3, p. 6, 2015.
[13] A. Josi and L. A. Abdillah, “PENERAPAN TEKNIK WEB SCRAPING
PADA MESIN PENCARI ARTIKEL ILMIAH,” p. 6.
[14] R. N. Landers, R. C. Brusso, K. J. Cavanaugh, and A. B. Collmus, “A
primer on theory-driven web scraping: Automatic extraction of big data
from the Internet for use in psychological research.,” Psychol. Methods,
vol. 21, no. 4, pp. 475–492, 2016.
[15] N. Marres and E. Weltevrede, “SCRAPING THE SOCIAL?: Issues in
live social research,” J. Cult. Econ., vol. 6, no. 3, pp. 313–335, Aug.
2013.
[16] J. Hirschey, “Symbiotic Relationships: Pragmatic Acceptance of Data
Scraping,” SSRN Electron. J., 2014.