Web Scraping Techniques to Collect Weather Data in South Sumatra
Yesi Novaria Kunang
Computer Science Department
Universitas Bina Darma
Susan Dian Purnamasari
Computer Science Department
Universitas Bina Darma
Abstract—Several cities in South Sumatra have weather measurement points owned by different institutions such as BMKG, Angkasa Pura, LAPAN and others. However, obtaining detailed, up-to-date weather data for a given period is constrained by the bureaucratic process at each institution, while the availability of weather datasets is needed for research in data analytics to predict the weather and for decision-support research that requires weather patterns. On the other hand, some websites provide real-time weather data for many cities. In this research we therefore use web scraping technology to collect weather data for several cities in and around South Sumatra from such websites. Web scraping is a technique for selectively retrieving the contents of a web page. The data collected with this technique will form a database or data warehouse that can be used in further research on weather-forecast data mining in South Sumatra, from which weather-based decision support applications can later be developed.
Keywords—web scraping; weather data; data extraction; weather dataset
I. INTRODUCTION
Weather forecasts have an important impact on humans, especially in the economic and social spheres. Collecting weather data allows us to analyze patterns in annual weather data that affect temperature and precipitation. Research utilizing weather patterns includes studies in agriculture, health, transportation, and city planning. Estimating the weather is not easy, because the weather is constantly changing and is influenced by the nature and dynamics of the atmosphere. Approaches to weather forecasting depend heavily on the observed data and on the forecasting procedures and methods used. To strengthen forecast analysis, data must be measured not at one point but at several points, so that the movement of the atmosphere, clouds, wind, and so on can be observed and the forecast becomes more valid.
South Sumatra itself has several weather measurement points owned by different agencies such as BMKG, Angkasa Pura, LAPAN and others. However, obtaining detailed, up-to-date weather data for a certain period is very difficult due to the bureaucratic process at each agency. On the other hand, sites such as https://weather.com and https://www.worldweatheronline.com provide updated weather data online.
This research therefore utilizes web scraping technology to collect data from sites that provide real-time data. Web scraping is a technique that makes it possible to extract data from multiple websites into one database or one spreadsheet, making the collected data easier to analyze and visualize. The data collected with this technique will form a database, or weather dataset. This is preliminary research to prepare the weather dataset that will be used in further work on data analytics for South Sumatra weather forecasting, as well as on weather patterns that can be utilized for decision support.
II. PREVIOUS WORK
Web scraping is a data collection technique in which a program interacts with an API (application programming interface). Web scraping is mostly done by creating programs that automatically run queries against a web server, request data (usually HTML and other forms of web pages), and then parse that data to extract the necessary information. In practice, web scraping draws on various programming techniques and technologies, such as data analysis, natural language parsing, and information security.
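The request-then-parse cycle described above can be sketched in a few lines. Beautiful Soup (used later in this paper) offers a richer API; the dependency-free sketch below uses only the Python standard library, and the HTML snippet and the "temp"/"humidity" class names are hypothetical. A real scraper would fetch the page over HTTP instead of using SAMPLE_HTML.

```python
from html.parser import HTMLParser

# Hypothetical fragment of a weather page; a real run would download it.
SAMPLE_HTML = """
<div class="temp">31°C</div>
<div class="humidity">74%</div>
"""

class ClassTextExtractor(HTMLParser):
    """Collects the text of every tag carrying a given class
    (simplified: assumes no nested tags inside the target element)."""
    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self._inside = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self._inside = True

    def handle_endtag(self, tag):
        self._inside = False

    def handle_data(self, data):
        if self._inside and data.strip():
            self.values.append(data.strip())

def extract_values(html, css_class):
    """Parse the HTML and return the text of all elements with css_class."""
    parser = ClassTextExtractor(css_class)
    parser.feed(html)
    return parser.values

print(extract_values(SAMPLE_HTML, "temp"))      # → ['31°C']
print(extract_values(SAMPLE_HTML, "humidity"))  # → ['74%']
```

With Beautiful Soup the same extraction collapses to roughly `soup.find_all(class_="temp")`, which is why the libraries named later in this paper are the usual choice.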
Several studies describe web scraping techniques. Fernández et al. discuss various aspects of web scraping semantics and the approach undertaken. Other studies discuss tools and techniques that can be used to run web scraping; many such tools are available, mostly free and easy to use. Johnson and Gupta examine web content mining techniques and relate them to existing scraping tools; their paper discusses data mining topics at length, exposes a taxonomy of web mining, and compares it with web scraping.
The use of web scraping is itself widely discussed in several papers. Pereira and Vanitha discuss the tools and techniques used for scraping and their impact on social networks. The technique has also been used to collect rental listing data from the Craigslist website; the collected data were used to analyze the housing market, urban dynamics, and human behavior. Polidoro et al. applied web scraping to consumer price surveys of consumer electronics products (goods) and airfares; their results show that obtaining statistical data with web scraping saves time.
Web scraping techniques have also been used to search for literature, as in the Haddaway study, which uses import.io to search grey literature. Similar research was conducted by Josi et al., who performed article searches with web scraping on the Garuda, ISJD, and Google Scholar portals. Landers et al. searched for specific research data in the field of psychology, while Marres and Weltevrede used scraping techniques to examine issues in live social research.
The use of web scraping in connection with weather data was carried out by Novkovic et al. In their research, web scraping was used to collect traffic accident data spanning 15 years and to relate it to meteorological data. Their data mining results show links between several weather variables and the rate of traffic accidents.
III. WEB SCRAPING WEATHER DATA
Fig. 1. Web Scraping Weather Data Process
The process of web scraping weather data can be seen in Figure 1 and in detail consists of the following stages:
- First, study the structure of the HTML documents of every website to be scraped. This is done to identify the data and the elements to be retrieved or stored.
- Next, create a crawling program, written as a Python script using the Beautiful Soup and Requests libraries. The results of the scraping are stored in an Excel file.
- Create a task scheduler to run the scraping script periodically every hour. The scheduler automatically scrapes data from every website and saves it into the results file.
- Next, extract the crawled data. The extraction is done with the Pentaho Kettle tool. The obtained data are cleaned to remove unnecessary information, such as the units attached to the stored variables; transformed to adjust formats and data structures as needed (e.g., city data, time format, date); and merged, unifying the scraped files into one file to facilitate analysis.
- Finally, produce statistics on the collected weather data and analyze them according to the needs of the application.
A. Weather Data Web Scraping Process
The weather web scraping application was developed using HTML parsing, written in the Python programming language on the Anaconda platform running on the Windows 10 operating system. The scripts use the Beautiful Soup 4 library (https://www.crummy.com/software/BeautifulSoup/) and the Requests library. The scraped sites are https://www.worldweatheronline.com, https://www.timeanddate.com, and https://weather.com. Two scripts were created to perform the data scraping:
- The first script scrapes the data on the worldweatheronline.com website for eight cities in South Sumatera: Palembang, Indralaya, Prabumulih, Lahat, Baturaja, Pagar Alam, Tulung Selapan, and Lubuk Linggau. Thirteen variables are stored (Gust, Precipitation, Humidity, Time, Wind, Weather Forecast, Rain, Pressure, Feels, Temperature, Cold, Date, and Wind direction). The Date variable is obtained by running the datetime function, converted to local time based on the Asia/… timezone.
- The second script collects data for the city of Palembang, obtained from the Sultan Mahmud Badaruddin II weather station, from the timeanddate.com and weather.com websites. For the timeanddate.com website, eleven variables are stored (Dew_point, Feels_des, Humidity, Temperature, Forecast, Wind, Weather Description, Hour, Visibility, Pressure, Date), while on the weather.com website eight variables are stored (Feels, Temperature, Description, Hour, Date, Precipitation, Wind, Humidity).
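Each hourly scrape yields one record per city, appended to the result file. The paper stores results in an Excel file; the sketch below uses a csv file and `csv.DictWriter` to stay standard-library only, and the field names follow the variables listed above while the example record is fabricated for illustration.

```python
import csv
import io

# Field names taken from the variables listed in the text for the first script.
FIELDS = ["Date", "Time", "Temperature", "Feels", "Humidity", "Pressure",
          "Wind", "Wind direction", "Gust", "Rain", "Precipitation",
          "Weather Forecast", "Cold"]

def append_record(f, record, write_header=False):
    """Append one scraped record to an open CSV file-like object.
    Missing variables are written as empty cells."""
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    if write_header:
        writer.writeheader()
    writer.writerow(record)

# Demonstration on an in-memory buffer; a real run would open the results file
# in append mode once per hourly scrape.
buf = io.StringIO()
append_record(buf, {"Date": "2018-05-01", "Time": "13:00",
                    "Temperature": "31", "Humidity": "74"}, write_header=True)
print(buf.getvalue())
```

Writing one row per scrape keeps the file append-only, so an interrupted run never corrupts earlier records.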
To run the crawling scripts automatically and periodically every hour, the researchers hosted the Python scripts on PythonAnywhere (https://www.pythonanywhere.com/). To run the scripts continuously every hour, a task scheduler is created on the hosting service; this task scheduler runs both scripts automatically every hour.
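The study relies on PythonAnywhere's scheduled tasks to fire the scripts hourly. Where such a scheduler is unavailable, a self-hosted fallback is a loop that sleeps until the next full hour; the `run_scrapers` callable below is a placeholder for the two scraping scripts, not part of the paper's code.

```python
import time
from datetime import datetime, timedelta

def seconds_until_next_hour(now=None):
    """Return the number of seconds remaining until the next full hour."""
    now = now or datetime.now()
    next_hour = (now.replace(minute=0, second=0, microsecond=0)
                 + timedelta(hours=1))
    return (next_hour - now).total_seconds()

def run_forever(run_scrapers):
    """Sleep until each full hour, then execute both scraping scripts."""
    while True:
        time.sleep(seconds_until_next_hour())
        run_scrapers()

print(seconds_until_next_hour(datetime(2018, 5, 1, 13, 45, 0)))  # → 900.0
```

Sleeping until the top of the hour, rather than a fixed `sleep(3600)`, keeps the scrape times aligned with the hourly timestamps stored in the dataset even if one run takes longer than usual.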
B. Data Extraction and Transformation
Fig. 2. Weather Data Extraction and Transformation Process
The data files obtained from the scraping process cannot be used directly for analytics, because the captured variables still carry their units; in addition, the format of the data needs to be adjusted. To facilitate the data transformation and extraction, the process is carried out with the Pentaho Data Integration tool (this study uses PDI version 5.0.1).
The ETL (Extract, Transform, Load) process performed on the online weather data can be seen in Figure 2 and consists of the following steps:
- The transformation process for one city reads the file produced by the data extraction. This step reads the csv file and then changes the data format as needed. The next step performs data cleaning (removing the units from the data variables), followed by a transformation of the rain field data to obtain output consistent with the chill variable. Descriptions of the city and its latitude are then added (if needed), and the end result is saved to the weathersumsel file.
- To run the entire transformation process, a job is created that runs the transformation for every city in sequence. The process is done sequentially rather than simultaneously to avoid data conflicts caused by concurrent file access.
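The cleaning step is performed with Pentaho Data Integration in this study; an equivalent unit-stripping step expressed in Python might look like the sketch below, where the unit formats ("°C", "%", "km/h") are assumptions about how the scraped values are written.

```python
import re

def strip_unit(value):
    """Return the leading numeric part of a value such as '31 °C',
    '74%' or '11 km/h'; return None if no number is found."""
    match = re.match(r"\s*(-?\d+(?:\.\d+)?)", str(value))
    return float(match.group(1)) if match else None

# Hypothetical scraped record before cleaning.
row = {"Temperature": "31 °C", "Humidity": "74%", "Wind": "11 km/h"}
cleaned = {k: strip_unit(v) for k, v in row.items()}
print(cleaned)  # → {'Temperature': 31.0, 'Humidity': 74.0, 'Wind': 11.0}
```

Converting to floats at this stage is what makes the later grouping and averaging of the variables possible.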
The end result of this ETL process is the weathersumsel.csv file, the result file for the eight observed cities, ready for use in the data analytics process. This job can itself be combined with a job that downloads the scraped files to the local computer on a scheduled basis.
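The merge step that unifies the per-city result files into one dataset could be sketched as follows; the file contents and the added City column are illustrative assumptions, and the in-memory buffers stand in for the per-city csv files.

```python
import csv
import io

def merge_city_files(city_files, out):
    """Concatenate per-city CSV files into one output file, tagging each
    row with its city. city_files maps city name -> open CSV file."""
    writer = None
    for city, f in city_files.items():
        for row in csv.DictReader(f):
            row["City"] = city
            if writer is None:
                # Header is derived from the first row (plus the City tag).
                writer = csv.DictWriter(out, fieldnames=list(row.keys()))
                writer.writeheader()
            writer.writerow(row)

# Two hypothetical per-city result files, merged sequentially.
palembang = io.StringIO("Date,Temperature\n2018-05-01,31\n")
lahat = io.StringIO("Date,Temperature\n2018-05-01,28\n")
out = io.StringIO()
merge_city_files({"Palembang": palembang, "Lahat": lahat}, out)
print(out.getvalue())
```

Processing the files one after another mirrors the sequential job described above, which avoids two transformations contending for the same output file.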
The job can run automatically with a task scheduler; for example, data for analytics purposes can be added every day at certain hours, with the job running automatically after the downloaded files have been extracted.
The ETL process for the data scraped from weather.com and timeanddate.com is similar to the process above, but simpler: both files only undergo the cleaning, extraction, and transformation steps, without file merging.
C. Data Statistics and Analytics Process
The data collected with web scraping techniques can then be presented in the form of statistics. The statistics are produced using the Python programming language. Data can be presented as needed by grouping them; for analytics purposes, data can be presented in table form, as in Figure 3, or in the form of graphs and charts.
Fig. 3. Example presentation of data in table form for analytics purposes
Other statistical data in this study are also presented in the form of charts and graphs, as shown in Figure 4. Presentation as a histogram makes it easier for users to see the weather statistics of each observed city. In addition, the histogram can be presented in daily, monthly, and yearly form as required, using the group-by function on the Date variable.
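Grouping on the Date variable to produce daily statistics, as described above, can be sketched with the standard library; the records below are fabricated samples, and the paper does not specify which Python libraries compute its statistics.

```python
from collections import defaultdict

def daily_mean(records, field="Temperature"):
    """Group records by their Date value and average the given numeric field."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["Date"]].append(float(rec[field]))
    return {date: sum(vals) / len(vals) for date, vals in groups.items()}

# Fabricated sample of cleaned records.
records = [
    {"Date": "2018-05-01", "Temperature": "31"},
    {"Date": "2018-05-01", "Temperature": "29"},
    {"Date": "2018-05-02", "Temperature": "30"},
]
print(daily_mean(records))  # → {'2018-05-01': 30.0, '2018-05-02': 30.0}
```

Monthly or yearly views follow by grouping on a truncated key (e.g. the `YYYY-MM` prefix of Date) instead of the full date.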
Fig. 4. Statistical data presented in the form of charts and graphs
This study itself is preliminary work that has reached the stage of collecting weather data in South Sumatra; the analytics performed on the collected data are therefore still limited to the presentation of statistics. Detailed analytics, for example weather prediction or weather-pattern mapping, cannot yet be done, because the data collection period has run for only one month, so the data collected are still relatively few (5,909 records). These data cannot yet be used for predictions and weather forecasts, because the prediction process ideally uses a large amount of weather data to produce accurate results. In the future, data collection with web scraping techniques will continue to run, and once enough data have been collected (more than one year), the next stage of the research will build weather predictions using data mining and machine learning approaches. The collected data will also be used to study the relevance of weather patterns for decision making in the fields of transportation and agriculture.
D. Legal Issues
The legality and fair use of web scraping techniques are often an issue. There are two aspects to consider when applying web scraping: copyright, and entry without permission.
- First, regarding copyright: in Craigslist Inc. v. 3Taps Inc. (2013), a federal district court ruled that scraping publicly available data is not a copyright infringement. In this study, the scraped websites present their data publicly; moreover, the research accesses them only on a regular schedule, does not burden the websites (only one request per hour), and does not damage the hosts of the accessed data.
- For the second aspect, entry without permission: in the web scraping performed here, the websites are publicly and freely accessible, and the scraper is not a parasite on the data hosts. The IP address used during the research is not a blocked IP, access is obtained without bypassing any proxy, and the accessed data are not encrypted. From the aspect of legality, therefore, the web scraping process performed here does not violate anything. The collected data are used not for commercial purposes but for research, and the research does not repackage or republish the data; they are only analyzed.
V. CONCLUSION AND FUTURE WORK
In this research, web scraping techniques were successfully used to collect weather data for several cities in South Sumatra. The web scraping was performed to collect data from several websites that present weather data. The web scraping model, implemented in the Python programming language, manages to collect data automatically and continuously every hour, and the resulting data are detailed enough to be used further for data analytics.
A limitation of the current research is that the analytics produced so far consist only of statistical data, owing to the short data collection period (one month). The data collection process will continue to run, producing a weather dataset for the cities of South Sumatra. Once a considerable dataset has been collected, research will be developed to predict the weather, as well as to study weather patterns for decision making in agriculture and transportation.
ACKNOWLEDGMENT
The authors thank Ristekdikti, which funded this research.
REFERENCES
[1] C. Lesk, P. Rowhani, and N. Ramankutty, "Influence of extreme weather disasters on global crop production," Nature, vol. 529, no. 7584, pp. 84–87, Jan. 2016.
[2] J. H. Hashim and Z. Hashim, "Climate change, extreme weather events, and human health implications in the Asia Pacific region," Asia Pac. J. Public Health, vol. 28, no. 2_suppl, pp. 8S–14S, 2016.
[3] G. J. Zheng et al., "Exploring the severe winter haze in Beijing: the impact of synoptic weather, regional transport and heterogeneous reactions," Atmospheric Chem. Phys., vol. 15, no. 6, pp. 2969–2983.
[4] M. Novkovic, M. Arsenovic, S. Sladojevic, A. Anderla, and D. Stefanovic, "Data science applied to extract insights from data - weather data influence on traffic accidents," p. 7.
[5] M. Hebbert, "Climatology for city planning in historical perspective," Urban Clim., vol. 10, pp. 204–215, Dec. 2014.
[6] R. Mitchell, Web Scraping with Python: Collecting More Data from the Modern Web. O'Reilly Media, Inc., 2018.
[7] J. I. Fernández Villamor, J. Blasco Garcia, C. A. Iglesias Fernandez, and M. Garijo Ayestaran, "A semantic scraping model for web resources - Applying linked data to web page screen scraping," 2011.
[8] S. de S. Sirisuriya, "A comparative study on web scraping," 2015.
[9] R. C. Pereira and T. Vanitha, "Web Scraping of Social Networks," Int. J. Innov. Res. Comput. Commun. Eng., vol. 3, no. 7, pp. 237–240.
[10] G. Barcaroli and D. Summa, "Using Internet as a Data Source for Official Statistics: a Comparative Analysis of Web Scraping Technologies," p. 5.
[11] G. Boeing and P. Waddell, "New Insights into Rental Housing Markets across the United States: Web Scraping and Analyzing Craigslist Rental Listings," J. Plan. Educ. Res., vol. 37, no. 4, pp. 457–476, Dec. 2017.
[12] N. R. Haddaway, "The Use of Web-scraping Software in Searching for Grey Literature," vol. 11, no. 3, p. 6, 2015.
[13] A. Josi and L. A. Abdillah, "Penerapan teknik web scraping pada mesin pencari artikel ilmiah [Application of web scraping techniques in scientific article search engines]," p. 6.
[14] R. N. Landers, R. C. Brusso, K. J. Cavanaugh, and A. B. Collmus, "A primer on theory-driven web scraping: Automatic extraction of big data from the Internet for use in psychological research," Psychol. Methods, vol. 21, no. 4, pp. 475–492, 2016.
[15] N. Marres and E. Weltevrede, "Scraping the social?: Issues in live social research," J. Cult. Econ., vol. 6, no. 3, pp. 313–335, Aug.
[16] J. Hirschey, "Symbiotic Relationships: Pragmatic Acceptance of Data Scraping," SSRN Electron. J., 2014.