
Context-Based Knowledge Discovery and Querying for Social Media Data

Authors:
Jedsada Phengsuwan1, Nipun Balan Thekkummal1, Tejal Shah1, Philip James2, Dhavalkumar Thakker3,
Rui Sun1, Divya Pullarkatt4, T. Hemalatha4, Maneesha Vinodini Ramesh4, and Rajiv Ranjan1
1School of Computing, Newcastle University, Newcastle Upon Tyne, UK
2School of Engineering, Newcastle University, Newcastle Upon Tyne, UK
3School of Electrical Engineering and Computer Science University of Bradford, Bradford, UK
4Amrita Center for Wireless Networks Applications (AmritaWNA), Amrita School of Engineering,
Amritapuri, Amrita Vishwa Vidyapeetham, India
Abstract
Modern Early Warning Systems (EWS) rely on scientific
methods to analyse a variety of Earth Observation (EO)
and ancillary data provided by multiple and heterogeneous
data sources for the prediction and monitoring of hazard
events. Furthermore, through social media, the general
public can also contribute to the monitoring by reporting
warning signs related to hazardous events. However, the
warning signs reported by people require additional pro-
cessing to verify the possibility of the occurrence of haz-
ards. Such processing requires potential data sources to
be discovered and accessed. However, the complexity and
high variety of these data sources make this particularly challenging. Moreover, sophisticated domain knowledge of natural hazards and risk management is also required to enable dynamic and timely decision making about serious hazards. In this paper we propose a data integration and analytics system which allows social media users to contribute to hazard monitoring and supports decision making for hazard prediction. We prototype the system using landslides
as an example hazard. Essentially, the system consists of
background knowledge about landslides as well as infor-
mation about data sources to facilitate the process of data
integration and analysis. The system also includes an interactive agent that allows social media users to report
their observations. Using the knowledge modelled within
the system, the agent can raise an alert about a potential
occurrence of landslides and perform new processes using
the data sources suggested by the knowledge base to verify
the event.
Keywords— early warning system; landslide hazard;
high variety data; IoT; ontology; data sources discovery
1 Introduction
The application of Early Warning Systems (EWS) to pre-
dict natural hazards and vulnerabilities plays a vital role in
preventing loss of life and damage to property. For effective
and timely decision-making, EWS requires a strong techni-
cal underpinning and sophisticated knowledge of the natural
hazards and risk management. Landslides are a commonly
occurring natural hazard with global impact and are closely
linked to many other natural hazards such as earthquakes,
storms, flooding and volcanic eruptions. Predicting individ-
ual occurrences of landslide events is complex as it depends
upon many local factors, variables and anthropogenic input.
For predicting and monitoring landslides, decision makers
use scientific models to analyse Earth Observation (EO)
data from satellites and ancillary data produced by Internet-
of-Things (IoT) sensors deployed in landslide-prone areas.
In addition, the ancillary data includes sensor data, social
media data and data from other sources, which are essential
for the prediction and monitoring of hazard events. Such
EO and sensor data used for the analysis are usually ob-
tained from multiple and heterogeneous data sources. Fur-
thermore, through social media channels (Facebook, Twit-
ter, Instagram etc.) the general public can also contribute to
landslide monitoring by reporting observations that could
be warning signs for landslides. However, decision-makers
need to verify the detected events reported from social me-
dia by analysing sensor or other corroborating data from the
area of interest. Hence, EO and data representing the event
in the area of interest are essential for the verification.
The high volume and variety of data produced by multi-

*&&&UI*OUFSOBUJPOBM$POGFSFODFPO*OGPSNBUJPO3FVTFBOE*OUFHSBUJPOGPS%BUB4DJFODF*3*
¥*&&&
%0**3*
Authorized licensed use limited to: Newcastle University. Downloaded on September 28,2020 at 23:43:52 UTC from IEEE Xplore. Restrictions apply.
ple and heterogeneous data sources (such as sources of so-
cial media and IoT sensors data) means that the most rele-
vant sources need to be identified for analysis. A common
knowledge base that captures and formally represents the core concepts (such as hazards, warning signs and data sources) and the relationships between them can be used to discover potential data sources for analysis; it is therefore a critical component of an EWS with complex data processing requirements. Moreover, cross-domain analysis of data is not only useful for the discovery of event correlations but also allows for additional event verification. The process
of landslide prediction and monitoring in an EWS leads to challenges in: (i) knowledge creation and discovery — building a formal knowledge base from the large volume of user-generated micro-texts on social media, and exploring meaningful information within it to support decision making; (ii) ad-hoc data analysis — the verification process depends on the context and completeness of the information provided by users; and (iii) data sources discovery — sensor types are heterogeneous, and sensor data streams exhibit a wide spectrum of performance and accuracy characteristics.
The main contribution of this work is a formal knowl-
edge base of landslide domain concepts to enable the in-
tegration of EO and ancillary data from multiple hetero-
geneous sources for dynamic verification and early pre-
diction of landslide events based on context and informa-
tion reported by social media users. The underpinning of
this knowledge base is the Landslip ontology that captures
the correlations between landslides, multi-hazards, warning
signs, sensor data and other data sources. The purpose of
the ontology is to facilitate data discovery, which will be
used to find potential data sources for landslide verification.
Another contribution of this paper is Onto-DIAS, a proto-
type system that utilises the knowledge base to support data
integration and analytics in landslide early warning applica-
tions.
The rest of this paper is organised as follows: related
work is discussed next in section 2, followed by the details
of the Landslip ontology in section 3. The population of the knowledge base from social media content is described in section 4. The system architecture and imple-
mentation are described in sections 5 and 6 respectively. Fi-
nally, the paper is concluded in section 7.
2 Related Work
In this section, we present related work on data integra-
tion and analytics, semantic web technologies, and natural
language processing in multi-hazard EWS.
2.1 Data Integration and Analytics in Multi-hazard Early Warning Systems
EWS have played a significant role in natural hazard
management to minimise loss of life and damage to prop-
erty. Additionally, several modern multi-hazard EWS take advantage of various types of data sources, including remote
sensing satellites, IoT sensors, and social media. Due to
the heterogeneity of the data sources, data integration be-
comes a vital part of EWS to provide high-quality data for
the effective prediction of hazard events. Several works
on data integration and analysis for multi-hazard management have been proposed in the literature. A data analytic framework for Twitter data was proposed in [35] to identify Twitter messages that are related to a particular type of disaster (e.g. earthquake, flood, and wildfire); several methods for identifying relevant tweets, including matching-based and learning-based approaches, are also evaluated. In [16], the authors describe a
study on the identification of peatland fires and haze events
in Sumatra Island, Indonesia, by using social media data. A
data classification algorithm is applied to analyse the tweets,
and the outcomes are verified by using hot spot and air qual-
ity data from NASA satellite imagery. The authors in [33]
propose an information infrastructure for timely delivery of
social media and crowd-sourcing messages (from the Ushahidi platform) to potentially responsible departments during the
Haiti earthquake in 2010. A data classification algorithm
is used to provide an automatic classification mechanism
over the messages. A decision support system that inte-
grates crowd-sourcing data with Wireless Sensor Networks
(WSN) to widen the coverage of the monitoring area for
flood risk management in Brazil is proposed in [14]. The
Open Geospatial Consortium (OGC) standards are used in
the research to aid in the integration of the crowd-sourced
data.
2.2 Semantic Web Technologies for Multi-hazard Management
Semantic Web technologies have been widely used in
hazard management to represent knowledge of hazards and
their relation to EO and ancillary data. The application of
ontology for hazard management includes hazard assess-
ment and urbanisation analysis. Here, the Semantic Sen-
sor Network Ontology (SSN) [30] and the Semantic Web
for Earth and Environmental Terminology (SWEET) [32]
are two significant ontologies that can be applied for hazard
management. In [32], the authors reuse SWEET to con-
ceptualise knowledge and expertise in several areas, such as
buried assets (e.g. pipes and cables), roads, soil, the nat-
ural environment and human activities. Additionally, they
propose to use the Ontology of Soil Properties and Process
(OSP) to describe soil properties such as soil strength and

processes such as soil compaction. The OSP and other con-
cepts are used to express how they affect each other in asset
maintenance activities. The SSN ontology is used in [30]
and [7] for wind monitoring. The former uses SSN with the Ontology for Quantity Kinds and Units (QU) [21] to conceptualise wind
properties (e.g. wind speed and direction) whereas the latter
uses SSN and SWEET to model the concepts of wind sensor
and data streams of wind observation. In [25], the authors
extend the SSN ontology to create an ontology for land-
slides including relevant concepts such as earthquake, geo-
graphical unit, soil, precipitation, and wind. Even though
these ontologies provide general concepts for sensor data
and hazard events, they cannot describe concepts of human
sensors (e.g. social media data). Further processes are
needed when applying these ontologies to EWS for multi-
hazard applications.
2.3 Natural Language Processing
Natural Language Processing (NLP) is a set of infor-
mation engineering techniques which enables computers
to process and make sense of human (natural) languages.
NLP techniques have evolved from complex handwritten rules to models trained using machine learning. Earlier machine learning techniques, such as decision trees [4, 23], generated rules similar to handwritten ones. The application of NLP in this work is to extract useful information from natural language text and to classify the content into different topics of interest. Language modelling techniques apply a probability distribution over a sequence of words; unigram, n-gram [6, 8], exponential and neural network [26, 5, 18] models are the main types of language model in use. Recent studies report high accuracy in classifying natural language using neural networks. A unified architecture for NLP using deep learning was introduced by NEC Labs in [9]; in that work, an input sentence is processed to perform part-of-speech tagging, chunking, named entity tagging, semantic role labelling, etc. using a language model and a CNN. A study [17] at New York University reports a series of experiments in which a Convolutional Neural Network (CNN) trained on top of pre-trained word vectors showed significant performance improvements on several sentence classification tasks. Over the years, several open-source NLP projects, such as NLTK [22], CoreNLP [24], spaCy [1] and GATE [11], have gained the interest of both academia and industry. While these methods and tools support natural language processing, building knowledge from natural language poses several challenges.
3 Landslip Ontology
Earth Observation (EO) and ancillary data provided by
multiple data sources contribute to the high variety of data source characteristics in an EWS. In addition, such data
sources differ in terms of: (i) data type — the different types
of EO and ancillary data used in a particular analysis, e.g.
soil properties, temperature, humidity; (ii) data storage —
the difference in methods for collecting and organising data,
e.g. RDBMS, NoSQL database, and distributed file sys-
tem; and (iii) data access — data sources are accessible by
different methods, ranging from direct access through data
stores (e.g. JDBC) to standard Web Services (e.g. OGC,
SOAP, RESTful). These differences make the discovery, ac-
cess and integration of data in EWS quite challenging. A
formal semantic representation of the data sources, domain
knowledge about natural hazards and the relationship be-
tween them can help address the challenges arising from
such differences and enable data integration and analysis.
The knowledge base for data integration and analysis is
designed and developed based on the principles of Seman-
tic Web Technology [38]. A core component of the knowl-
edge base is an ontology, which can be defined as "a formal, explicit specification of a shared conceptualisation" [34]. That is, an ontology models the agreed knowledge
about the real world through explicitly defined concepts and
constraints on them that are machine readable. In this work,
we have developed an ontology, namely the Landslip Ontol-
ogy, which is based on OWL 2 [37]. The Landslip Ontology
contains knowledge on the relationships between landslides
and data sources for EO and other data. Additionally, it contains the domain knowledge describing the interaction of landslide events with other hazards and the warning signs, which can be precursors to landslides. Figure 1 shows
the steps of the Landslip ontology development. The pri-
mary knowledge sources for designing the ontology came
from interviews with four scientists and experts in land-
slide hazard management with an average of 10 years of
experience between them. In addition, publications [12, 13]
and standard specifications [30, 32, 27, 28, 29] involving
multi-hazards and geospatial data models were also used
as knowledge sources to design the ontology. A scenario-
based approach [19] was used to define a narrative repre-
senting expected uses of an EWS in the domain of interest
from the viewpoint of both domain experts as well as on-
tology developers. The scenario helps to define the scope
of the domain ontology to be designed and frame compe-
tency questions to model the domain knowledge as well as
for evaluating the ontology [3, 36]. The Landslip ontol-
ogy is developed and imported into a triple store as a model for building the knowledge base for the landslide multi-hazard EWS. Finally, a knowledge API is developed, which pro-
vides the consumer with an access point to the knowledge

base.

Figure 1. Landslip Ontology development process
Figure 2 depicts a section of the Landslip Ontology con-
sisting of domain concepts and the relationships between
the concepts. The ontology comprises two modules: (i) Landslide Hazard Ontology – which defines concepts about landslides, their interactions with other hazards and the corresponding warning signs; and (ii) Data Sources Ontology – which defines concepts about observations and data sources for
landslide hazard risk assessment. The Landslip ontology
reuses the SSN ontology as well as terminologies defined in
OGC standards (e.g. Observations and Measurements [27],
SensorML [28] and SOS [29]).
The ontology was evaluated for consistency and correct-
ness through competency questions and a set of synthesised
data that represent the use case(s) of landslide hazard. The
competency questions were defined in SPARQL [39] and
the knowledge base queried for answers. An example competency question is "What are the data sources and their metadata to observe a set of hazards H1, H2, ..., Hn?" and the corre-
sponding SPARQL query to answer this question is defined
as follows:
SELECT ?hazard ?observation ?observedProperty ?dataSource
       ?metadata ?profile ?predicate ?_value
WHERE {
    ?observation :isObservationFor ?hazard .
    ?observedProperty :isObservedPropertyFor ?observation .
    ?dataSource :isDataSourceFor ?observedProperty .
    ?dataSource :hasDataSourceMetadata ?metadata .
    ?metadata :hasProfile ?profile .
    ?profile ?predicate ?_value .
    VALUES (?hazard) { (:flood_1) (:landslide_1) }
    FILTER (?predicate != rdf:type)
}
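For concreteness, the following minimal sketch shows how such a competency question could be executed over the knowledge base from Python using the rdflib library. The file name and namespace URI are illustrative assumptions, not part of the deployed system.

import rdflib

# Load the Landslip ontology and its individuals into an in-memory
# graph (the file name and namespace below are hypothetical).
g = rdflib.Graph()
g.parse("landslip.ttl", format="turtle")

query = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX : <http://example.org/landslip#>
SELECT ?hazard ?dataSource ?metadata
WHERE {
    ?observation :isObservationFor ?hazard .
    ?observedProperty :isObservedPropertyFor ?observation .
    ?dataSource :isDataSourceFor ?observedProperty .
    ?dataSource :hasDataSourceMetadata ?metadata .
    VALUES (?hazard) { (:flood_1) (:landslide_1) }
}
"""

# Each result row suggests a data source relevant to a listed hazard.
for row in g.query(query):
    print(row.hazard, row.dataSource, row.metadata)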
4 Populating the Knowledge Base from So-
cial Media Content
The Landslip Ontology is a conceptual model that for-
mally represents domain knowledge about landslides cap-
tured from domain experts in natural hazard management. The ontology consists of concepts and relationships but does not model concrete objects, or named individuals, that represent actual landslide events. With the emergence of social media as a potential resource for building the domain knowledge, social media content representing actual landslide events is dynamically instantiated within the ontology. In order to do this, however, sophisticated techniques
are required to understand the context of the social media
content and extract information from the content to cre-
ate individuals based on the conceptual model. Figure 3
shows the process of populating the knowledge base from
social media content to facilitate landslide-related knowl-
edge discovery in an EWS. Historical data related to past
events of landslides collected from social media platforms
are also added to the knowledge base. The knowledge
base thus consists of a set of synthesised data for hazards
(landslides, floods), warning signs (e.g. leaning light pole,
blocked road), and EO and ancillary data (e.g. water poten-
tial, moisture and temperature). Due to the wide variety of
information provided in each text from the collected social
media data, classification of the text into the topics of inter-
est and extraction of useful information from the content are
required. One of the most critical challenges while dealing
with user-generated content is to capture the semantics of
the content using Natural Language Processing (NLP)/ Nat-
ural Language Understanding (NLU) techniques. Our pro-
totype system is designed with two modules to achieve this
task: A data classification/topic detection module for social
media content classification and a data extraction module to
extract useful information, which can be used for instanti-
ating objects in the ontology. The modules are described
next.
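Before turning to the two modules, the following minimal sketch illustrates the instantiation step itself using the rdflib Python library; the namespace, class and property names are hypothetical stand-ins for the concepts shown in Figure 2.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

# Hypothetical namespace, class and property names, for illustration only.
LS = Namespace("http://example.org/landslip#")

g = Graph()
g.bind("ls", LS)

# A named individual representing one warning sign reported by a user.
report = LS["warningSign_20190801_001"]
g.add((report, RDF.type, LS.LeaningPole))
g.add((report, LS.observedAt,
       Literal("2019-08-01T20:00:00", datatype=XSD.dateTime)))
g.add((report, LS.hasLocation, Literal("Hill Cart Road")))

# Persist the new individual so it can be queried alongside the ontology.
g.serialize("knowledge_base_update.ttl", format="turtle")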
4.1 Social Media Content Classification
The first step to process the user text is to identify the
hazard to which the user is referring on social media. Recurrent Neural Networks (RNNs) [20], such as Long Short-Term Memory (LSTM) networks, and Convolutional Neural Networks (CNNs) are widely used in text classification. In this work, we use a CNN, a deep learning technique [17],
to perform text classification.

Figure 2. Section of Landslip Ontology.

Figure 3. Process of populating the Knowledge Base from social media content

A model custom-trained on weather-related text is used. The model train-
ing and inference processes are explained in the following
sections.
4.1.1 Model
A classification hierarchy, as shown in Figure 4, has been
defined for our prototype system. A model is trained us-
ing 1000 user-generated texts related to hazards, which are
labelled with different hazard events and warning signs (for
example, Events: Flood, Heavy Rainfall, Snow etc.; Warn-
ing Signs: Leaning Light Pole, Water Discolouration etc.).
Each class has around 200 records. TensorFlow [2], an
open source library, was used for data preparation, train-
ing and inference. The model is similar to the one proposed
by Kim Yoon in his work Convolutional Neural Networks
for Sentence Classification [17], which achieved good clas-
sification performance for different text classification tasks
like sentiment analysis and is a standard baseline for new
text classification methods. The model consists of a word
embedding layer, which maps vocabulary word indices to
lower-dimensional vector spaces. The convolutional layer computes convolutions over the embedded word vectors using different filter sizes; each convolution produces tensors of a different shape, which are then reduced by max-pooling, a sampling-based discretisation process. The resulting vectors are later merged to form a single large feature vector. Full details of
the CNN layers and training process are beyond the scope
of this paper.
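Although the full training details are out of scope, the following minimal sketch shows a Kim-style CNN classifier of the kind described above, written with TensorFlow/Keras. The vocabulary size, sequence length, filter sizes and class count are illustrative assumptions rather than the prototype's exact values.

import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, EMBED_DIM, SEQ_LEN, NUM_CLASSES = 20000, 128, 50, 5  # assumed

inputs = layers.Input(shape=(SEQ_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)  # word-embedding layer

# One convolution per filter size, each followed by max-pooling.
pooled = []
for size in (3, 4, 5):
    conv = layers.Conv1D(100, size, activation="relu")(x)
    pooled.append(layers.GlobalMaxPooling1D()(conv))

merged = layers.Concatenate()(pooled)  # the merged large feature vector
merged = layers.Dropout(0.5)(merged)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(merged)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])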
4.1.2 Inference
Every text message received by the Landslip agent is passed
to the classifier, which outputs the hazard event and any
warning sign mentioned in the text. This step enables the
system to understand the topic from the user-generated con-
tent. Data classification for this system is a two-step process
involving the classification of hazards and warning signs.
In the first step, the classifier tags the message according to whether a hazard or a warning sign is present in the text. The second
step involves two classifiers, one for classifying the type of
hazard and the second for categorising the kind of warn-
ing sign as per the classification hierarchy. In some cases,
a message may contain information about both hazard and
warning sign. In such a scenario, the system passes this
message to both classifiers. In the inference step, the result
is attached as metadata to the input text.
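The two-step dispatch can be summarised by the following sketch; the classifier objects are hypothetical stand-ins for the trained models described above.

def classify_message(text, presence_clf, hazard_clf, warning_clf):
    # Step 1: tag whether the text mentions a hazard, a warning sign,
    # or both (all classifier objects are hypothetical stand-ins).
    metadata = {}
    mentions = presence_clf.predict(text)  # e.g. {"hazard", "warning_sign"}
    # Step 2: categorise with the relevant specialised classifier(s).
    if "hazard" in mentions:
        metadata["hazard"] = hazard_clf.predict(text)
    if "warning_sign" in mentions:
        metadata["warning_sign"] = warning_clf.predict(text)
    # The result is attached to the input text as metadata.
    return {"text": text, "metadata": metadata}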
4.2 Information Extraction and Annotation
In this step, a scenario is developed using information
about the situation from user-provided text. NLP tech-
niques, namely Part of Speech (PoS) tagging and Named
Entity Recognition (NER) are used to extract useful in-
formation from the text. Pre-trained NLP models for the

English language recognise geo-locations and affected entities (for example, Road, Building, Electric Pole, etc.).

Figure 4. Data Classification Hierarchy: source text is classified into hazards (e.g. flood, landslide) and warning signs (e.g. increase in water level, leaning telephone pole)

The English language model used in this work is a multi-task CNN trained on OntoNotes [15], with GloVe [31] vectors trained on Common Crawl [10]. This model is built for assigning word vectors, context-specific token vectors, POS tags, dependency parses and named entity recognition.
Figure 5. NLP Pipeline for Named Entity Recognition: text → tokenizer → PoS tagger → parser → NER → text with entity tags
As mentioned in Section 2, named entities are extracted
from the user-generated content using NER. An NLP tool
called spaCy [1] was used to perform the series of tasks required for NER. The processing pipeline consists of
a tokenizer, PoS tagger, Parser, and NER. The tokenizer to-
kenizes sentences into words for which a PoS tag is attached
based on the sentence structure. Then the parser performs
a dependency parsing of the sentence, which represents its
grammatical structure and defines the relationship between
words. This step is followed by the NER phase, which iden-
tifies the type of entity such as geo-location, person, organ-
isation, physical object, date, time, building/infrastructure
etc. This model and pipeline gave largely accurate predictions of the entity type for the noun words tagged in the PoS tagging phase. Figure 6 shows an example of
a sentence being processed and labelled.

Figure 6. Named Entity Recognition Example: entities in "I saw leaning poles near Hill Cart Road at 8.00 PM" are labelled with tags such as PERSON, OBJECT, FACILITY, GEOLOCATION and TIME

The entity tags are
attached to the original data as metadata for storage and in-
dexing. This extracted information is instantiated as objects
based on the concepts defined in the ontology and stored in
the knowledge base.
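A minimal sketch of this pipeline using spaCy, applied to the sentence from Figure 6, is shown below; we assume the standard small English model (en_core_web_sm), which may differ from the model actually deployed.

import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")  # tokenizer, tagger, parser, NER

doc = nlp("I saw leaning poles near Hill Cart Road at 8.00 PM")

for token in doc:                  # PoS tag and dependency per token
    print(token.text, token.pos_, token.dep_)
for ent in doc.ents:               # recognised named entities
    print(ent.text, ent.label_)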
5 Onto-DIAS Architecture
The main aim of the prototype, Onto-DIAS, is to pro-
vide a comprehensive suite of services to enable multi-
hazard early warning through timely detection of warning
signs. Incidents reported by social media users are used as
an initial input to detect warning signs for natural hazards.
However, additional processes are required to verify or pre-
dict the potential of hazards. Data sources discovery and
data access are essential functions for searching, acquiring
and integrating EO and ancillary data from potential data
sources to be used in the processing. The Onto-DIAS system utilises the Landslip ontology and its knowledge base to facilitate data integration over multiple data sources.
Figure 7 (a) presents the architecture of Onto-DIAS,
which comprises the following components:
Data sources — Data sources collect EO and ancillary
data from physical sensors deployed in the landslide-
prone areas. These sensors observe or measure properties related to landslides and other Earth observations, which can be used to warn about the potential occurrence of
landslides.
Knowledge APIs — Knowledge APIs are RESTful
Web Services that provide access to the landslide
knowledge base to support data integration and anal-
ysis. Three main APIs are provided (a client-side usage sketch follows this list): (i) hazard knowledge API — answers competency questions defined for the Landslip Ontology; it also provides a function for indicating hazards and warning signs from social media input, implemented using a CNN-based data classification model trained on social media data; (ii) data sources discovery API — provides a
function for searching potential data sources; and (iii)
data access API — utilises information given
by the data sources discovery API to access and query
data from multiple data sources and provide the data
for the data analytics process.
Landslip agent — The Landslip agent communicates
with users via a social media platform to receive infor-
mation about landslide-related incidents from them.
An agent for the Facebook platform was developed
for Onto-DIAS. The agent uses the knowledge API to
understand text messages from users. It then decides
whether to ask users for additional relevant informa-
tion or execute analytic processes to verify or predict
the occurrence of a landslide.
User application — A User application allows social
media users to report hazard-related incidents and additional information such as observed date, location,
etc. For Onto-DIAS, Facebook Messenger was used

as the user application to enable social media users to communicate with the Landslip agent.

Figure 7. (a) Onto-DIAS Architecture; (b) Communication between a social media user and Onto-DIAS
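To make the interplay of the three APIs concrete, the sketch below shows how a client such as the Landslip agent might chain them; the endpoint paths and payloads are hypothetical, shown for illustration only.

import requests

BASE = "https://onto-dias.example.org/api"  # hypothetical base URL

# (i) Hazard knowledge API: classify a user report.
report = {"text": "I saw leaning poles near Hill Cart Road at 8.00 PM"}
indication = requests.post(f"{BASE}/hazard-knowledge/classify",
                           json=report).json()

# (ii) Data sources discovery API: find sources for the warning sign.
sources = requests.get(f"{BASE}/data-sources",
                       params={"warning_sign": indication["warning_sign"],
                               "location": "Hill Cart Road"}).json()

# (iii) Data access API: fetch observations from a suggested source.
data = requests.get(f"{BASE}/data-access",
                    params={"source_id": sources[0]["id"]}).json()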
6 Implementation
The Onto-DIAS system is developed and deployed on a cloud server. Samples of EO and urban data are stored in
different data stores (e.g., MySQL, MongoDB, and HDFS)
to demonstrate the heterogeneity of data sources and to con-
duct experiments. Communication between social media
users and Onto-DIAS is demonstrated using a mobile ver-
sion of the Facebook Messenger application. A Facebook
group, Landslip, was created for the experiment. Social me-
dia users can search and join the group to contribute by re-
porting warning signs related to landslides. When the agent
receives a message from an individual user via the Facebook
group, it starts a conversation to acquire more information and performs data analysis to verify landslide events. Figure 7 (b) depicts examples of communica-
tion between a social media user and Onto-DIAS. A user
sends a text message via the Landslip Facebook group to
report an observed incident. On receiving the message,
the Onto-DIAS agent accesses the knowledge base through the
knowledge APIs to indicate related hazard events and re-
spond to the user. If the report from the user involves a
hazard, the agent will ask for additional information related
to the incident (e.g. observed date and location). Once
the agent obtains enough information, the knowledge base
is utilised to search for potential data sources related to the
reported incident. The list of data sources suggested by
the knowledge base is returned and accessed by Onto-DIAS
to perform data analysis for hazard event verification. As a
result, the agent returns the data analysis output to the user
along with further information depending on the analysis
models.
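The agent's end-to-end behaviour described above can be summarised by the following sketch; the helper functions are hypothetical stand-ins for the knowledge APIs of Section 5.

def handle_message(message, session):
    # Hypothetical sketch of the Landslip agent's conversation flow.
    indication = classify(message.text)        # hazard knowledge API
    if indication is None:
        return "Could you describe what you observed?"
    session.update(indication)
    missing = [f for f in ("date", "location") if f not in session]
    if missing:                                # ask for missing details
        return ("Please provide the " + " and ".join(missing)
                + " of the observation.")
    sources = discover_sources(session)        # data sources discovery API
    result = analyse(fetch(sources), session)  # data access API + analytics
    return format_alert(result)                # respond to the user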
7 Conclusion and Future Work
A comprehensive set of EO and ancillary data from mul-
tiple and heterogeneous data sources is essential for an
effective early warning system (EWS). In this paper, we
demonstrated the application of ontology-based data inte-
gration and analysis for landslide reporting and detection
through the development of a prototype Early Warning Sys-
tem (EWS), Onto-DIAS. Onto-DIAS provides a knowledge
base for the landslide domain, which can be utilised for dis-
covering potential data sources to verify the possibility of
the occurrence of a landslide based on the information re-
ported by social media users. Using a social media plat-
form, an agent interactively communicates with social me-
dia users who want to report any landslide warning sign
they observed. This allows the EWS to make dynamic and timely decisions regarding the possibility of the occurrence of a landslide. Our immediate future work on this research is
to evaluate Onto-DIAS by deploying it in a real landslide EWS application.
8 Acknowledgments
This research is partially supported by two Natural Environment Research Council projects: LandSlip (NE/P000681/1) and FloodPrep (NE/P017134/1).
References
[1] spaCy - industrial-strength natural language processing in Python. https://spacy.io/.

[2] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, and Z. Chen.
TensorFlow: Large-scale machine learning on heteroge-
neous systems. https://www.tensorflow.org/, 2015. Software
available from tensorflow.org.
[3] M. B. Almeida and R. R. Barbosa. Ontologies in knowl-
edge management support: A case study. Journal of the
American Society for Information Science and Technology,
60(10):2032–2047.
[4] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mer-
cer. A tree-based statistical language model for natural lan-
guage speech recognition. IEEE Transactions on Acoustics,
Speech, and Signal Processing, 37(7):1001–1008, 1989.
[5] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural
probabilistic language model. Journal of machine learning
research, 3(Feb):1137–1155, 2003.
[6] P. F. Brown, P. V. Desouza, R. L. Mercer, V. J. D. Pietra, and
J. C. Lai. Class-based n-gram models of natural language.
Computational linguistics, 18(4):467–479, 1992.
[7] J.-P. Calbimonte, H. Jeung, O. Corcho, and K. Aberer. Se-
mantic sensor data search in a large-scale federated sensor
network. Proceedings of the 4th International Workshop on
Semantic Sensor Networks, 839:23–38, 2011.
[8] W. B. Cavnar, J. M. Trenkle, et al. N-gram-based text cate-
gorization. In Proceedings of SDAIR-94, 3rd annual sympo-
sium on document analysis and information retrieval, vol-
ume 161175. Citeseer, 1994.
[9] R. Collobert and J. Weston. A unified architecture for natural
language processing: Deep neural networks with multitask
learning. In Proceedings of the 25th international confer-
ence on Machine learning, pages 160–167. ACM, 2008.
[10] Common Crawl. Common Crawl. http://commoncrawl.org/.
[11] H. Cunningham, V. Tablan, A. Roberts, and K. Bontcheva.
Getting more out of biomedical documents with gate’s full
lifecycle open source text analytics. PLoS computational
biology, 9(2):e1002854, 2013.
[12] J. Gill and B. Malamud. Hazard interactions and inter-
action networks (cascades) within multi-hazard methodolo-
gies. Earth System Dynamics, 7(3):659–679, 8 2016.
[13] J. C. Gill and B. D. Malamud. Anthropogenic processes, nat-
ural hazards, and interactions in a multi-hazard framework.
Earth-Science Reviews, 166:246 – 269, 2017.
[14] F. E. Horita, J. P. de Albuquerque, L. C. Degrossi, E. M.
Mendiondo, and J. Ueyama. Development of a spatial deci-
sion support system for flood risk management in Brazil that
combines volunteered geographic information with wireless
sensor networks. Comput. Geosci., 80(C):84–94, July 2015.
[15] E. Hovy, M. Marcus, M. Palmer, L. Ramshaw, and
R. Weischedel. OntoNotes: The 90% solution. In Pro-
ceedings of the human language technology conference of
the NAACL, Companion Volume: Short Papers, 2006.
[16] M. Kibanov, G. Stumme, I. Amin, and J. G. Lee. Mining
social media to inform peatland fire and haze disaster man-
agement. CoRR, abs/1706.05406, 2017.
[17] Y. Kim. Convolutional neural networks for sentence classi-
fication. arXiv preprint arXiv:1408.5882, 2014.
[18] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush. Character-
aware neural language models. In Thirtieth AAAI Confer-
ence on Artificial Intelligence, 2016.
[19] H. Knublauch. Ontology-driven software development in the context of the semantic web: An example scenario with Protégé/OWL, 2004.
[20] S. Lai, L. Xu, K. Liu, and J. Zhao. Recurrent convolutional
neural networks for text classification. In Twenty-ninth AAAI
conference on artificial intelligence, 2015.
[21] L. Lefort (CSIRO) and W3C Semantic Sensor Network Incubator Group. Ontology for quantity kinds and units: units and quantities definitions. https://www.w3.org/2005/Incubator/ssn/ssnx/qu/qu-rec20.html, 2010.
[22] E. Loper and S. Bird. Nltk: the natural language toolkit.
arXiv preprint cs/0205028, 2002.
[23] D. M. Magerman. Statistical decision-tree models for pars-
ing. In Proceedings of the 33rd annual meeting on Associa-
tion for Computational Linguistics, pages 276–283. Associ-
ation for Computational Linguistics, 1995.
[24] C. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. Bethard,
and D. McClosky. The stanford corenlp natural language
processing toolkit. In Proceedings of 52nd annual meet-
ing of the association for computational linguistics: system
demonstrations, pages 55–60, 2014.
[25] H. Michels and P. Maué. Semantics for notifying events in the affecting environment. In EnviroInfo, 2010.
[26] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.
[27] OGC. Observations and measurements, 2011.
[28] OGC. Sensor model language (sensorml), 2011.
[29] OGC. Sensor observation service, 2011.
[30] W3C/OGC. Semantic sensor network ontology. https://www.w3.org/TR/vocab-ssn/, 2016.
[31] J. Pennington, R. Socher, and C. Manning. Glove: Global
vectors for word representation. In Proceedings of the 2014
conference on empirical methods in natural language pro-
cessing (EMNLP), pages 1532–1543, 2014.
[32] R. G. Raskin and M. J. Pan. Knowledge representation in
the semantic web for earth and environmental terminology
(sweet). Computers & Geosciences, 31(9):1119 – 1125,
2005. Application of XML in the Geosciences.
[33] H. Shen. Discussion and analysis of the crowdsourcing
mode of public participation in emergency management. In
2015 8th International Symposium on Computational Intel-
ligence and Design (ISCID), volume 2, pages 610–613, Dec
2015.
[34] R. Studer, V. R. Benjamins, and D. Fensel. Knowledge
engineering: Principles and methods. Data Knowl. Eng.,
25:161–197, 1998.
[35] H. To, S. Agrawal, S. H. Kim, and C. Shahabi. On iden-
tifying disaster-related tweets: Matching-based or learning-
based? CoRR, abs/1705.02009, 2017.
[36] M. Uschold and M. Gruninger. Ontologies: principles,
methods and applications. The Knowledge Engineering Re-
view, 11(2):93–136, 1996.
[37] W3C. Owl 2 web ontology language primer (second edi-
tion), 2012.
[38] W3C. W3c semantic web activity, 2013.
[39] W3C. Web ontology language (owl), 2013.
