Identifying Tweets from Syria Refugees Using a Random Forest Classifier

Smarti Reel
Health Informatics Centre
University of Dundee
Dundee, UK
s.reel@dundee.ac.uk
Patrick Wong
School of Computing and
Communications
Open University
Milton Keynes, UK
patrick.wong@open.ac.uk
Belinda Wu
School of Politics,
Philosophy, Economics,
Development, Geography
Open University, UK
belinda.wu@open.ac.uk
Soraya Kouadri Mostefaoui
School of Computing and
Communications
Open University, UK
soraya.kouadri@open.ac.uk
Haiming Liu
Department of Computer
Science and Technology
University of Bedfordshire,
Luton, UK
Haiming.Liu@beds.ac.uk
Abstract: Social unrest and a violent atmosphere can force vast
numbers of people to flee their country. While governments
and international aid organizations need migration data to
inform their decisions, this data is often delayed because
collecting and publishing it is laborious.
Recent studies have recognized the increasing use of social
networking platforms amongst refugees to seek help and express
their hardship during their journeys. This paper investigates the
feasibility of accurately extracting and identifying tweets from
Syria refugees. A robust framework has been developed to find,
retrieve, clean and classify tweets from Syria. This includes the
development of a Random Forest classifier, which automatically
determines which tweets are from Syria refugees. Testing the
classifier with samples of historical Twitter data produced a
promising correct classification rate of 81%. This
preliminary study demonstrates the potential for refugees’
messages to be accurately identified and extracted from social
media data mixed with many unwanted messages, which
enables further work on studying refugee issues and predicting
their migration patterns.
Keywords: social media, Syria, refugees, classification
Short Papers, CSCI-ISNA
I. INTRODUCTION
When a large-scale disaster or conflict breaks out, vast
numbers of people in the affected areas often migrate and seek
refuge in safe havens elsewhere. Such a huge and sudden
movement of people can cause complex humanitarian, social
and economic issues for the refugees, the neighbouring and host
countries, and the various governmental and aid
agencies. If the patterns of these migrations can be correctly
understood and accurately predicted, the governments and aid
agencies of the countries concerned can be better prepared to
help the refugees. Recent studies have recognized the
increasing use of social media such as Twitter and Facebook
amongst refugees and migrants [1]. They often use social
networking platforms to express the hardship and difficulties
they face and to report their experiences, and they also use such
platforms as a tool for planning their migration and for finding
information during their journeys to their destination
countries, including the most up-to-date situations and
regulations on the way. This creates an opportunity for
studying large-scale international migrations using social
media and data analytic methods, which have been
successfully applied in modelling various human behaviors
such as travel tracking [2] and crowd dynamics [3].
However, predicting migration patterns relies heavily on the
accurate identification of messages from refugees. This paper
therefore focuses on identifying social media messages from
refugees, which has been confirmed as a challenging problem because
refugees are reluctant to post on open social media platforms for
fear of being caught by their government authorities or
human smugglers [4]. To prevent refugees’ identities from
being accidentally revealed by this study, their social media
messages were anonymized by replacing their usernames
with unique index numbers.
Recent studies have analyzed the trends in mobility and
migration flows using geolocated Twitter data. For example,
Zagheni [5] investigated the use of Twitter data to analyze
international and internal migration patterns. However, such
analysis is entirely dependent on the availability of geolocated
data, which is often unavailable in refugees’ messages due to
limitations of their computing devices or other reasons.
Gillespie [4] explored refugee media journeys
through smartphones and social media networks. The study obtained
information on trusted sources and groups on Facebook.
While some of the Facebook groups studied contained news
about refugees in Syria, other groups concerned general
migration issues. These groups revealed only partial
information about refugees’ experiences. The study also found
that most of the trusted sources communicate in Arabic.
Ali [6] discussed the significance of big data for various
applications and development purposes. The paper provided a brief
background of relevant techniques to understand the
applications in humanitarian development, and proposed using
predictive analytics to avoid or mitigate humanitarian
emergencies before they happen.
Brouckman and Wang [7] investigated the use of
supervised machine learning along with natural language
processing methods to classify tweets downloaded using the
Twitter API (with the keyword ‘refugee’ in four languages).
They used unigram features to model the tweets and then
trained five classifiers, namely Support Vector Machines,
Logistic Regression, Random Forest, Naive Bayes, and
Ensemble, to predict the sentiments towards refugees as either
positive, negative, or neutral.
Despite the above research studies, accurately identifying
social media messages from refugees remains a challenge,
even though it is an important step in predicting migration patterns.
II. METHODOLOGY
This pilot study aims to identify refugees’ social media
messages from data openly available on social networking
platforms using data analytic techniques. Twitter was chosen
as the source of social media messages for this study because
of the public nature of its data. The recent Syria crisis was
chosen as the example for this pilot study because of the large
scale of human movement that occurred during it. The main challenge
of identifying refugees’ messages (tweets) is that there is a
vast number of messages discussing the Syria crisis from
non-refugees, such as journalists and members of the general public who are
concerned with the situation in Syria. These messages
contain keywords similar to those in messages from refugees.
Therefore, the main objective of this pilot study is to develop a
robust framework for identifying tweets from Syria refugees.
This includes processes such as finding, extracting,
cleaning and labelling tweets, extracting key features from the
tweets and classifying them. For this study, Syria refugees are
defined as Syrians who are considering leaving Syria to seek
refuge elsewhere, those who are en route to their destinations, or
those who have recently reached their destinations.
A. Extraction and labelling of tweets
A number of tools were available for extracting tweets,
including the Twitter API [8], the streaming API [9], the R Project for
Statistical Computing [10] and RSelenium [11].
RSelenium was chosen because it could be modified to extract older
tweets and was easy to use. Customized code for RSelenium was
developed to extract tweets. The code uses a cascading style
sheets (CSS) selector function to locate and extract the required
fields of each tweet, such as the time of the tweet, the username and
the tweet content. The tweets extracted for this pilot study were from
2011 to 2017, as Syrian migration was more intense during
this period. The usernames of the extracted tweets
were anonymized after extraction.
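The anonymization step described above can be sketched in a few lines of Python. This is only an illustration of replacing usernames with unique index numbers; the study's own extraction code was written for RSelenium in R and is not reproduced here. The function name `anonymize_usernames`, the record keys `username` and `text`, and the sample records are all hypothetical.

```python
def anonymize_usernames(tweets):
    """Replace each username with a unique index number.

    `tweets` is a list of dicts with the (hypothetical) keys
    'username' and 'text'. The same author always receives the same
    index, so tweets by one author stay linked to each other without
    the account name being stored.
    """
    index_of = {}      # username -> index number, assigned in order seen
    anonymized = []
    for tweet in tweets:
        user = tweet["username"]
        if user not in index_of:
            index_of[user] = len(index_of) + 1
        record = dict(tweet)               # copy; the input is left untouched
        record["username"] = index_of[user]
        anonymized.append(record)
    return anonymized


# Invented sample records, for illustration only.
sample = [
    {"username": "alice", "text": "first message"},
    {"username": "bob", "text": "second message"},
    {"username": "alice", "text": "third message"},
]
result = anonymize_usernames(sample)
```

Keeping a stable index per author matters here because, as described in this section, an identified author's other tweets are later retrieved for further analysis.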
The tweets were extracted using different combinations
of keywords that refugees were expected to use. The
initial keywords were based on general knowledge about
the Syria war, such as ‘Syria’, ‘escape’, ‘leave’, ‘drowning’,
‘fear’, ‘asylum’ and ‘help me’. The extracted tweets were
manually checked and studied during the development phase.
When a refugee or potential refugee had been identified, their
other tweets were also extracted for further analysis.
Each extracted tweet record contains the anonymized username, time of
tweet, retweets, replies and likes. To identify the main
keywords in refugees’ tweets, the extracted tweets were used to
form a word cloud [12]. Figure 1 shows the main keywords
and their significance in the word cloud. A simple analysis
indicated that Syria refugees were inclined to seek asylum in
Canada, Sweden and Germany. Word clouds can provide
insight into the terminology most used on Twitter by
refugees. Another interesting observation is that there are
common spelling errors, e.g. ‘Germani’, in the extracted tweets.
These unexpected spelling errors can make the identification
of refugees’ tweets more difficult.
Figure 1: Word cloud highlighting important keyword
associations.
Around 40 different keywords were used in various
combinations to extract relevant tweets from Twitter. This
resulted in around 5000 tweets. The extracted tweets were
cleaned by removing all non-English content, such as URLs,
usernames, punctuation, extra white space and symbols. To
create a training set for identifying tweets
from refugees, the cleaned tweets were manually classified
into two classes and labelled as ‘refugee’ or ‘commentator’.
The tweets labelled as ‘refugee’ are those whose authors
explicitly express their own or their family’s desire to escape from
Syria, are en route to their destination country, or have reached their
destination recently. The commentators are non-refugees who
have an opinion on the Syria crisis, e.g. tweets from
journalists and members of the general public outside Syria. The
training set contains 212 tweets, which were randomly chosen
with the aim of having roughly equal representation of
the two classes. These tweets were mostly from
different authors. Out of the 212 tweets, 93 were from
commentators and 119 from refugees. The tweets were labelled
based on common sense. For example, if a person shows
an intention to leave a country, their tweet is marked as
‘refugee’.
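The cleaning step can be illustrated with a minimal Python sketch using only the standard `re` module. The paper does not give its exact cleaning rules, so the regular expressions below, the function name `clean_tweet`, the choice to lowercase the text and the choice to drop digits are all assumptions made for illustration.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # web links
MENTION_RE = re.compile(r"@\w+")                # @username mentions
NON_LETTER_RE = re.compile(r"[^A-Za-z\s]")      # punctuation, digits, symbols, non-Latin script
SPACES_RE = re.compile(r"\s+")                  # runs of white space


def clean_tweet(text):
    """Return a lowercased tweet containing only ASCII words and single spaces."""
    text = URL_RE.sub(" ", text)          # remove URLs first, before '/' and '.' are stripped
    text = MENTION_RE.sub(" ", text)      # remove usernames
    text = NON_LETTER_RE.sub(" ", text)   # remove punctuation and symbols
    return SPACES_RE.sub(" ", text).strip().lower()


# Invented example tweet, for illustration only.
raw = "@user Please help me!! We must leave Syria... https://t.co/abc123 #asylum"
cleaned = clean_tweet(raw)   # "please help me we must leave syria asylum"
```

The order of the substitutions matters: URLs and mentions are removed while their delimiting characters are still present, and only then are the remaining symbols stripped.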
B. Feature Extraction and classification
The key features of the cleaned tweets were extracted using
the Bag of Words (BoW) method [13]. BoW is a natural
language processing technique that can be used to categorize
textual information by representing it as an unordered set of
words with their respective frequencies. Each word
frequency is used as a feature. A total of 107 features were extracted
from the 212 tweets. The 107 BoW features, along with 8
metadata fields such as the number of followers, retweets
and likes, form a total of 115 features, which were used for
training a classifier.
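As a hedged illustration of the BoW representation described above, the following Python sketch builds a vocabulary from a handful of invented tweets, turns each tweet into a vector of word counts, and appends metadata values to form the final feature row. The function names, the toy tweets and the metadata numbers are hypothetical; the study's actual vocabulary of 107 words and its 8 metadata fields are not reproduced.

```python
from collections import Counter


def build_vocabulary(documents):
    """Return the sorted list of distinct words across all documents."""
    return sorted({word for doc in documents for word in doc.split()})


def bag_of_words(document, vocabulary):
    """Map one document to a list of word counts, ordered as `vocabulary`."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]   # missing words count as 0


# Invented, already-cleaned tweets.
tweets = ["help me leave syria", "syria news report", "leave leave now"]
vocab = build_vocabulary(tweets)
vectors = [bag_of_words(t, vocab) for t in tweets]

# Hypothetical metadata per tweet: [followers, retweets, likes].
metadata = [[120, 3, 5], [50000, 210, 340], [80, 0, 1]]

# Each feature row is the BoW counts followed by the metadata values.
features = [bow + meta for bow, meta in zip(vectors, metadata)]
```

Word order is discarded by construction: two tweets containing the same words in a different order produce the same BoW vector.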
A number of classification tools were considered and tested,
but the Random Forest classifier was chosen for this study
mainly because it can compensate for the tendency of individual
decision trees to overfit their training set [14]. The performance of the
classifier was validated using 5-fold cross-validation (CV)
due to its robustness. In 5-fold CV, the original dataset is
randomly partitioned into 5 subsets, where a single subset is
retained as the validation data for testing and the remaining 4
subsets are used as training data. The CV process is then
repeated 5 times, with each of the 5 subsets used exactly once
as the validation data. The results from the folds are averaged
to produce a single estimate [15].
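The 5-fold procedure described above can be sketched in plain Python as follows. This is a minimal illustration, not the study's implementation: the helper names are hypothetical, the data are placeholder integers with the same class sizes as the training set, and a simple majority-class baseline stands in for the Random Forest classifier so that the example needs no external libraries.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Randomly partition the indices 0..n_samples-1 into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]   # fold sizes differ by at most 1


def cross_validate(samples, labels, train_and_score, k=5, seed=0):
    """Use each fold once as validation data and average the k scores."""
    folds = k_fold_indices(len(samples), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(train_and_score(
            [samples[j] for j in train_idx], [labels[j] for j in train_idx],
            [samples[j] for j in test_idx], [labels[j] for j in test_idx],
        ))
    return sum(scores) / k


def majority_baseline(train_x, train_y, test_x, test_y):
    """Stand-in classifier: always predict the most common training label."""
    majority = max(set(train_y), key=train_y.count)
    return sum(1 for y in test_y if y == majority) / len(test_y)


# Placeholder data with the class sizes reported for the training set.
samples = list(range(212))
labels = ["commentator"] * 93 + ["refugee"] * 119
folds = k_fold_indices(len(samples))
mean_accuracy = cross_validate(samples, labels, majority_baseline)
```

Any function with the same four-argument signature that trains a model on the first two arguments and returns its accuracy on the last two could be passed in place of `majority_baseline`.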
III. EXPERIMENT RESULTS AND DISCUSSIONS
Using 5-fold CV, the Random Forest classifier correctly
classified 140 of the 212 tweets,
which corresponds to a classification accuracy of 66%.
To further analyze the classification performance, a
confusion matrix was used to cross-check the correct and
incorrect classifications, as shown in Table 1. The diagonal
values in the matrix are the numbers of correctly classified tweets,
while the off-diagonal values are the numbers of incorrectly
classified tweets.
Table 1: Confusion Matrix for 5-fold Cross Validation.

                     Predicted by Classifier
Manually Labelled    Commentator    Refugee      Total
Commentator          40 (43%)       53 (57%)     93
Refugee              19 (16%)       100 (84%)    119
Total                59             153          212
It is observed that the classifier performed well on
tweets from refugees, correctly classifying 84% of them. For tweets
from commentators, the performance of the classifier was
much weaker: only 43% were correctly classified, whilst 57% were
misclassified as ‘refugee’. It was believed that the high
misclassification rate was caused by the imbalanced number of
representative samples in the two classes, i.e., more tweets
from refugees than from commentators. To test this hypothesis,
the Random Forest classifier was trained again using a
resampling method, which ensures that the number of
representative samples from each class is exactly the same by
randomly duplicating some samples from the class with fewer
samples and randomly removing some samples from the class
with more samples. The results of the
classification with resampled data are shown in the confusion
matrix in Table 2. Overall, 171 out of 212 tweets were correctly
classified, giving an average classification rate of 81%. For
tweets from refugees, 77% were correctly classified,
while 23% were misclassified as tweets from
commentators. For tweets from commentators, 84% (89 tweets)
were correctly classified, while 16% (17 tweets) were misclassified
as tweets from refugees.
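The resampling idea can be sketched as follows. The paper does not state the exact procedure or the tool used, so this is an assumption-laden illustration: the function name `balance_classes` is hypothetical, and the target size is taken to be the mean class size, chosen because it turns the original 93 and 119 samples into 106 and 106, which matches the class totals in Table 2.

```python
import random
from collections import Counter


def balance_classes(samples, labels, seed=0):
    """Resample so that every class ends up with the same number of samples.

    Classes larger than the target lose randomly chosen samples; classes
    smaller than the target gain randomly chosen duplicates.
    """
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = len(samples) // len(by_class)     # assumed target: mean class size
    balanced = []
    for label, items in by_class.items():
        if len(items) > target:
            chosen = rng.sample(items, target)             # random removal
        else:
            extra = [rng.choice(items) for _ in range(target - len(items))]
            chosen = items + extra                         # random duplication
        balanced.extend((sample, label) for sample in chosen)
    rng.shuffle(balanced)
    return balanced


# Placeholder samples with the class sizes of the original training set.
samples = list(range(212))
labels = ["commentator"] * 93 + ["refugee"] * 119
balanced = balance_classes(samples, labels)
class_sizes = Counter(label for _, label in balanced)
```

One design point to keep in mind: if duplicates are created before the data are split into folds, copies of the same tweet can appear in both the training and the validation subsets, which tends to make the accuracy estimate optimistic. The paper does not state at which stage the resampling was applied.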
Table 2: Confusion Matrix for 5-fold Cross Validation (with resampled data).

                     Predicted by Classifier
Manually Labelled    Commentator    Refugee      Total
Commentator          89 (84%)       17 (16%)     106
Refugee              24 (23%)       82 (77%)     106
Total                113            99           212
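The percentages in Tables 1 and 2 follow directly from the counts. The short Python check below recomputes the overall classification rate and the per-class rate of correct classification from each confusion matrix; the function name `summarize` is hypothetical, and the counts are copied from the two tables.

```python
def summarize(matrix, classes):
    """Return (overall accuracy, per-class rate of correct classification).

    matrix[i][j] is the number of tweets manually labelled classes[i]
    that the classifier predicted as classes[j].
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(classes)))
    per_class = {
        classes[i]: matrix[i][i] / sum(matrix[i]) for i in range(len(classes))
    }
    return correct / total, per_class


classes = ["commentator", "refugee"]
table1 = [[40, 53], [19, 100]]    # original training set
table2 = [[89, 17], [24, 82]]     # resampled training set

acc1, rate1 = summarize(table1, classes)
acc2, rate2 = summarize(table2, classes)

print(f"Table 1: overall {acc1:.0%}, "
      f"commentator {rate1['commentator']:.0%}, refugee {rate1['refugee']:.0%}")
# Table 1: overall 66%, commentator 43%, refugee 84%
print(f"Table 2: overall {acc2:.0%}, "
      f"commentator {rate2['commentator']:.0%}, refugee {rate2['refugee']:.0%}")
# Table 2: overall 81%, commentator 84%, refugee 77%
```

The misclassification rate for a class is one minus its rate of correct classification, so commentators are misclassified at 57% in Table 1 and 16% in Table 2, and refugees at 16% and 23% respectively.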
From these preliminary results, it appears that the Random Forest
classifier trained with data resampling gives significantly better
overall classification results. Compared with the classifier
trained with the original data (i.e., more refugee samples),
the classifier trained with data resampling significantly
reduced misclassifications of commentators’ tweets as
refugees’ (from 57% to 16%), but also moderately increased
misclassifications of refugees’ tweets as commentators’ (from
16% to 23%). The reason for this behavior could be
that there are insufficient unique features in the training samples of
the two classes, which makes them difficult to
distinguish from each other. This could be caused by the
overlap of keywords in the two classes of tweets. Figures
2(a) and (b) show the word clouds of the tweets from
refugees and commentators respectively. It is observed that
many keywords appear in both word clouds, e.g. Syria, Syrian,
refuge, leav and famili. Further study is needed on the search
terms and feature extraction.
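The keyword overlap between the two classes can also be measured directly rather than judged by eye from word clouds. The sketch below is one simple way to do this, under stated assumptions: the function names are hypothetical, the example tweets are invented, and comparing the most frequent words of each class is a choice made for illustration, not a method reported in the study.

```python
from collections import Counter


def top_words(tweets, n):
    """Return the set of the n most frequent words in a list of cleaned tweets."""
    counts = Counter(word for tweet in tweets for word in tweet.split())
    return {word for word, _ in counts.most_common(n)}


def shared_keywords(tweets_a, tweets_b, n=20):
    """Return the words that are among the top n in both groups of tweets."""
    return top_words(tweets_a, n) & top_words(tweets_b, n)


# Invented, already-cleaned example tweets.
refugee_tweets = ["my family must leave syria", "help my family leave"]
commentator_tweets = ["syria refugees leave by sea", "report on syria"]

overlap = shared_keywords(refugee_tweets, commentator_tweets)   # {"leave", "syria"}
```

Words that appear in `overlap` carry little information for separating the classes, whereas words that are frequent in only one class are candidates for more discriminative features or search terms.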
Figure 2: Word clouds showing the keywords and their
significance in the tweets from (a) refugees and (b)
commentators.
IV. CONCLUSION AND FUTURE WORK
A robust framework was developed to find, extract, clean and
label tweets related to Syrian refugees based on keywords. A
Random Forest classifier was developed and tested using the
labelled tweets. The experimental results show that the developed
classifier achieved an overall average classification rate of
81%. The results suggest that the developed classifier can be
used to reliably extract refugees’ tweets from data mixed with
many unwanted commentators’ tweets. This study concludes
that applying suitable data analytic techniques to information
extracted from social media platforms such as Twitter has
great potential for identifying refugees’ messages, which has
previously been identified as a difficult task. With further
research on classifying refugees into sub-classes, such as
those who intend to leave their home country, those en route to their
destination and those who have recently reached their destination, and on
working out their current location from their messages, these
techniques together could potentially predict the migration
patterns of refugees effectively.
As for future work, it was found that some refugees sought
help from famous figures, who sometimes lent their popular
social media accounts to refugees so that they could publicize
their stories. It is therefore suggested to broaden the search terms to
include messages from pro-refugee famous figures.
Another observation was that many refugees refrained from
openly posting or disclosing their location information on
public platforms such as Twitter for fear of being
caught by human smugglers or their government agencies.
However, they were willing to share their whereabouts in
trusted closed groups on other social media networks such as
Facebook. It is therefore suggested to seek permission to
access and analyse messages in closed groups for refugees.
However, safeguards must be in place to protect refugees’
identities, e.g. by anonymizing their messages. Furthermore, many
messages from Syria refugees were in Arabic. It is therefore
suggested to translate the Arabic messages into English and
include them in the analysis.
REFERENCES
[1] Cassar, C.M., Gauci, J.P. and Bacchi, A. 2016. Migrants’ Use of Social
Media in Malta. The People for Change Foundation.
[2] Chen, C., Ma, J., Susilo, Y., Liu, Y., Wang, M., 2016. The promises of
big data and small data for travel behavior (aka human mobility)
analysis. Transp. Res. Part C Emerg. Technol. 68, 285-299.
https://doi.org/10.1016/j.trc.2016.04.005
[3] Bellomo, N., Clarke, D., Gibelli, L., Townsend, P., Vreugdenhil, B.J.,
2016. Human behaviours in evacuation crowd dynamics: From
modelling to “big data” toward crisis management. Phys. Life Rev. 18,
1-21. https://doi.org/10.1016/j.plrev.2016.05.014
[4] Gillespie, M., Ampofo, L., Cheesman, M., Faith, B., Iliadou, E., Issa, A.,
Osseiran, S., Skleparis, D., 2016. Mapping Refugee Media Journeys:
Smartphones and Social Media Networks. The Open University / France
Médias Monde.
[5] Zagheni, E., Garimella, V.R.K., Weber, I., State, B., 2014. Inferring
International and Internal Migration Patterns from Twitter Data, in:
Proceedings of the 23rd International Conference on World Wide Web,
WWW ’14 Companion. ACM, New York, NY, USA, pp. 439-444.
https://doi.org/10.1145/2567948.2576930
[6] Ali, A., Qadir, J., Rasool, R. ur, Sathiaseelan, A., Zwitter, A., Crowcroft,
J., 2016. Big data for development: applications and techniques. Big
Data Anal. 1, 2. https://doi.org/10.1186/s41044-016-0002-4
[7] Brouckman, L., Wang, A. 2017. Analyzing Twitter Sentiment On the
Refugee Crisis in 2016.
[8] Jäderberg, D. 2016. Sentiment and topic classification of messages on
Twitter and using the results to interact with Twitter users. Uppsala
Universitet.
[9] Wickham, H., 2016. rvest: Easily Harvest (Scrape) Web Pages.
[10] R: The R Project for Statistical Computing [WWW Document], n.d.
URL https://www.r-project.org/ (accessed 11.13.17).
[11] Harrison, J. 2017. RSelenium: R Bindings for “Selenium WebDriver”.
[12] Lohmann, S., Ziegler, J., Tetzlaff, L., 2009. Comparison of Tag Cloud
Layouts: Task-Related Performance and Visual Exploration, in:
Human-Computer Interaction: INTERACT 2009, Lecture Notes in Computer
Science. Presented at the IFIP Conference on Human-Computer
Interaction, Springer, Berlin, Heidelberg, pp. 392-404.
https://doi.org/10.1007/978-3-642-03655-2_43
[13] Zhang, Y., Jin, R., Zhou, Z.-H., 2010. Understanding bag-of-words
model: a statistical framework. Int. J. Mach. Learn. Cybern. 1, 43-52.
https://doi.org/10.1007/s13042-010-0001-0
[14] James, G., Witten, D., Hastie, T., Tibshirani, R., 2017. An Introduction
to Statistical Learning: with Applications in R, 1st ed. 2013, corr. 7th
printing 2017. Springer, New York.
[15] Arlot, S., Celisse, A., 2010. A survey of cross-validation procedures for
model selection. Stat. Surv. 4, 40-79. https://doi.org/10.1214/09-SS054
[16] Witten, I.H., Frank, E., Hall, M.A., Pal, C.J., 2016. Data Mining:
Practical Machine Learning Tools and Techniques, 4th ed.
Morgan Kaufmann, Amsterdam.
... Migration has attracted the attention of researchers from multiple disciplines, which range from sociology (Crawley and Skleparis 2018) to communication (Sajir and Aouragh 2019), and psychology (Goodman, Sirriyeh, and McMahon 2017;Volkan 2018) to information science (Aswad and Menezes 2018;Urchs et al. 2019;Vázquez and Pérez 2019;Khatua and Nejdl 2021b). Migration-related issues were probed in the context of France (Siapera et al. 2018), Germany (Riyadi and Widhiasti 2020;Siapera et al. 2018), Italy (Capozzi et al. 2020;Kim et al. 2020b), Korea (Kim et al. 2020a), Netherlands (Udwan, Leurs, and Alencar 2020), Spain (Calderón, de la Vega, and Herrero 2020; Vázquez and Pérez 2019), Syria (Dekker et al. 2018;Rettberg and Gajjala 2016;Reel et al. 2018;Öztürk and Ayvaz 2018;Udwan, Leurs, and Alencar 2020), Turkey (Bozdag and Smets 2017;Özerim and Tolay 2020), the UK (Coletto et al. 2016), and the USA (Zagheni et al. 2018). Information science researchers mostly analyzed online contents, such as Facebook (Capozzi et al. 2020;Hrdina 2016;Zagheni et al. 2018), Instagram (Guidry et al. 2018), Pinterest (Guidry et al. 2018), YouTube , Twitter (Alcántara-Plá and Ruiz-Sánchez 2018; Aswad and Menezes 2018;Calderón, de la Vega, and Herrero 2020;Gualda and Rebollo 2016;Kim et al. 2020;Nerghes and Lee 2018;Pope and Griffith 2016;Vázquez and Pérez 2019) as well as mainstream media (Nerghes and Lee 2019). ...
... The Twitter data was also used for sentiment analysis and opinion mining in the context of migration Reel et al. 2018). A multilingual study (German and English) considered two specific refugee-related events and performed sentiment analysis of Twitter discussions around these two events (Pope and Griffith 2016). ...
... Second, a far-right perspective where refugees are framed as terrorists or criminals, and subsequently, these create security and safety concerns in the host nation. These apprehensions towards migrants were also observed by other studies -especially in the context of Syrian refugees (Özerim and Tolay 2020;Öztürk and Ayvaz 2018;Reel et al. 2018). Reel et al. (2018) has proposed a random forest-based classifier to extract and identify tweets about Syrian refugees. ...
Article
Full-text available
We draw insights from the social psychology literature to identify two facets of Twitter deliberations about migrants, i.e., perceptions about migrants and behaviors towards migrants. Our theoretical anchoring helped us in identifying two prevailing perceptions (i.e., sympathy and antipathy) and two dominant behaviors (i.e., solidarity and animosity) of social media users towards migrants. We have employed unsupervised and supervised approaches to identify these perceptions and behaviors. In the domain of applied NLP, our study offers a nuanced understanding of migrant-related Twitter deliberations. Our proposed transformer-based model, i.e., BERT + CNN, has reported an F1-score of 0.76 and outperformed other models. Additionally, we argue that tweets conveying antipathy or animosity can be broadly considered hate speech towards migrants, but they are not the same. Thus, our approach has fine-tuned the binary hate speech detection task by highlighting the granular differences between perceptual and behavioral aspects of hate speeches.
... Migration has attracted the attention of researchers from multiple disciplines, which range from sociology (Crawley and Skleparis 2018) to communication (Sajir and Aouragh 2019), and psychology (Goodman, Sirriyeh, and McMahon 2017;Volkan 2018) to information science (Aswad and Menezes 2018;Urchs et al. 2019;Vázquez and Pérez 2019). Migration-related issues were probed in the context of France (Siapera et al. 2018), Germany (Riyadi and Widhiasti 2020;Siapera et al. 2018), Italy (Capozzi et al. 2020;Kim et al. 2020b), Korea (Kim et al. 2020a), Netherlands (Udwan, Leurs, and Alencar 2020), Spain (Calderón, de la Vega, and Herrero 2020;Vázquez and Pérez 2019), Syria (Dekker et al. 2018;Rettberg and Gajjala 2016;Reel et al. 2018;Öztürk and Ayvaz 2018;Udwan, Leurs, and Alencar 2020), Turkey (Bozdag and Smets 2017;Özerim and Tolay 2020), the UK (Coletto et al. 2016), and the USA (Zagheni et al. 2018). Information science researchers mostly analyzed online contents, such as Facebook (Capozzi et al. 2020;Hrdina 2016;Zagheni et al. 2018), Instagram (Guidry et al. 2018), Pinterest (Guidry et al. 2018), YouTube , Twitter (Alcántara-Plá and Ruiz-Sánchez 2018; Aswad and Menezes 2018; Calderón, de la Vega, and Herrero 2020; Gualda and Rebollo 2016;Kim et al. 2020;Nerghes and Lee 2018;Pope and Griffith 2016;Vázquez and Pérez 2019) as well as mainstream media (Nerghes and Lee 2019). ...
... The Twitter data was also used for sentiment analysis and opinion mining in the context of migration Reel et al. 2018). A multilingual study (German and English) considered two specific refugee-related events and performed sentiment analysis of Twitter discussions around these two events (Pope and Griffith 2016). ...
... Second, a far-right perspective where refugees are framed as terrorists or criminals, and subsequently, these create security and safety concerns in the host country. These apprehensions towards migrants were also observed by other studiesespecially in the context of Syrian refugees (Özerim and Tolay 2020;Öztürk and Ayvaz 2018;Reel et al. 2018). Reel et al. (2018) has proposed a random forest-based classifier to extract and identify tweets about Syrian refugees. ...
Preprint
Full-text available
We draw insights from the social psychology literature to identify two facets of Twitter deliberations about migrants, i.e., perceptions about migrants and behaviors towards mi-grants. Our theoretical anchoring helped us in identifying two prevailing perceptions (i.e., sympathy and antipathy) and two dominant behaviors (i.e., solidarity and animosity) of social media users towards migrants. We have employed unsuper-vised and supervised approaches to identify these perceptions and behaviors. In the domain of applied NLP, our study of-fers a nuanced understanding of migrant-related Twitter de-liberations. Our proposed transformer-based model, i.e., BERT + CNN, has reported an F1-score of 0.76 and outper-formed other models. Additionally, we argue that tweets con-veying antipathy or animosity can be broadly considered hate speech towards migrants, but they are not the same. Thus, our approach has fine-tuned the binary hate speech detection task by highlighting the granular differences between perceptual and behavioral aspects of hate speeches.
... Random Forest achieved the best results with 98% accuracy, using 80% of the data set for training and 20% for testing. The study of [26] proposes a Random Forest classifier that automatically determines which tweets are from Syrian refugees and produced a promising 81% correct classification result. On another topic [27], addresses the problem of the effectiveness of anti-bird devices intended to reduce transmission line failures caused by bird activity. ...
Article
Full-text available
We disclose a methodology to determine the participants in discussions and their contributions in social networks with a local relationship (e.g., nationality), providing certain levels of trust and efficiency in the process. The dynamic is a challenge that has demanded studies and some approximations to recent solutions. The study addressed the problem of identifying the nationality of users in the Twitter social network before an opinion request (of a political nature and social participation). The employed methodology classifies, via machine learning, the Twitter users’ nationality to carry out opinion studies in three Central American countries. The Random Forests algorithm is used to generate classification models with small training samples, using exclusively numerical characteristics based on the number of times that different interactions among users occur. When averaging the proportions achieved by inferences of the ratio of nationals of each country, in the initial data, an average of 77.40% was calculated, compared to 91.60% averaged after applying the automatic classification model, an average increase of 14.20%. In conclusion, it can be seen that the suggested set of method provides a reasonable approach and efficiency in the face of opinion problems.
Article
In a world where every 80th person is now forcibly displaced, using big data sources to improve planning processes is no longer a question of if, but a question of how. In recent years, UNHCR has intensified its efforts to integrate a variety of data sources, ranging from satellite imagery to newspapers to online digital data, into estimates of refugees and persons of concern. These novel data sources offer UNHCR an opportunity to improve planning about early warning and acute crisis situations. This paper outlines the potential of big data for practitioners within the area of predictive work in the humanitarian sector and presents examples of how some of those data sources are currently used in the organisation. In particular, it considers the opportunities and challenges of aligning big data with UNHCR's goals of accuracy, scalability and sampling bias adjustments.
Technical Report
Full-text available
For refugees seeking to reach Europe, the digital infrastructure is as important as the physical infrastructures of roads, railways, sea crossings and the borders controlling the free movement of people. It comprises a multitude of technologies and sources: mobile apps, websites, messaging and phone calling platforms, social media, translation services, and more. The smartphone is an essential tool for refugees because it provides access to a range of news and information resources that they depend on for their survival. Access to digital resources plays a crucial role in the planning and navigating of their perilous journeys, as well as in their protection and empowerment after arrival in Europe. But despite their utility, mobile phones have a paradoxical presence in the lives of refugees – they are both a resource and a threat. The digital traces that refugees’ phones leave behind make them vulnerable to surveillance and other dangers. The research on which this report is based was conducted collaboratively by The Open University and France Médias Monde between October 2015 and April 2016. Our aim was to assess whether the provision of news and information for refugees was adequate to their needs. So much is written about refugees but little by or for them. Their voices often get drowned out in the cacophony of media and political debate about how to tackle “the refugee crisis”. The problems are exacerbated by the lack of a pan-­‐European approach to the provision of reliable, relevant and timely information. Policy and practice are uncoordinated and ineffective. There are many initiatives using apps but the field is fragmented and there is little or no collaboration. It is our common European problem. European member states alongside international news media need urgently to work together to find solutions to the worst humanitarian crises in recent history. 
There are significant ethical and practical difficulties in researching refugees, including privacy, security, trust, and informed consent. Our research team was very mindful of these problems. Most of us have had direct experience as researchers and/or workers in NGOs and refugees’ support groups. The research was carried out on a shoestring budget offered by Centre for Research on Socio-Cultural Change at the Open University. We are very grateful to the researchers who gave their time generously, some on top of already heavy workloads. Professor Heaven Crawley and her colleagues generously shared with us a database of some 500 interviews conducted as part of the MEDMIG project. As a multi-disciplinary team we are skilled in using a range of different research methods. Mixed methods enable us to offer diverse perspectives on the problem and to seek effective solutions to the information gaps that refugees face, and which often make the difference between life and death. We also involve refugees themselves in participatory research practices. We want to ensure that understanding the actual uses of technological, news and informational resources will direct any initiatives to create new resources and so contribute to their success. Refugees are not a homogenous group. The experiences of men, women and children in different places are profoundly different, as are their demographic characteristics and ideological positions as well as linguistic, social and cultural competences and digital literacy. All these factors need to shape the development of any resources for refugees. This report summarises the first of three planned phases of research. We hope this work will lead to the provision of valuable resources for refugees. This first phase feeds into plans to develop digital resources for refugees in Europe and to make recommendations to support such plans. The second and third phases will involve developing resources and monitoring and evaluating progress.
Article
With the explosion of social media sites and proliferation of digital computing devices and Internet access, massive amounts of public data are being generated on a daily basis. Efficient techniques/algorithms to analyze this massive amount of data can provide near real-time information about emerging trends and provide early warning in case of an imminent emergency (such as the outbreak of a viral disease). In addition, careful mining of these data can reveal many useful indicators of socioeconomic and political events, which can help in establishing effective public policies. The focus of this study is to review the application of big data analytics for the purpose of human development. The emerging ability to use big data techniques for development (BD4D) promises to revolutionize healthcare, education, and agriculture; facilitate the alleviation of poverty; and help to deal with humanitarian crises and violent conflicts. Besides all the benefits, the large-scale deployment of BD4D is beset with several challenges due to the massive size, fast-changing and diverse nature of big data. The most pressing concerns relate to efficient data acquisition and sharing, establishing of context (e.g., geolocation and time) and veracity of a dataset, and ensuring appropriate privacy. In this study, we provide a review of existing BD4D work to study the impact of big data on the development of society. In addition to reviewing the important works, we also highlight important challenges and open issues.
Article
The last decade has witnessed very active development in two broad, but separate fields, both involving understanding and modeling of how individuals move in time and space (hereafter called “travel behavior analysis” or “human mobility analysis”). One field comprises transportation researchers who have been working in the field for decades and the other involves newcomers from a wide range of disciplines, but primarily computer scientists and physicists. Researchers in these two fields work with different datasets, apply different methodologies, and answer different but overlapping questions. It is our view that there is much hidden synergy between the two fields that needs to be brought out. It is thus the purpose of this paper to introduce datasets, concepts, knowledge and methods used in these two fields, and most importantly raise cross-discipline ideas for conversations and collaborations between the two. It is our hope that this paper will stimulate many future cross-cutting studies that involve researchers from both fields.
Conference Paper
Data about migration flows are largely inconsistent across countries, typically outdated, and often nonexistent. Despite the importance of migration as a driver of demographic change, there is limited availability of migration statistics. Generally, researchers rely on census data to indirectly estimate flows. However, little can be inferred for specific years between censuses and for recent trends. The increasing availability of geolocated data from online sources has opened up new opportunities to track recent trends in migration patterns and to improve our understanding of the relationships between internal and international migration. In this paper, we use geolocated data for about 500,000 users of the social network website "Twitter". The data are for users in OECD countries during the period May 2011 to April 2013. We evaluated, for the subsample of users who have posted geolocated tweets regularly, the geographic movements within and between countries for independent periods of four months, respectively. Since Twitter users are not representative of the OECD population, we cannot infer migration rates at a single point in time. However, we proposed a difference-in-differences approach to reduce selection bias when we infer trends in out-migration rates for single countries. Our results indicate that our approach is relevant to address two longstanding questions in the migration literature. First, our methods can be used to predict turning points in migration trends, which are particularly relevant for migration forecasting. Second, geolocated Twitter data can substantially improve our understanding of the relationships between internal and international migration. Our analysis relies uniquely on publicly available data that could be potentially available in real time and that could be used to monitor migration trends. The Web Science community is well-positioned to address, in future work, a number of methodological and substantive questions that we discuss in this article.
Article
The bag-of-words model is one of the most popular representation methods for object categorization. The key idea is to quantize each extracted key point into one of the visual words, and then represent each image by a histogram of the visual words. For this purpose, a clustering algorithm (e.g., K-means) is generally used for generating the visual words. Although a number of studies have shown encouraging results of the bag-of-words representation for object categorization, the theoretical study of the properties of the bag-of-words model remains almost untouched, possibly due to the difficulty introduced by using a heuristic clustering process. In this paper, we present a statistical framework which generalizes the bag-of-words representation. In this framework, the visual words are generated by a statistical process rather than using a clustering algorithm, while the empirical performance is competitive with clustering-based methods. A theoretical analysis based on statistical consistency is presented for the proposed framework. Moreover, based on the framework we developed two algorithms which do not rely on clustering, while achieving competitive performance in object categorization when compared to clustering-based bag-of-words representations. Keywords: Object recognition, Bag of words model, Rademacher complexity
Article
Used to estimate the risk of an estimator or to perform model selection, cross-validation is a widespread strategy because of its simplicity and its apparent universality. Many results exist on the model selection performances of cross-validation procedures. This survey intends to relate these results to the most recent advances of model selection theory, with a particular emphasis on distinguishing empirical statements from rigorous theoretical results. As a conclusion, guidelines are provided for choosing the best cross-validation procedure according to the particular features of the problem in hand.
Book
An Introduction to Statistical Learning provides an accessible overview of the field of statistical learning, an essential toolset for making sense of the vast and complex data sets that have emerged in fields ranging from biology to finance to marketing to astrophysics in the past twenty years. This book presents some of the most important modeling and prediction techniques, along with relevant applications. Topics include linear regression, classification, resampling methods, shrinkage approaches, tree-based methods, support vector machines, clustering, and more. Color graphics and real-world examples are used to illustrate the methods presented. Since the goal of this textbook is to facilitate the use of these statistical learning techniques by practitioners in science, industry, and other fields, each chapter contains a tutorial on implementing the analyses and methods presented in R, an extremely popular open source statistical software platform. Two of the authors co-wrote The Elements of Statistical Learning (Hastie, Tibshirani and Friedman, 2nd edition 2009), a popular reference book for statistics and machine learning researchers. An Introduction to Statistical Learning covers many of the same topics, but at a level accessible to a much broader audience. This book is targeted at statisticians and non-statisticians alike who wish to use cutting-edge statistical learning techniques to analyze their data. The text assumes only a previous course in linear regression and no knowledge of matrix algebra.
Article
This paper proposes an essay concerning the understanding of human behaviours and crisis management of crowds in extreme situations, such as evacuation through complex venues. The first part focuses on the understanding of the main features of the crowd viewed as a living, hence complex system. The main concepts are subsequently addressed, in the second part, to a critical analysis of mathematical models suitable to capture them, as far as it is possible. Then, the third part focuses on the use, toward safety problems, of a model derived by the methods of the mathematical kinetic theory and theoretical tools of evolutionary game theory. It is shown how this model can depict critical situations and how these can be managed with the aim of minimizing the risk of catastrophic events.
Conference Paper
Tag clouds have become a popular visualization and navigation interface on the Web. Despite their popularity, little is known about tag cloud perception and performance with respect to different user goals. This paper presents results from a comparative study of several tag cloud layouts. The results show differences in task performance, leading to the conclusion that interface designers should carefully select the appropriate tag cloud layout according to the expected user goals. Furthermore, the analysis of eye tracking data provides insights into the visual exploration strategies of tag cloud users.
Cassar, C.M., Gauci, J.P. and Bacchi, A. 2016. Migrants' Use of Social Media in Malta. The People for Change Foundation.