Nowcasting Events from the Social Web with Statistical Learning
VASILEIOS LAMPOS and NELLO CRISTIANINI, University of Bristol, UK
We present a general methodology for inferring the occurrence and magnitude of an event or phenomenon
by exploring the rich amount of unstructured textual information on the social part of the Web. Having geo-
tagged user posts on the microblogging service of Twitter as our input data, we investigate two case studies.
The first consists of a benchmark problem, where actual levels of rainfall in a given location and time are
inferred from the content of tweets. The second one is a real-life task, where we infer regional Influenza-like Illness rates in an effort to detect an emerging epidemic disease in a timely manner. Our analysis builds on a
statistical learning framework, which performs sparse learning via the bootstrapped version of LASSO to
select a consistent subset of textual features from a large amount of candidates. In both case studies, selected
features indicate close semantic correlation with the target topics, and inference, conducted by regression, achieves significant performance, especially given the short length (approximately one year) of Twitter's data time series.
Categories and Subject Descriptors: G.3 [Probability and Statistics]: Statistical computing; I.2.6 [Artificial Intelligence]: Learning; I.5.4 [Pattern Recognition]: Applications—Text processing; J.3 [Life and Medical Sciences]: Medical information systems
General Terms: Algorithms, Design, Experimentation, Measurement, Performance
Additional Key Words and Phrases: Event detection, feature selection, LASSO, social network mining,
sparse learning, Twitter
ACM Reference Format:
Lampos, V. and Cristianini, N. 2012. Nowcasting events from the social web with statistical learning. ACM
Trans. Intell. Syst. Technol. 3, 4, Article 72 (September 2012), 22 pages.
DOI = 10.1145/2337542.2337557 http://doi.acm.org/10.1145/2337542.2337557
1. INTRODUCTION
It has not been a long time since snapshots of real life started to appear on the social
side of the Web. Social networks such as Facebook and Twitter have grown stronger,
forming an electronic substitute for public expression and interaction. Twitter, in particular, counting a total of 200 million users worldwide,¹ came up with a convention that encouraged users to make their posts, commonly known as tweets, by default publicly available. Tweets being limited to a length of 140 characters (similar to mobile phone text messages) forced their authors to produce more topic-specific statements.
¹Based on an email update titled 'Get the most out of Twitter in 2011', sent by Twitter Inc. to its users (February 1, 2011).
V. Lampos would like to thank NOKIA Research, EPSRC (DTA/SB1826) and the Computer Science Depart-
ment (University of Bristol) for all the various levels of support. N. Cristianini is supported by a Royal
Society Wolfson Merit Award.
Authors’ address: V. Lampos and N. Cristianini, Level 0, Intelligent Systems Laboratory, Merchant Ventur-
ers Building, Woodland Road, BS8 1UB, Bristol, UK; email: bill.lampos@gmail.com.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that
copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights
for components of this work owned by others than ACM must be honored. Abstracting with credit is per-
mitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component
of this work in other works requires prior specific permission and/or a fee. Permission may be requested
from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701, USA, fax +1 (212)
869-0481, or permissions@acm.org.
© 2012 ACM 2157-6904/2012/09-ART72 $15.00
DOI 10.1145/2337542.2337557 http://doi.acm.org/10.1145/2337542.2337557
ACM Transactions on Intelligent Systems and Technology, Vol. 3, No. 4, Article 72, Publication date: September 2012.
By adding the user's location when posting (via the mobile phone service provider or IP
address) to this piece of public information, Twitter ushered in a new era for social
Web media and at the same time enabled a new wave of experimentation and research
on text stream mining. Now, it has been shown that this vast amount of data encap-
sulates useful signals driven by our everyday life and therefore, statistical learning
methods could be applied to extract them (several examples are provided in Section 2).
The term nowcasting, commonly used in finance, expresses the fact that we are making inferences regarding the current magnitude M(ε) of an event ε. For a time interval u = [t − Δt, t], where t denotes the current time instance, consider M_ε(u) as a latent variable. The Web content W(u) for this time interval is a partially observed variable; in particular, data from a social network, denoted as S(u) ⊆ W(u), are being observed. In this work, S(u) is used to directly infer M_ε(u). For short time intervals u, we are inferring the present value of the latent variable, that is, we are nowcasting the magnitude of an event. We have already presented preliminary results on a methodology for tracking the level of a flu epidemic from Twitter content using unigrams [Lampos and Cristianini 2010] and demonstrated an online tool² for this purpose [Lampos
et al. 2010]. Here we extend our previous findings and present a general framework
for exploiting user input published in social media.
Sparse learning enables us to select a consistent set of features (e.g., unigrams or
bigrams) and then use it to perform inference via regression. The performance of the
proposed methodology is evaluated by investigating two case studies. In the first, we
infer the daily amount of rainfall in five UK locations by using tweets; this forms a
benchmark problem testing the limits of our approach given that rainfall has a very
inconsistent behavior in the UK [Jenkins et al. 2008]. Ground truth consists of rainfall
observations taken from weather stations located in the vicinity of the target locations.
The second case study focuses on inferring the level of Influenza-like Illness (ILI) in the
population of three UK regions based again on geolocated Twitter content. Results are
validated by being compared with actual ILI rates measured by the Health Protection
Agency (HPA).³ In both case studies, experimental results are very positive in terms of
the semantic correlation between selected features and target topics, and strong given
the general inference performance.
The specific procedure that we followed, namely using Bolasso [Bach 2008] for feature selection from a large set of candidates, has proven to work better than another relevant state-of-the-art approach [Ginsberg et al. 2008], but the general
claim is that statistical learning techniques can be deployed for the selection of
features and, at the same time, for the inference of a useful statistical estimator. Com-
parisons with other variants of Machine Learning methods may be of interest, though
they would not change the main message: that one can learn the estimator from data,
by means of supervised learning. In the case of ILI, other methods (e.g., Corley et al.
[2009]; Polgreen et al. [2008]) propose to simply count the frequency of the disease
name. This can work well when people can diagnose their own disease (maybe easier
in some cases than others) and no other confounding factors exist. However, from our
experimental results (see Sections 5, 6, and 7), one can conclude that this is not an
optimal choice. Furthermore, it is not obvious that a function of Twitter content should
correlate with the actual health state of a population. There are various possible
sampling biases that may prevent this signal from emerging. An important result of
this study is that we find that it is possible to make up for any such bias by calibrating
the estimator on a large dataset of Twitter posts and actual HPA readings; similar
²Flu Detector, http://geopatterns.enm.bris.ac.uk/epidemics/.
³HPA's weekly epidemiological updates archive is available at http://goo.gl/wJex.
results are derived for the rainfall case study. While it is true that Twitter users do
not represent the general population and Twitter content might not represent any
particular state of theirs, we find that actual states of the general population (health
or weather oriented) can be inferred as a linear function of the signal in Twitter.
The content of this article is laid out as follows: related work and background theo-
retical foundations are provided in Section 2; the proposed methodology, the perfor-
mance evaluation procedure and the baseline approach, to which we compare our
results, are described in Section 3; Section 4 is concerned with the technical details
of information collection and retrieval explaining how Twitter is sampled and also de-
fines the classes of features used in our approach; Sections 5 and 6 include a detailed
presentation and analysis of the experimental results for the case studies of rainfall
and flu nowcasting respectively; finally, Section 7 further discusses the derivations of
this work, followed by the conclusions and future work in Section 8.
2. RELATED WORK AND THEORETICAL FOUNDATIONS
2.1. Related Work in Mining User-Generated Content
Recent work has concentrated on exploiting user-generated Web content for conducting several types of inference. A significant subset of papers, examples of which
are given in this paragraph, focuses on methodologies that are based either on man-
ually selected textual features related to a latent event, for instance, flu related key-
words, or the application of sentiment/mood analysis, which in turn implies the use of
predefined vocabularies, where words or phrases have been mapped to sentiment or
mood scores [Pang and Lee 2008]. Corley et al. [2009] reported a 76.7% correlation between official ILI rates and the frequency of certain hand-picked influenza related words in blog posts, whereas similar correlations were shown between user search queries that included illness related words and CDC⁴ rates [Polgreen et al. 2008]. Furthermore, sentiment analysis has been applied in the effort of
extracting voting intentions [Tumasjan et al. 2010] or box-office revenues [Asur and
Huberman 2010] from Twitter content. Similarly, mood analysis combined with a non-
linear regression model derived an 87.6% correlation with daily changes in Dow Jones
Industrial Average closing values [Bollen et al. 2011]. Finally, Sakaki et al. [2010]
presented a method that exploited the content, time stamp and location of a tweet to
detect the existence of an earthquake.
However, in other approaches feature selection is performed automatically by ap-
plying statistical learning methods. Apart from the obvious advantage of reducing
human involvement to a minimum, those methods tend to have an improved inference
performance as they are enabled to explore the entire feature space or, in general, a
greater amount of candidate features [Guyon and Elisseeff 2003]. In Ginsberg et al.
[2008] Google researchers proposed a model able to automatically select flu related
user search queries, which later on were used in the process of tracking ILI rates.
Their method, a core component of Google Flu Trends, achieved an average correla-
tion of 90% with CDC data, much higher than any other previously reported method.
An extension of this approach has been applied on Twitter data achieving a 78% cor-
relation with CDC rates [Culotta 2010]. In both those works, features were selected
based on their individual correlation with ILI rates; the subset of candidate features
(user search queries or keywords) appearing to independently have the highest lin-
ear correlations with the target values formed the result of feature selection. Another
technique, part of our preliminary results, which applied sparse regression on Twitter content for automatic feature selection, resulted in a greater than 90% correlation
⁴Centers for Disease Control and Prevention (CDC), http://www.cdc.gov/.
with HPA’s flu rates for several UK regions [Lampos and Cristianini 2010]; an im-
proved version of this methodology has been incorporated in Flu Detector [Lampos
et al. 2010], an online tool for inferring flu rates based on tweets.
Besides minor differences regarding the information extraction and retrieval
techniques or the datasets considered, the fundamental distinction between Ginsberg et al. [2008] or Culotta [2010] and Lampos and Cristianini [2010] lies in the feature
selection principle; a sparse regressor, such as LASSO, does not handle each candidate
feature independently but searches for a subset of features that satisfies its constraints
[Tibshirani 1996] (see Section 2.2). In this work, we extend and generalize the method-
ology and preliminary results presented in Lampos and Cristianini [2010]. The main
theoretical concept is again feature selection by sparse learning, though we aim to
make this selection consistent considering, at the same time, more types of features.
2.2. Bootstrapped LASSO for Feature Selection
Least Absolute Shrinkage and Selection Operator (LASSO), presented in Tibshirani
[1996], being a constrained version of ordinary least squares (OLS) regression, provides a sparse regression estimate β* computed by solving the following optimization problem:

\[
\beta^{*} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Big)^{2} \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t, \qquad (1)
\]

where the x's denote the input data (N observations of p variables), the y's are the N target values, the β's are the p coefficients or weights, β₀ is the regression bias, and t ≥ 0 is referred to as the regularization or shrinkage parameter, since it controls the regularization (or shrinkage) amount on the L1-norm of the β's. Least Angle Regression (LARS) provides an
efficient algorithm for computing the entire regularization path of LASSO [Efron et al.
2004], that is, all LASSO solutions for different choices of the regularization parameter
t. However, it has been shown that LASSO selects more variables than necessary [Lv
and Fan 2009] and that in many settings it performs an inconsistent model selection
[Zhao and Yu 2006].
Bootstrap, presented in Efron [1979], was introduced as a method for assessing the
accuracy of a prediction but has also found applications in improving the prediction
itself (see for example Bagging [Breiman 1996]). Suppose that we aim to fit a model to
a training dataset T. The basic idea of bootstrapping is to draw n random datasets B with replacement from T, forcing each sample to have the same size as |T|; the drawn datasets are referred to as bootstraps. Then, refit the model on each element of B
and examine the behavior of the fits [Efron and Tibshirani 1993]. The bootstrapped
version of LASSO, conventionally named as Bolasso, intersects the supports of LASSO
bootstrap estimates and addresses its model selection inconsistency problems [Bach
2008]. Throughout this work we have applied Bolasso’s soft version (see Sections 3.1
and 3.2) in our effort to select a consistent subset of textual features.
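To make the bootstrapped selection concrete, the following Python sketch (our own illustration, not the authors' code) solves LASSO by iterative soft-thresholding (ISTA, standing in here for the LARS solver used in the paper) and counts, per feature, in how many bootstraps the weight is nonzero; a consensus fraction below 1.0 gives the soft variant.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Minimise (1/2n)||Xw - y||^2 + lam*||w||_1 via iterative soft-thresholding."""
    n, p = X.shape
    w = np.zeros(p)
    step = n / (np.linalg.norm(X, 2) ** 2)       # 1/L, L = Lipschitz constant of the gradient
    for _ in range(n_iter):
        w = w - step * (X.T @ (X @ w - y)) / n   # gradient step on the squared loss
        w = np.sign(w) * np.maximum(np.abs(w) - lam * step, 0.0)  # soft-threshold (prox of L1)
    return w

def bolasso_support(X, y, lam, n_boot=32, consensus=1.0, seed=0):
    """Indices of features with nonzero LASSO weight in >= consensus fraction of bootstraps."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)              # bootstrap sample, same size as the data
        counts += np.abs(lasso_ista(X[idx], y[idx], lam)) > 1e-8
    return np.flatnonzero(counts >= consensus * n_boot)
```

With consensus = 1.0 this intersects all bootstrap supports (strict Bolasso); lowering the fraction yields the Bolasso-S behaviour applied throughout this work.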
3. GENERAL METHODOLOGY
In this section, a general description of the proposed methodology is given, introducing the notation that is going to be used throughout this article. An abstract summary of
the methodology includes the following three main operations.
(1) Candidate Feature Extraction. A vocabulary of candidate features is formed by using n-grams, that is, phrases with n tokens. We also refer to those n-grams as markers. Markers are extracted from text, which is expected to contain topic-related
words, for instance, Web encyclopedias as well as other more informal references.
By construction, the set of extracted candidates contains many features relevant to the target topic and many more with no direct semantic connection.
(2) Vector Space Representation. For a fixed time period and set of locations, the Vector
Space Representation (VSR) of the candidate features is computed from the text
corpus using a scheme based on Term Frequencies (TF). For the same time period
and locations, the VSR of the target topic is obtained from an authoritative source.
(3) Feature Selection and Inference. A subset of the candidate features is selected by
applying a sparse regression method. In our experiments, we have applied Bolasso
to select a consistent set of features; the weights of the selected features are then
learnt via OLS regression on the reduced input space. The selected features and
their weights are used to perform inferences.
3.1. Formal Description
We denote the set of candidate n-grams as C = {c_i}, i ∈ {1, ..., |C|}. The retrieved user posts (or tweets) for a time instance u are denoted as P(u) = {p_j}, j ∈ {1, ..., |P(u)|}. A boolean function g indicates whether a candidate marker c_i is contained in a user post p_j or not:

\[
g(c_i, p_j) = \begin{cases} 1 & \text{if } c_i \in p_j, \\ 0 & \text{otherwise.} \end{cases} \qquad (2)
\]

Given the user posts P(u), we compute the score s of a candidate marker c_i as follows:

\[
s\big(c_i, P(u)\big) = \frac{\sum_{j=1}^{|P(u)|} g(c_i, p_j)}{|P(u)|}. \qquad (3)
\]

Therefore, the score of a candidate marker is the number of tweets containing this marker divided by the total number of tweets for a predefined time interval. The scores of all candidate markers for the same time interval u are kept in a vector x given by:

\[
x(u) = \big[\, s\big(c_1, P(u)\big) \;\cdots\; s\big(c_{|C|}, P(u)\big) \,\big]^{T}. \qquad (4)
\]
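As a minimal illustration of Equations (2) and (3), the following Python sketch (names and representation are our own) treats an n-gram marker as a list of tokens and a post as a whitespace-tokenised string:

```python
def g(marker, post_tokens):
    # Eq. (2): 1 if the n-gram occurs as a contiguous token sequence in the post
    n = len(marker)
    return any(post_tokens[k:k + n] == marker for k in range(len(post_tokens) - n + 1))

def score(marker, posts):
    # Eq. (3): fraction of posts that contain the marker
    return sum(g(marker, post.split()) for post in posts) / len(posts)
```

For example, with posts = ["heavy rain in bristol", "sunny all day", "rain again"], score(["rain"], posts) is 2/3 and score(["heavy", "rain"], posts) is 1/3.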
In our study, u takes the length of a day d; from this point onwards, consider a time interval equal to the duration of a day. However, u's length is a matter of choice depending on the inference task at hand.

For a set of days D = {d_k}, k ∈ {1, ..., |D|}, and given P(d_k) ∀k, we compute the scores of the candidate markers C. Those are held in a |D| × |C| array X(D):

\[
X(D) = \big[\, x(d_1) \;\cdots\; x(d_{|D|}) \,\big]^{T}. \qquad (5)
\]

For the same set of |D| days, we retrieve the values of the target variable y(D):

\[
y(D) = \big[\, y_1 \;\cdots\; y_{|D|} \,\big]^{T}. \qquad (6)
\]
X(D) and y(D) are used as an input to Bolasso. In each bootstrap, LASSO selects a subset of the candidates and, at the end, Bolasso, by intersecting the bootstrap outcomes, attempts to make this selection consistent. LASSO is formulated as follows:

\[
\min_{w} \; \big\| X(D)\,w - y(D) \big\|_2^2 \quad \text{s.t.} \quad \|w\|_1 \le t, \qquad (7)
\]
where t is the regularization parameter controlling the shrinkage of w's L1-norm. In turn, t can be expressed as

\[
t = \alpha \cdot \| w_{\text{OLS}} \|_1, \quad \alpha \in (0, 1], \qquad (8)
\]

where w_OLS is the OLS regression solution and α denotes the desired shrinkage percentage of w_OLS's L1-norm. Bolasso's implementation applies LARS, which is able to explore the entire regularization path at the cost of one matrix inversion, and decides the value of the regularization parameter (t or α) using the largest consistent region, that is, the largest continuous range on the regularization path where the set of selected variables remains the same [Efron et al. 2004].
After selecting a subset F = {f_i}, i ∈ {1, ..., |F|} of the feature space, where F ⊆ C, the VSR of the initial vocabulary X(D) is reduced to an array Z(D) of size |D| × |F|. We learn the weights of the selected features by performing OLS regression:

\[
\min_{w_s} \; \big\| Z(D)\,w_s + \beta - y(D) \big\|_2^2, \qquad (9)
\]

where the vector w_s denotes the learned weights for the selected features and the scalar β is the regression's bias term.
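The OLS refit of Equation (9) reduces to a least-squares solve with an appended bias column; a small numpy sketch (function name and setup are our own):

```python
import numpy as np

def ols_refit(Z, y):
    # Solve min_{w_s, beta} ||Z w_s + beta - y||_2^2 (Eq. 9) via least squares
    A = np.hstack([Z, np.ones((Z.shape[0], 1))])  # bias column appended to the reduced VSR
    sol, *_ = np.linalg.lstsq(A, y, rcond=None)
    return sol[:-1], sol[-1]                      # (weights w_s, bias beta)
```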
It is important to notice that statistical bounds exist linking LASSO's expected performance to the one derived on the training set (empirical), the number of dimensions, the number of training samples and the 1-norm of w. For example, in Bartlett et al. [2009] it is shown that LASSO's expected loss L(w), up to polylogarithmic factors in W₁, |C| and |D|, is bounded by

\[
L(w) \le \hat{L}(w) + Q, \quad \text{with} \quad Q \le \min\left( \frac{W_1^2}{|D|} + \sqrt{\frac{|C|}{|D|}}, \;\; \frac{W_1^2}{|D|} + \frac{W_1}{\sqrt{|D|}} \right), \qquad (10)
\]

where \hat{L}(w) denotes the empirical loss, |C| is the number of candidate features, |D| is the number of training samples and W₁ is an upper bound for the 1-norm of w, that is, ‖w‖₁ ≤ W₁. Therefore, to minimize the prediction error using a fixed set of training samples and given that the empirical error is relatively small, one should either reduce the dimensionality of the problem (|C|) or increase the shrinkage of w's 1-norm (which intuitively might result in sparser solutions).
3.2. Consensus Threshold and Performance Evaluation
A strict application of Bolasso implies that only features with a nonzero weight in all
bootstraps are going to be considered. In our methodology a soft version of Bolasso is
applied (named Bolasso-S in Bach [2008]), where features are considered if they acquire a nonzero weight in a fraction of the bootstraps, which is referred to as the Consensus Threshold (CT). CT ranges in (0, 1] and is obviously equal to 1 in the strict application
of Bolasso. The value of CT, expressed by a percentage, is decided using a validation
set. To constrain the computational complexity of the learning phase, we consider 21
discrete CTs from 50% to 100% with a step of 2.5%.
Overall, performance evaluation includes three steps: (a) training, where for each
CT we retrieve a set of selected features from Bolasso and their weights from OLS
regression, (b) validating CT, where we select the optimal CT value based on a valida-
tion set, and (c) testing, where the performance of our previous choices is computed.
Training, validation and testing sets are by definition disjoint from each other.
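The three evaluation steps can be sketched generically as follows; train_fn and loss_fn are hypothetical placeholders standing in for Bolasso training and MSE computation, and are not part of the original system:

```python
def evaluate(train, val, test, cts, train_fn, loss_fn):
    # (a) training: one feature set / weight vector per consensus threshold
    models = {ct: train_fn(train, ct) for ct in cts}
    # (b) validation: keep the CT whose model minimises the validation loss
    best_ct = min(cts, key=lambda ct: loss_fn(models[ct], val))
    # (c) testing: report the loss of that single choice on the disjoint test set
    return best_ct, loss_fn(models[best_ct], test)
```

The three datasets passed in are disjoint by construction, matching the protocol above.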
ALGORITHM 1: Baseline Method: Feature Selection via Correlation Analysis
Input: C_{1:n}, X^{(train)}_{[1:m, 1:n]}, y^{(train)}_{1:m}, X^{(val)}_{[1:m, 1:n]}, y^{(val)}_{1:m}
Output: Ĉ_{1:p}
ρ_{1:n} ← correlation(X^{(train)}_{[1:m, 1:n]}, y^{(train)}_{1:m});
ρ̂_{1:n} ← descendingCorrelationIndex(ρ_{1:n});
Ĉ_{1:n} ← C_{ρ̂_{1:n}};
for i ← 1 to k do
    L_i ← validate(X^{(train)}_{[1:m, ρ̂_{1:i}]}, y^{(train)}_{1:m}, X^{(val)}_{[1:m, ρ̂_{1:i}]}, y^{(val)}_{1:m});
end
p ← arg min_i L_i;
return Ĉ_{1:p};
The Mean Squared Error (MSE) between inferred (Xw) and target values (y) forms the loss (L) during all steps. For a sample of size |D| this is defined as:

\[
L(w) = \frac{1}{|D|} \sum_{i=1}^{|D|} \ell(x_i w, y_i), \qquad (11)
\]

where the loss function \ell(x_i w, y_i) = (x_i w - y_i)^2. The Root Mean Squared Error (RMSE), the square root of the MSE, has been used as a more comprehensible metric (it has the same units as the target variables) for presenting results in Sections 5 and 6.
To summarise CT's validation, suppose that for all considered consensus thresholds CT_i, i ∈ {1, ..., 21}, training yields F_i sets of selected features respectively, whose losses on the validation set are denoted by L^{(val)}_i. Then, if i^{(*)} denotes the index of the selected CT and set of features, it is given by:

\[
i^{(*)} = \arg\min_{i} \; L^{(val)}_i. \qquad (12)
\]

Therefore, CT_{i^{(*)}} is the result of the validation process and F_{i^{(*)}} is used in the testing phase.
Taking into consideration that both target values (rainfall and flu rates) can only
be zero or positive, we threshold the negative inferred values with zero during testing,
that is, x_i ← max{x_i, 0}. We perform this filtering only in the testing phase; during CT's validation, we want to keep track of deviations in the negative space as well.
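The loss computation and the zero-thresholding step might look as follows in numpy (a sketch under our own naming):

```python
import numpy as np

def mse(y_pred, y_true):
    return float(np.mean((y_pred - y_true) ** 2))   # Eq. (11)

def rmse(y_pred, y_true):
    return mse(y_pred, y_true) ** 0.5               # same units as the target variable

def infer(X, w, testing=False):
    y_hat = X @ w
    # negative inferred rates are clipped to zero, but only in the testing phase
    return np.maximum(y_hat, 0.0) if testing else y_hat
```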
As part of the evaluation process, we compare our results with a baseline approach
that encapsulates the methodologies in Ginsberg et al. [2008] and Culotta [2010].
Those approaches, as explained in Section 2, mainly differ in the feature selection process that is performed via correlation analysis (Algorithm 1). Briefly, given a set C of n candidate features, their computed VSRs for training and validation, X^{(train)} and X^{(val)}, and the corresponding response values, y^{(train)} and y^{(val)}, this feature selection process:
a) computes the Pearson correlation coefficients (ρ) between each candidate feature
and the response values in the training set, b) ranks the retrieved correlation coeffi-
cients in descending order, c) computes the OLS-fit loss (L) of incremental subsets of
the top-k correlated terms on the validation set and d) selects the subset of candidate
features with the minimum loss. The inference performance of the selected features is
evaluated on a (disjoint) test set.
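A numpy sketch of the baseline (Algorithm 1); function and variable names are our own:

```python
import numpy as np

def baseline_select(X_tr, y_tr, X_val, y_val, k):
    # a) Pearson correlation of every candidate feature with the target
    Xc = X_tr - X_tr.mean(axis=0)
    yc = y_tr - y_tr.mean()
    rho = (Xc * yc[:, None]).sum(axis=0) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    order = np.argsort(-rho)                 # b) rank in descending order
    losses = []
    for i in range(1, k + 1):                # c) OLS-fit loss of incremental subsets
        cols = order[:i]
        A = np.hstack([X_tr[:, cols], np.ones((len(y_tr), 1))])
        w, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
        A_val = np.hstack([X_val[:, cols], np.ones((len(y_val), 1))])
        losses.append(np.mean((A_val @ w - y_val) ** 2))
    p = int(np.argmin(losses)) + 1           # d) subset with minimum validation loss
    return order[:p]
```

Unlike Bolasso, each candidate is scored independently here, which is exactly the distinction drawn in Section 2.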
4. DATA COLLECTION AND INFORMATION RETRIEVAL
For the experimental purposes of this work we use millions of tweets collected via Twit-
ter’s Search API and ground truth from authoritative sources. Based on the fact that
information geolocation is a key concept in both case studies, we are considering only
tweets tagged with the location (longitude and latitude coordinates) of their author.
We use UK’s 54 most populated urban centers and collect tweets geolocated within a
10km range from each one of them. Our crawler exploits Atom feeds and periodically
retrieves the 100 most recent tweets per urban center.⁵ The time interval between consecutive queries for an urban center varied from 5 to 10 minutes but has been stable
on a daily basis and always the same for all locations. Therefore, a sampling method is
carried out during collection; we try to reduce sampling biases (for the purposes of our
work) by using the same sampling frequency per urban center. Collecting all tweets,
apart from being a much more resource demanding process, would also have resulted
in exceeding data collection limits set by Twitter. Nonetheless, the daily number of
collected tweets (more than 200,000) is considered adequate for the experimental part
of this work.
All collected tweets are stored and indexed in a MySQL database. Text preprocessing, such as stemming by applying Porter's Algorithm for the English language [Porter 1980] and stop word and punctuation removal, as well as the computation of VSRs, is performed by our software libraries. VSRs are formed using a TF binary vector space
model as already described in Section 3.1.
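A minimal stand-in for this preprocessing pipeline (the stop word list is an illustrative subset of our own choosing, and a real implementation would additionally apply Porter stemming, e.g. via nltk's PorterStemmer, omitted here to keep the sketch dependency-free):

```python
import re
import string

STOP_WORDS = {"the", "a", "an", "and", "or", "to", "is", "in", "it", "of", "on"}

def preprocess(tweet):
    # lowercase, replace punctuation with spaces, tokenise, drop stop words
    text = re.sub("[%s]" % re.escape(string.punctuation), " ", tweet.lower())
    return [tok for tok in text.split() if tok not in STOP_WORDS]
```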
Candidate features, that is, the pool or vocabulary of n-grams on which feature selection is applied, are extracted from encyclopedic, scientific, or more informal Web references related to the inference topic.⁶ By performing feature extraction in this
way, we secure the existence of good candidate features, but we are also enabled to
test the feature selection capability of our method, since most candidates are not di-
rectly related to the target topic. A typical information retrieval approach would have
implied the creation of a vocabulary index from the entire Twitter corpus [Manning
et al. 2008]; our choice is extensively justified in the discussion section. Neverthe-
less, acquired results indicate that our simplification in the feature extraction process
results in a significant inference performance.
4.1. Feature Classes
Three classes of candidate features have been investigated: unigrams or 1-grams (de-
noted by U), bigrams or 2-grams (B) and a hybrid combination (H) of 1-grams and
2-grams. 1-grams being single words cannot be characterized by a consistent semantic
interpretation in most of the topics. They take different meanings and express distinct
outcomes based on the surrounding textual context. 2-grams on the other hand can
be more focused semantically. However, their frequency in a corpus is expected to be
lower than the one of 1-grams. Particularly, in the Twitter corpus, which consists of
very short pieces of text (tweets are at most 140 characters long), their occurrences are
expected to be sometimes close to zero.
The hybrid class of features exploits the advantages of classes U and B and reduces the impact of their disadvantages. It is formed by combining the training results of U and B for all CTs. Validation and testing are performed on the combined datasets.
Suppose that for all considered consensus thresholds CT_i, i ∈ {1, ..., |CT|}, the 1-grams
⁵For an area formed by a center with coordinates (latitude and longitude) (X, Y) and a radius of R km, the N most recent tweets written in the English language are retrieved by performing the following query: http://search.twitter.com/search.atom?geocode=X,Y,R&lang=en&rpp=N.
⁶Lists of Web references and extracted features for the investigated case studies in this article are available at http://geopatterns.enm.bris.ac.uk/twitter/.
and 2-grams selected via Bolasso are denoted by F^{(U)}_i and F^{(B)}_i respectively. Then, the pseudo-selected n-grams for all CTs for the hybrid class, F^{(H)}_i, are formed by their union, F^{(H)}_i = F^{(U)}_i ∪ F^{(B)}_i, i ∈ {1, ..., |CT|}. Likewise, Z^{(H)}_i = Z^{(U)}_i ∪ Z^{(B)}_i, i ∈ {1, ..., |CT|}, where Z denotes the VSR of each feature class (using Section 3.1's notation). Validation and testing are performed on Z^{(H)}_i as has already been described in Section 3.2.
Note that compiling an optimal hybrid scheme is not the main focus here; our aim is
to investigate whether a simple combination of 1-grams and 2-grams is able to deliver
better results. The experimental results (see Sections 5 and 6) do indeed indicate that
feature class H performs on average better than U and B.
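The hybrid combination reduces to a per-threshold union of the two selections; a one-line Python sketch (names our own):

```python
def hybrid_class(F_U, F_B):
    # F^(H)_i = F^(U)_i | F^(B)_i for every consensus threshold i
    return [set(f_u) | set(f_b) for f_u, f_b in zip(F_U, F_B)]
```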
5. NOWCASTING RAINFALL RATES FROM TWITTER
In the first case study, we exploit the content of Twitter to infer daily rainfall rates
(measured in millimetres of precipitation) for five UK cities, namely Bristol, London,
Middlesbrough, Reading and Stoke-on-Trent. The choice of those locations has been
based on the availability of ground truth, that is, daily rainfall measurements from
weather stations installed in their vicinity.
We consider the inference of precipitation levels at a given time and place as a
good benchmark problem, in that it has many of the properties of other more useful
scenarios, while still allowing us to verify the performance of the system, since rainfall
is a measurable variable. The event of rain is a piece of information available to the
significant majority of Twitter users and affects various activities that could form a
discussion topic in tweets. Furthermore, predictions about it are not always easy due
to its nonsmooth behavior [Jenkins et al. 2008].
The candidate markers for this case study are extracted from weather-related Web references, such as Wikipedia's page on Rainfall, an English language course on weather vocabulary, a page with formal weather terminology and several others. As already mentioned in Section 4, the majority of the extracted candidate features are not directly related to the target topic, but there exists a subset of markers that could offer a good semantic interpretation. Markers with a count ≤ 10 in the Twitter corpus used for this case study are removed. Hence, of the 2381 extracted 1-grams, 2159 have been kept as candidates; likewise, the 7757 extracted 2-grams have been reduced to 930.
5.1. Experimental Settings
A year of Twitter data and rainfall observations (from July 1, 2009 to June 30, 2010) formed the input data for this experiment. For this time period and the considered locations, 8.5 million tweets have been collected. In each run of Bolasso, the number of bootstraps is proportional to the size of the training sample (approximately 13%, following the same principle as in Bach [2008]), and in every bootstrap we select at most 300 features by performing at most 900 iterations. A bootstrap is completed as soon as one of those two stopping criteria is met. This is an essential trade-off that guarantees a quicker execution of the learning phase, especially when dealing with large amounts of data.
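The bootstrap-with-consensus-threshold scheme can be sketched as follows. The solver, data, and parameter values below are illustrative assumptions (the paper uses a LARS-based LASSO solver and scales the number of bootstraps with the training-sample size); only the overall scheme — run LASSO on bootstrap resamples and keep features selected in at least a CT fraction of them — follows the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def lasso_ista(X, y, lam, max_iter=900):
    """Plain ISTA solver for the LASSO objective (1/2n)||y - Xw||^2 + lam*||w||_1.
    A stand-in for the paper's LARS-based solver; the 900-iteration cap
    mirrors the paper's stopping rule."""
    n, p = X.shape
    step = n / (np.linalg.norm(X, 2) ** 2)   # 1/L, L = Lipschitz constant of the gradient
    w = np.zeros(p)
    for _ in range(max_iter):
        grad = X.T @ (X @ w - y) / n
        w = w - step * grad
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft-threshold
    return w

def bolasso_select(X, y, lam=0.05, n_boot=20, ct=0.9):
    """Soft Bolasso: keep features whose LASSO coefficient is nonzero in at
    least a fraction `ct` (the consensus threshold, CT) of bootstrap resamples."""
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)     # bootstrap: resample rows with replacement
        counts += np.abs(lasso_ista(X[idx], y[idx], lam)) > 1e-8
    return np.flatnonzero(counts / n_boot >= ct)

# Toy data: only the first two of ten candidate features carry signal.
n, p = 200, 10
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=n)
selected = bolasso_select(X, y)
```

With CT = 0.9, a feature must survive 18 of the 20 resamples, which filters out features that LASSO picks up only on particular resamples by chance.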
The performance of each feature class is computed by applying a 6-fold cross validation. Each fold is based on 2 months of data, starting from the month pair July-August (2009) and ending with May-June (2010). In every step of the cross validation, 5 folds are used for training, the first half (one month of data) of the remaining fold is used for validating CT, and the second half for testing the performance of the selected markers and their weights. Training is performed by using the VSRs of all five locations in a batch dataset; CT's validation is carried out on the same principle (i.e., we learn the same markers and weights under the same CT for all locations); and, lastly, testing is done both on the batch dataset (to retrieve a total performance evaluation) and on each location separately. Finally, we also compute the inference performance of the baseline approach for feature selection (Algorithm 1) on the same training, validation, and testing sets, considering the top k = 300 correlated terms.

Table I. Nowcasting Rainfall Rates – Derived Consensus Thresholds and numbers of selected features (in parentheses) for all Feature Classes (FC) in the rounds of 6-fold cross validation. Fold i denotes the validation/testing fold of round 7 − i.

FC   Fold 6     Fold 5      Fold 4    Fold 3      Fold 2    Fold 1
U    100% (4)   92.5% (19)  90% (17)  92.5% (12)  90% (17)  75% (28)
B    90% (21)   67.5% (10)  95% (10)  67.5% (15)  90% (9)   62.5% (38)
H    100% (8)   92.5% (27)  95% (21)  92.5% (27)  90% (26)  52.5% (131)

Table II. Nowcasting Rainfall Rates – RMSEs (in mm) for all Feature Classes (FC) and locations in the rounds of 6-fold cross validation. Fold i denotes the validation/testing fold of round 7 − i. The last column holds the RMSEs of the baseline (BS) method.

Location        FC  Fold 6  Fold 5  Fold 4  Fold 3  Fold 2  Fold 1  Mean RMSE  BS-Mean RMSE
Bristol         U   1.164   1.723   1.836   2.911   1.607   2.348   1.931      2.173
                B   1.309   1.586   2.313   3.371   1.59    1.409   1.93       2.218
                H   1.038   1.631   2.334   2.918   1.579   2.068   1.928      2.094
London          U   1.638   1.507   5.079   2.582   1.62    6.261   3.115      3.297
                B   1.508   5.787   4.887   3.403   1.478   6.568   3.939      4.305
                H   1.471   1.526   4.946   2.813   1.399   6.13    3.047      4.101
Middlesbrough   U   4.665   1.319   3.102   2.618   2.949   2.536   2.865      2.951
                B   4.355   1.069   3.379   2.22    2.918   2.793   2.789      2.946
                H   4.47    1.098   3.016   2.504   2.785   2.353   2.704      6.193
Reading         U   2.075   1.566   2.087   2.393   1.981   2.066   2.028      2.168
                B   0.748   2.74    1.443   3.016   1.572   3.429   2.158      2.159
                H   1.636   1.606   1.368   2.571   1.695   2.145   1.836      2.214
Stoke-on-Trent  U   3.46    1.932   1.744   4.375   2.977   1.962   2.742      2.855
                B   3.762   1.493   1.433   2.977   2.447   2.668   2.463      2.443
                H   3.564   1.37    1.499   3.785   2.815   1.931   2.494      2.564
Total RMSE      U   2.901   1.623   3.04    3.062   2.31    3.443   2.73       2.915
                B   2.745   3.062   2.993   3.028   2.083   3.789   2.95       3.096
                H   2.779   1.459   2.937   2.954   2.145   3.338   2.602      4.395
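The baseline selector (Algorithm 1) is only referenced here; a minimal sketch of one plausible reading — rank candidate term frequencies by absolute Pearson correlation with the target and keep the top k — is given below. The function name and toy data are our own assumptions, not the paper's exact algorithm:

```python
import numpy as np

def topk_correlated(X, y, k):
    """Baseline selector sketch: rank columns of X (candidate term
    frequencies) by absolute Pearson correlation with target y and
    return the indices of the top k."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.argsort(-np.abs(corr))[:k]

# Toy data: only column 3 is informative about the target.
rng = np.random.default_rng(1)
n, p = 100, 6
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 3] + 0.1 * rng.normal(size=n)
top = topk_correlated(X, y, k=2)
```

Unlike Bolasso, this selector scores each term independently, so it cannot discount redundant terms that are merely correlated with each other.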
5.2. Results
The derived CTs, as well as the numbers of selected features for all rounds of the 6-fold cross validation, are presented in Table I. In most rounds, CT values are close to 90%, meaning that a few markers were able to capture the rainfall rate signal. However, in the last round, where the validation dataset is based on July 2009, CTs for all feature classes are significantly lower. This can be explained by the fact that July is the 2nd most rainy month in our dataset but also a summer month; therefore, tweets about rain could be followed or preceded by tweets discussing a sunny day, creating instabilities during the validation process. In addition, our dataset is restricted to only one year of weather observations, and therefore seasonal patterns like this one are not expected to be captured properly.
Table III. Feature Class U – 1-grams selected by Bolasso for the Rainfall case study (Round 5 of 6-fold cross validation). All weights (w) should be multiplied by 10^3.

1-gram    w       1-gram  w       1-gram   w       1-gram    w       1-gram  w
flood     0.767   piss    0.247   rainbow  0.336   todai     0.055   wet     0.781
influenc  0.579   pour    1.109   sleet    1.766   town      0.134
look      0.071   puddl   3.152   suburb   1.313   umbrella  0.223
monsoon   2.45    rain    0.19    sunni    0.193   wed       0.14

Table IV. Feature Class B – 2-grams selected by Bolasso for the Rainfall case study (Round 5 of 6-fold cross validation). All weights (w) should be multiplied by 10^3.

2-gram           w       2-gram      w       2-gram     w       2-gram     w       2-gram     w
air travel       2.167   light rain  2.508   raini dai  2.046   stop rain  3.843   wind rain  5.698
horribl weather  3.295   pour rain   7.161   rain rain  4.490   sunni dai  0.97

Table V. Feature Class H – Hybrid selection of 1-grams and 2-grams for the Rainfall case study (Round 5 of 6-fold cross validation). All weights (w) should be multiplied by 10^3.

n-gram           w       n-gram     w       n-gram     w       n-gram     w       n-gram     w
air travel       1.841   monsoon    2.042   rain rain  2.272   sunni      0.125   wet        0.524
flood            0.781   piss       0.24    rainbow    0.294   sunni dai  0.165   wind rain  3.399
horribl weather  1.282   pour       0.729   raini dai  1.083   todai      0.041
influenc         0.605   pour rain  2.708   sleet      1.891   town       0.112
light rain       2.258   puddl      3.275   stop rain  2.303   umbrella   0.229
look             0.067   rain       0.122   suburb     1.116   wed        0.1
Detailed performance evaluation results (total and per location for all feature classes) are presented in Table II. For a better interpretation of the numerical values (in mm), consider that the average rainfall rate in our dataset is equal to 1.8, with a standard deviation of 3.9 and a range of [0, 65]. Our method outperforms the baseline approach (see Algorithm 1 in Section 3.2) in all but one of the intermediate RMSE indications, as well as in total for all feature classes, achieving an improvement of 10.74% (derived by comparing the lowest total RMSEs of each method). The overall performance of our method indicates that feature class H performs better than both U and B; in the results per location, feature class H has the best performance 11 times (out of 30), the same holds for B, and U is better 8 times.
Presenting all intermediate results for each round of the cross validation would be impractical. In the remaining part of this section, we present the results of learning and testing for round 5 of the cross validation only, where the month of testing is October 2009. Tables III, IV, and V list the selected features in alphabetical order, together with their weights, for feature classes U, B, and H respectively. For class H we have also compiled a word cloud with the selected features as a more comprehensive representation of the selection outcome (Figure 1). The majority of the selected 1-grams (Table III) have a very close semantic connection with the underlying topic; stem 'puddl' holds the largest weight, whereas stem 'sunni' has taken a negative weight and, interestingly, the word 'rain' has a relatively small weight. There also exist a few words without a direct semantic connection, but most of them have negative weights and can, in a way, act as mitigators of non-weather-related uses of the remaining rain-oriented, positively weighted features. The selected 2-grams (Table IV) have a clearer semantic connection with the topic, with 'pour rain' acquiring the highest weight. In this particular case, the features for class H are formed by the exact union of the ones in classes U and B, but take different weights (Table V and Figure 1).

Fig. 1. Table V in a word cloud, where font size is proportional to regression's weight and flipped words have negative weights.

Fig. 2. Feature Class U – Inference for Rainfall case study (Round 5 of 6-fold cross validation).

Inference results per location for round 5 of the cross validation are presented in Figures 2, 3, and 4 for the U, B, and H feature classes respectively. Overall, inferences follow the pattern of actual rain; for feature class B, we see that inferences on some occasions appear to have a positive lower bound (see Figure 3(e)), which is actually the positive bias term of OLS regression, appearing when the selected markers have zero frequencies in the daily Twitter corpus of a location. As mentioned before, this problem is resolved in H, since it is very unlikely for 1-grams to also have a zero frequency. Results for class H, depicted in Figures 4(a) (Bristol), 4(b) (London) and 4(d) (Reading), demonstrate a good fit with the target signal.
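The positive-lower-bound effect described for class B can be reproduced in a toy setting: when every selected marker has zero frequency on a test day, an OLS model can only output its (here positive) intercept. The data below are fabricated for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Fabricated data: a nonnegative target driven by mostly-zero "2-gram"
# frequencies plus a true intercept of 1.0.
n = 50
X = rng.random((n, 3)) * (rng.random((n, 3)) < 0.3)   # sparse daily frequencies
y = 4.0 * X[:, 0] + 2.0 * X[:, 1] + 1.0 + 0.05 * rng.normal(size=n)

# OLS with an explicit intercept column.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept = coef[0]

# A test day on which every selected 2-gram has zero frequency: the
# prediction collapses to the intercept, i.e. the positive lower bound
# visible in Figure 3(e).
pred = intercept + np.zeros(3) @ coef[1:]
```

Because 1-grams are far less likely to all vanish on a given day, the hybrid class H rarely hits this degenerate case.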
Fig. 3. Feature Class B – Inference for Rainfall case study (Round 5 of 6-fold cross validation).
Fig. 4. Feature Class H – Inference for Rainfall case study (Round 5 of 6-fold cross validation).
6. NOWCASTING FLU RATES FROM TWITTER
In the second case study, we use the content of Twitter to infer regional flu rates in the UK. We base our inferences on three UK regions, namely Central England and Wales, North England and South England. Ground truth, that is, official flu rate measurements, is derived from the HPA. The HPA's weekly reports are based on information collected by the Royal College of General Practitioners (RCGP) and express the number of GP consultations per 100,000 citizens where the result of the diagnosis was ILI. According to the HPA, a flu rate less than or equal to 30 is considered baseline, flu rates below 100 are normal, rates between 100 and 200 are above average, and rates over 200 are characterized as exceptional.7 To create a daily representation of the HPA's weekly reports, we use linear interpolation between the weekly rates. Given the flu rates r_i and r_{i+1} of two consecutive weeks, we compute a step factor δ from

    δ = (r_{i+1} − r_i) / 7,                    (13)

and then produce flu rates for the days in between using the equation

    d_j = d_{j−1} + δ,   j ∈ {2, ..., 7},       (14)

where d_j denotes the flu rate of day j of the week and d_1 = r_i. The main assumption here is that a regional flu rate will be monotonically increasing or decreasing within the duration of a week.
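Equations (13) and (14) amount to ordinary linear interpolation; a small sketch follows (the function name and example rates are our own):

```python
import numpy as np

def daily_from_weekly(weekly_rates):
    """Linear interpolation of weekly flu rates into daily ones, per
    Equations (13)-(14): between consecutive weekly values r_i and r_{i+1},
    delta = (r_{i+1} - r_i) / 7 and d_j = d_{j-1} + delta, with d_1 = r_i."""
    daily = []
    for r_cur, r_next in zip(weekly_rates[:-1], weekly_rates[1:]):
        delta = (r_next - r_cur) / 7.0
        daily.extend(r_cur + j * delta for j in range(7))  # days 1..7 of the week
    daily.append(float(weekly_rates[-1]))                  # last reported value
    return np.array(daily)

# Example with invented weekly rates (30 -> 100 -> 65):
rates = daily_from_weekly([30.0, 100.0, 65.0])
```

Each week contributes seven monotone steps, so the daily series rises or falls steadily between consecutive weekly reports, matching the stated assumption.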
Candidate features are extracted from several Web references, such as the Web sites of the National Health Service, the BBC, Wikipedia, and so on, following the general principle of including encyclopedic, scientific, and more informal input. Similarly to the previous case study, extracted 1-grams are reduced from 2428 to 2044, and extracted 2-grams from 7589 to 1678. Here, we have removed n-grams with a count ≤ 50, since the number of tweets involved is approximately 5 times larger compared to the rainfall case study.
In the flu case study, not enough peaks are present in the ground truth time series, as the collected Twitter data cover only one flu period with above-average rates (the Swine Flu epidemic in June-July 2009). During performance evaluation, this results in training mostly on non-flu periods where there is no strong flu signal; hence, feature selection under those conditions is not optimal. To overcome this and properly assess the proposed methodology, we perform a random permutation of all data points based on their day index. The randomly permuted result for South England's flu rate is shown in Figure 10(c); we apply the same randomized index to all regions during performance evaluation.
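The shared random permutation can be sketched as follows; the seed, array shapes, and region contents are illustrative assumptions, with integer ramps standing in for real feature matrices and flu rates:

```python
import numpy as np

rng = np.random.default_rng(42)          # fixed seed: one shuffle, reused everywhere
n_days = 303
perm = rng.permutation(n_days)           # randomized day index shared by all regions

# Toy stand-ins for each region's daily feature matrix (X) and flu rates (y);
# row i of X and entry i of y both describe day i.
X = np.arange(n_days * 2, dtype=float).reshape(n_days, 2)
y = np.arange(n_days, dtype=float)
regions = {name: (X.copy(), y.copy())
           for name in ["Central England & Wales", "North England", "South England"]}

# Apply the SAME permutation to every region, so the alignment between
# features and ground truth (and across regions) is preserved.
shuffled = {name: (Xr[perm], yr[perm]) for name, (Xr, yr) in regions.items()}
```

Reusing one permutation keeps each shuffled day's features paired with that day's flu rate, which is the property the evaluation depends on.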
6.1. Experimental Settings
For this experiment, we considered tweets and ground truth in the time period between June 21, 2009 and April 19, 2010 (303 days). The total number of tweets used reaches approximately 50 million. Similarly to the previous case study, we apply 5-fold cross validation using data of 60 or 61 days per fold. In each round of the cross validation, 4 folds are used for training. From the remaining fold, 30 days of data are used for validating CT and the rest for testing. Bolasso settings are identical to the ones used in the rainfall case.
Notice that in the following experiments data points are not contiguous in terms of time, since they have been permuted randomly based on their day index (as explained in the previous section). However, we have included an example at the end of the next section, where contiguous (time-wise) training, validation and testing data points have been used.
6.2. Results
The derived CTs and numbers of selected features for all rounds of the 5-fold cross validation are presented in Table VI. In the flu case study, most CTs (especially for the U and H feature classes) get a value close to the lower bound (50%) after validation, and on average more features (compared to the rainfall case) are selected. This is due either to the existence of only one significant flu period in the ground truth data or to a general inadequacy of 1-grams to describe the underlying topic as effectively as in the previous case study.

7. "Interpreting the HPA Weekly National Influenza Report", July 2009 – http://goo.gl/GWZmB.

Table VI. Nowcasting Flu Rates – Derived Consensus Thresholds and numbers of selected features (in parentheses) for all Feature Classes (FC) in the rounds of 5-fold cross validation. Fold i denotes the validation/testing fold of round 6 − i.

FC  Fold 5      Fold 4       Fold 3       Fold 2      Fold 1
U   52.5% (90)  52.5% (100)  52.5% (108)  62.5% (67)  50% (62)
B   55% (42)    62.5% (47)   92.5% (14)   85% (10)    52.5% (36)
H   55% (124)   62.5% (131)  52.5% (151)  60% (103)   50% (100)

Table VII. Nowcasting Flu Rates – RMSEs for all Feature Classes (FC) and regions in the rounds of 5-fold cross validation. Fold i denotes the validation/testing fold of round 6 − i. The last column holds the RMSEs of the baseline (BS) method.

Region                   FC  Fold 5  Fold 4  Fold 3  Fold 2  Fold 1  Mean RMSE  BS-Mean RMSE
Central England & Wales  U   11.781  9.005   16.147  13.252  10.912  12.219     12.677
                         B   11.901  12.2    21.977  12.426  14.615  14.624     15.665
                         H   8.36    8.826   14.618  12.312  12.62   11.347     11.691
North England            U   9.757   6.708   9.092   13.117  8.489   9.432      10.511
                         B   9.659   9.969   10.716  12.057  8.699   10.22      12.299
                         H   9.782   7.112   6.65    13.694  7.607   8.969      9.752
South England            U   9.599   8.285   13.656  14.673  11.061  11.455     13.617
                         B   13.536  9.209   16.188  14.279  8.531   12.348     12.977
                         H   9.86    7.881   13.448  14.34   8.872   10.88      12.768
Total RMSE               U   10.426  8.056   13.29   13.699  10.222  11.139     12.438
                         B   11.806  10.536  16.93   12.958  10.986  12.643     13.815
                         H   9.359   7.971   12.094  13.475  9.93    10.566     11.617

Table VII holds the performance results for all rounds of the cross validation. For a more comprehensive interpretation of the numerical values, consider that the average ILI rate across the regions used in our experiments is equal to 26.659, with a standard deviation of 29.270 and a range of [2, 172]. Again, feature class U performs better than B, whereas H outperforms both U and B. In the regional results per fold, class H has the best performance 8 times, B 5 times and U only 2 times (out of 15 subcases in total). Similarly to the previous case study, our method improves on the performance of the baseline approach, by a factor of 9.05%.
For this case study, we present all intermediate results for round 1 of the cross validation. Tables VIII, IX, and X show the selected features for the U, B and H feature classes respectively. From the selected 1-grams (Table VIII), stem 'irrig'8 has the largest weight. Many illness-related markers have been selected, such as 'cough', 'health', 'medic', 'nurs', 'throat' and so on, but there also exist words with no clear semantic relation. Surprisingly, stem 'flu' has not been selected as a feature in this round (it has only been selected in round 5). On the contrary, almost all selected 2-grams (Table IX) can be considered flu-related; 'confirm swine' has the largest weight for both feature classes B and H (Table X and Figure 5). As a general remark, keeping in mind that those features have been selected using data containing one significant flu period, they cannot be considered very generic ones.

8. Irrigation describes the procedure of cleaning a wound or body organ by flushing or washing out with water or a medicated solution (WordNet).

Table VIII. Feature Class U – 1-grams selected by Bolasso for the Flu case study (Round 1 of 5-fold cross validation). All weights (w) should be multiplied by 10^4.

1-gram     w       1-gram    w       1-gram     w       1-gram     w       1-gram     w
acut       1.034   cleav     0.735   hippocr    6.249   properti   0.66    speed      0.286
afford     0.181   complex   0.499   holidai    0.017   psycholog  1.103   spike      0.145
allergi    2.569   cough     0.216   huge       0.33    public     0.212   stage      0.109
approv     0.672   cruis     1.105   irrig      10.116  radar      0.284   strength   0.873
artifici   2.036   daughter  0.187   item       0.337   reach      0.247   strong     0.336
assembl    0.589   dilut     4.165   knock      0.261   reliev     0.254   swine      1.262
asthmat    4.526   drag      0.098   lethal     0.73    remain     0.755   tast       0.13
attempt    0.375   erad      0.201   major      0.367   rough      0.068   team       0.031
behavior   1.747   face      0.008   medic      1.06    run        0.242   throat     0.07
better     0.066   fellow    0.542   member     0.354   rush       0.159   tissu      0.533
bind       0.675   fluid     2.002   mercuri    0.588   scari      0.198   transmit   1.352
blood      0.059   fuss      0.575   metro      0.397   seal       0.161   troop      0.532
boni       1.308   germ      0.211   mile       0.081   season     0.103   typic      0.585
bulg       0.966   guilti    0.608   miss       0.071   seizur     2.448   underli    0.774
caution    2.578   habit     0.619   nurs       0.223   self       0.127   unquot     8.901
cellular   2.125   halt      1.472   perform    0.084   sik        0.634   upcom      0.642
checklist  1.494   harbour   0.472   personnel  1.451   site       0.042   wave       0.042
chicken    0.317   health    0.241   pictur     0.134   soak       0.413   wikipedia  0.824

Table IX. Feature Class B – 2-grams selected by Bolasso for the Flu case study (Round 1 of 5-fold cross validation). All weights (w) should be multiplied by 10^4.

2-gram         w        2-gram            w       2-gram          w       2-gram        w
case swine     12.783   flu bad           6.641   need take       0.887   talk friend   4.9
check code     6.27     flu jab           4.66    pain night      14.149  time knock    10.002
check site     0.568    flu relat         10.948  physic emotion  7.95    total cost    11.582
confirm swine  31.509   flu symptom       7.693   sleep well      1.319   underli health 25.535
cough fit      7.381    flu web           8.017   sore head       4.297   virus epidem  28.204
cough lung     7.974    ground take       15.208  spread viru     20.871  visit doctor  12.327
cough night    16.73    health care       0.636   stai indoor     5.482   weight loss   0.447
die swine      9.722    healthcar worker  3.876   suspect swine   3.863   woke sweat    33.133
effect swine   27.675   home wors         22.167  swine flu       1.153   wonder swine  11.5085
feel better    0.655    ion channel       9.755   symptom swine   5.895
feel slightli  1.712    kick ass          0.335   take care       0.382
Regional inference results are presented in Figures 6, 7, and 8 for classes U, B, and H respectively. There is a clear indication that the inferred signal has a strong correlation with the actual one; for instance, for feature class H (Figure 8) the linear correlation coefficients between the inferred and the actual flu rate for Central England & Wales, North England and South England are equal to 0.933, 0.855 and 0.905 respectively. Using all folds of the cross validation, the average linear correlation for classes U, B, and H is equal to 0.905, 0.868 and 0.911 respectively, providing additional evidence for the significance of the inference performance.9

9. All p-values for the correlation coefficients listed are ≤ 0.05, indicating statistical significance.

Table X. Feature Class H – Hybrid selection of 1-grams and 2-grams for the Flu case study (Round 1 of 5-fold cross validation). All weights (w) should be multiplied by 10^4.

n-gram         w        n-gram            w       n-gram          w       n-gram          w
acut           0.796    effect swine      19.835  medic           0.48    spike           0.032
afford         0.106    erad              0.27    member          0.169   spread viru     12.918
allergi        2.332    face              0.012   mercuri         0.414   stage           0.101
approv         0.516    feel better       0.15    metro           0.365   stai indoor     1.969
artifici       1.319    feel slightli     0.775   mile            0.092   strength        0.739
assembl        0.231    fellow            0.319   miss            0.073   strong          0.018
asthmat        2.607    flu bad           4.953   need take       0.759   suspect swine   2.503
attempt        0.322    flu jab           0.11    nurs            0.118   swine           0.203
behavior       1.349    flu relat         3.183   pain night      9.823   swine flu       1.577
bind           0.437    flu symptom       1.471   perform         0.083   symptom swine   1.626
blood          0.05     flu web           5.463   personnel       1.359   take care       0.21
boni           0.984    fluid             1.87    physic emotion  6.192   talk friend     2.518
bulg           0.733    fuss              0.234   pictur          0.124   tast            0.08
case swine     4.282    germ              0.111   properti        0.372   team            0.044
caution        1.174    ground take       3.022   radar           0.287   throat          0.251
cellular       2.072    guilti            0.394   reach           0.201   time knock      6.523
check code     4.495    habit             0.381   remain          0.666   tissu           0.012
check site     0.149    halt              0.819   rough           0.075   total cost      4.794
checklist      1.595    health            0.04    run             0.143   transmit        1.535
chicken        0.286    health care       0.393   rush            0.07    troop           0.767
cleav          0.991    healthcar worker  1.339   scari           0.109   underli         0.221
confirm swine  21.874   hippocr           6.038   seal            0.091   underli health  11.707
cough          0.234    holidai           0.021   season          0.064   unquot          8.753
cough fit      2.395    home wors         6.302   seizur          2.987   upcom           0.071
cough lung     2.406    huge              0.199   self            0.059   viru epidem     8.805
cough night    6.748    ion channel       4.974   sik             0.542   visit doctor    3.456
cruis          1.186    irrig             8.721   site            0.06    wave            0.033
daughter       0.048    item              0.219   sleep well      0.753   weight loss     0.296
die swine      0.196    kick ass          0.15    soak            0.41    wikipedia       0.66
dilut          2.708    knock             0.24    sore head       2.023   woke sweat      19.912
drag           0.147    major             0.376   speed           0.198   wonder swine    7.266

Finally, we present some additional experimental results where training, validation and testing have been carried out in a contiguous, time-wise manner. From the 303 days of data, we used days 61–90 for validating CT and days 91–121 for testing (from September 19 to October 19, 2009); the remaining days have been used for training. In this setting, we train on data from the Swine Flu epidemic period and then test on a period where influenza existed but its rate was within the normal range. In Figure 9, we show the inference outcome for South England for all feature classes.10 We have also included a smoothed representation of the inferences (using a 7-point moving average) to induce a weekly trend. Class H has the best performance; in this example, class B performs better than U.
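The 7-point moving average used to induce a weekly trend can be sketched as follows; the edge handling (shrinking the window near the boundaries) is our own choice, not specified in the text:

```python
import numpy as np

def moving_average(x, window=7):
    """Centered moving average. Near the edges the window is effectively
    shrunk by dividing by the number of points actually contributing,
    instead of padding with implicit zeros."""
    x = np.asarray(x, dtype=float)
    kernel = np.ones(window)
    sums = np.convolve(x, kernel, mode="same")
    counts = np.convolve(np.ones_like(x), kernel, mode="same")  # window sizes
    return sums / counts

# Smoothing a month of (linearly increasing) toy daily inferences:
sm = moving_average(np.arange(31.0))
```

On a linear input the interior points are left unchanged, which makes it easy to check that the smoother only attenuates day-to-day fluctuations rather than shifting the overall level.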
7. DISCUSSION
The experimental results provided practical proof of the effectiveness of our method in two case studies: rainfall and flu rate inference. Rain and flu are observable pieces of information available to the general public and are therefore expected to be part of discussions in the social Web media. While samples of both rainfall and ILI rates can be described by exponentially distributed random variables, those two phenomena have a distinctive property. Precipitation, especially in the UK, is rather unstable, that is, prone to daily changes, whereas a flu rate evolves much more smoothly.

10. Results for the other 2 regions were similar. The South England region was chosen because it is the one with the highest population (as it includes the city of London).

Fig. 5. Table X in a word cloud, where font size is proportional to regression's weight and flipped words have negative weights.

Fig. 6. Feature Class U – Inference for Flu case study (Round 1 of 5-fold cross validation).

Fig. 7. Feature Class B – Inference for Flu case study (Round 1 of 5-fold cross validation).
Figures 10(a) (rainfall in London) and 10(b) (flu rate in South England) provide a clear picture of this. Consequently, rainfall rate inference is a much harder problem: users discussing the weather of a preceding day, or the forecast for the next one, especially when the current weather conditions contradict them, affect not only the inference process but learning as well. For example, in the 6th round of the cross validation, where we derived the worst inference performance, we see that in the VSRs of the test set (which includes 67 rainy days out of 155 in total for all 5 locations), the 1-gram 'flood' has the exact same average frequency during rainy and non-rainy days; furthermore, the average frequency of stem 'rain' on days with no rain was equal to 68% of the one on rainy days. Similar statistics are also observed in the training set, and for 2-grams; for instance, the average frequencies of 'rain hard' and 'pour rain' in the training set (716/1515 rainy days) on non-rainy days are equal to 42% and 13% of the ones on rainy days respectively.

Fig. 8. Feature Class H – Inference for Flu case study (Round 1 of 5-fold cross validation).

Fig. 9. Flu inference results for contiguous training, validation and testing sets for South England. Testing is performed on data from September 19 to October 19, 2009.

Fig. 10. Comparing smoothness of ground truth between the two case studies.
The proposed method is able to overcome those tendencies by selecting features with a more stable behavior, to the extent possible. However, the figures in the two previous sections make clear that inferences have a higher correlation with the ground truth in the flu case study, even when deploying a randomly permuted version of the dataset, which encapsulates only one major flu period and is therefore of worse quality compared to the rainfall data. Based on those experimental results and the properties of the target events, which reach several extremes, we argue that the proposed method is applicable to other events as well, at least to those drawn from an exponential distribution.
Another important point in this methodology regards the feature extraction approach. A mainstream information retrieval technique implies the formation of a vocabulary index from the entire corpus [Manning et al. 2008]. Instead, we have chosen to form a more focused and numerically restricted set of candidate features from online references related to the target event, a choice justified by LASSO's risk bound (see Equation (10)). The short time span of our data limits the amount of training samples, and therefore directs us to reduce the number of candidate features in order to minimize the risk error and avoid overfitting. Indeed, in the flu case study, where only a small variation in the ILI rates is observed, when we formed an index from the entire Twitter corpus, the method tended to select non-illness-related features as well. Some of those features, for example, described a popular movie released in July 2009, the same period as the peak in the flu rate signal. By having fewer candidates, slightly more focused on the target event's domain, we constrain the ratio of dimensionality over training samples and this issue is resolved. Nevertheless, the size of the 1-gram vocabularies in both case studies was not small (approx. 2400 words) and 99% of the daily tweets for each location or region contained at least one candidate feature. However, for 2-grams this proportion was reduced to 1.5% and 3% for the rainfall and flu rate case studies respectively, meaning that this class of features required a much higher number of tweets in order to contribute properly.
The experimental process also made clear that a manual selection of very obvious keywords that logically describe a topic, such as 'flu' or 'rain', might not be optimal, especially when using 1-grams; rarer words ('puddl' or 'irrig') exhibited more stable indications of the target events' magnitude. Finally, it is important to note how CT operates as an additional layer in the feature selection process, facilitating adaptation to the special characteristics of each dataset. CT's validation showed that a blind application of strict Bolasso (CT = 1) would not have performed as well as the relaxed version we applied; only once in 22 validation sets was the optimal value for CT equal to 1.
8. CONCLUSIONS AND FUTURE WORK
We have presented a supervised learning framework for nowcasting events by exploiting unstructured textual information published on the social Web. The proposed methodology is able to turn geo-tagged user posts on the microblogging service of Twitter into topic-specific geolocated signals by selecting textual features that capture semantic notions of the inference target. Sparse learning via a soft version of Bolasso, the bootstrapped LASSO L1-norm regulariser, performs a consistent feature selection, which increases the inference performance by approximately 10% compared to previously proposed methods [Culotta 2010; Ginsberg et al. 2008].
We have displayed results drawn from two case studies, that is, the benchmark problem of inferring rainfall rates and the real-life task of detecting the diffusion of Influenza-like Illness from tweets. In both case studies, the majority of selected features was directly related to the target topic and the inference performance has been significant; for instance, for the important task of nowcasting influenza, inferred flu rates reached an average correlation of 91.11% with the actual ones. As expected, selected 2-grams showed a better semantic connection with the target topics; however, during inference they did not perform as well as 1-grams. Combining both feature classes into a hybrid approach resulted in an overall better performance.
Future work could be focused on improving various subtasks in our methodology.
Feature extraction can become more sophisticated by identifying self diagnostic
statements in the corpus (e.g., “I got soaked today” or “I have a headache”) or by in-
corporating entities (gender, names, brands, etc.). Similarly to other work (mentioned
ACM Transactions on Intelligent Systems and Technology, Vol. 3, No. 4, Article 72, Publication date: September 2012.
Nowcasting Events from the Social Web with Statistical Learning 72:21
in Section 2), applying sentiment or mood analysis to the text could offer an additional
dimension of input information. Exploiting the temporal behavior of an event, combined
with more sophisticated inference techniques able to model nonlinearities, or even a
generative approach, could also improve inference performance as well as provide
interesting insights (e.g., the identification of latent variables that influence the
inference process). Finally, on a conceptual level, the detection of multivariate signals,
where target variables may be interdependent (e.g., electoral voting intentions), could
form an interesting task for future research.
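A first approximation of the self-diagnostic statement detection suggested above could be a small first-person pattern matcher. The patterns and function names below are purely illustrative, not part of the paper's pipeline; a production system would need a broader pattern set or a trained classifier.

```python
import re

# Illustrative first-person patterns for self-diagnostic statements.
SELF_DIAGNOSTIC = [
    re.compile(r"\bi (?:got|was) soaked\b", re.I),
    re.compile(r"\bi (?:have|feel) (?:a )?(?:headache|fever|flu|cold)\b", re.I),
]

def is_self_diagnostic(tweet: str) -> bool:
    """True if the tweet contains a first-person statement about the target condition."""
    return any(p.search(tweet) for p in SELF_DIAGNOSTIC)

print(is_self_diagnostic("I got soaked today on my way home"))  # first-person report
print(is_self_diagnostic("flu season statistics released"))     # topical, but not self-diagnostic
```

The point of such filtering is to weight tweets that report a personal experience more heavily than tweets that merely mention the topic, e.g. news headlines.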
ACKNOWLEDGMENTS
The authors would like to thank Twitter Inc. and the HPA for making their data publicly accessible, the
PASCAL2 Network for its continuous support, and Tijl De Bie for providing feedback in the early stages of this work. We
are also grateful to the anonymous reviewers for their constructive feedback.
REFERENCES
Asur, S. and Huberman, B. A. 2010. Predicting the future with social media. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. IEEE, 492–499.
Bach, F. R. 2008. Bolasso: Model consistent Lasso estimation through the bootstrap. In Proceedings of the 25th International Conference on Machine Learning. 33–40.
Bartlett, P. L., Mendelson, S., and Neeman, J. 2009. l1-regularized linear regression: Persistence and oracle inequalities. Tech. rep., UC-Berkeley.
Bollen, J., Mao, H., and Zeng, X. 2011. Twitter mood predicts the stock market. J. Comput. Sci.
Breiman, L. 1996. Bagging predictors. Mach. Learn. 24, 2, 123–140.
Corley, C. D., Mikler, A. R., Singh, K. P., and Cook, D. J. 2009. Monitoring influenza trends through mining social media. In Proceedings of the International Conference on Bioinformatics and Computational Biology. 340–346.
Culotta, A. 2010. Towards detecting influenza epidemics by analyzing Twitter messages. In Proceedings of the KDD Workshop on Social Media Analytics.
Efron, B. 1979. Bootstrap methods: Another look at the jackknife. Ann. Statist. 7, 1, 1–26.
Efron, B. and Tibshirani, R. J. 1993. An Introduction to the Bootstrap. Chapman & Hall.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. 2004. Least angle regression. Ann. Statist. 32, 2, 407–451.
Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., and Brilliant, L. 2008. Detecting influenza epidemics using search engine query data. Nature 457, 7232, 1012–1014.
Guyon, I. and Elisseeff, A. 2003. An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 7–8, 1157–1182.
Jenkins, G. J., Perry, M. C., and Prior, M. J. 2008. The Climate of the United Kingdom and Recent Trends. Met Office, Hadley Centre, Exeter, UK.
Lampos, V. and Cristianini, N. 2010. Tracking the flu pandemic by monitoring the Social Web. In Proceedings of the 2nd IAPR Workshop on Cognitive Information Processing. IEEE Press, 411–416.
Lampos, V., De Bie, T., and Cristianini, N. 2010. Flu detector—Tracking epidemics on Twitter. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Springer, 599–602.
Lv, J. and Fan, Y. 2009. A unified approach to model selection and sparse recovery using regularized least squares. Ann. Statist. 37, 6A, 3498–3528.
Manning, C. D., Raghavan, P., and Schütze, H. 2008. Introduction to Information Retrieval. Cambridge University Press.
Pang, B. and Lee, L. 2008. Opinion mining and sentiment analysis. Found. Trends Inf. Retr. 2, 1–2, 1–135.
Polgreen, P. M., Chen, Y., Pennock, D. M., Nelson, F. D., and Weinstein, R. A. 2008. Using internet searches for influenza surveillance. Clinical Infectious Diseases 47, 11, 1443–1448.
Porter, M. F. 1980. An algorithm for suffix stripping. Program 14, 3, 130–137.
Sakaki, T., Okazaki, M., and Matsuo, Y. 2010. Earthquake shakes Twitter users: Real-time event detection by social sensors. In Proceedings of the 19th International Conference on World Wide Web. 851–860.
Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. Series B (Methodological) 58, 1, 267–288.
Tumasjan, A., Sprenger, T. O., Sandner, P. G., and Welpe, I. M. 2010. Predicting elections with Twitter: What 140 characters reveal about political sentiment. In Proceedings of the International AAAI Conference on Weblogs and Social Media. 178–185.
Zhao, P. and Yu, B. 2006. On model selection consistency of Lasso. J. Mach. Learn. Res. 7, 2541–2563.
Received April 2011; revised August 2011; accepted September 2011
ACM Transactions on Intelligent Systems and Technology, Vol. 3, No. 4, Article 72, Publication date: September 2012.