Programmed method: developing a toolset for capturing and analyzing tweets
Erik Borra and Bernhard Rieder
Department of Media Studies, University of Amsterdam, Amsterdam, The Netherlands
Abstract
Purpose – The purpose of this paper is to introduce Digital Methods Initiative Twitter Capture and Analysis Toolset, a toolset for capturing and analyzing Twitter data. Instead of just presenting a technical paper detailing the system, however, the authors argue that the type of data used for, as well as the methods encoded in, computational systems have epistemological repercussions for research. The authors thus aim at situating the development of the toolset in relation to methodological debates in the social sciences and humanities.
Design/methodology/approach – The authors review the possibilities and limitations of existing approaches to capture and analyze Twitter data in order to address the various ways in which computational systems frame research. The authors then introduce the open-source toolset and put forward an approach that embraces methodological diversity and epistemological plurality.
Findings – The authors find that design decisions and more general methodological reasoning can and should go hand in hand when building tools for computational social science or digital humanities.
Practical implications – Besides methodological transparency, the software provides robust and reproducible data capture and analysis, and interlinks with existing analytical software. Epistemic plurality is emphasized by taking into account how Twitter structures information, by allowing for a number of different sampling techniques, by enabling a variety of analytical approaches or paradigms, and by facilitating work at the micro, meso, and macro levels.
Originality/value – The paper opens up critical debate by connecting tool design to fundamental interrogations of methodology and its repercussions for the production of knowledge. The design of the software is inspired by exchanges and debates with scholars from a variety of disciplines and the attempt to propose a flexible and extensible tool that accommodates a wide array of methodological approaches is directly motivated by the desire to keep computational work open for various epistemic sensibilities.
Keywords Twitter, Computational social science, Data collection, Analysis, Digital humanities,
Digital methods
Paper type Conceptual paper
Aslib Journal of Information Management, Vol. 66 No. 3, 2014, pp. 262-278. Emerald Group Publishing Limited, ISSN 2050-3806. DOI 10.1108/AJIM-09-2013-0094. Received 15 September 2013; revised 20 December 2013, 19 February 2014, 27 February 2014; accepted 11 March 2014.

The authors would like to thank the special issue editors and the anonymous reviewers, as well as Noortje Marres, David Moats, Richard Rogers, Natalia Sanchez Querubin, Emma Uprichard, Lonneke van der Velden, and Esther Weltevrede for their useful comments, and Emile den Tex for his technical contributions. This project has benefited from a grant by the ESRC Digital Social Research Programme and the Centre for the Study of Invention and Social Process (Goldsmiths, University of London).

1. Introduction
The relatively recent flourishing of computer-supported approaches to the study of social and cultural phenomena – digital methods (Rogers, 2013), computational social science (Lazer et al., 2009), digital humanities (Kirschenbaum, 2010), each with their set of significant precursors – has led to an encounter between technology and methodology that deeply affects the status and practice of research in the social sciences and humanities. Statistics, modeling, and other formal methods introduced strong elements of technicality long ago. But the study of very large sets of highly dynamic data, which, unlike, e.g. surveys, are not explicitly produced for scientific study, institutes computing as a methodological mediator (Latour, 2005), and brings along ideas, artifacts, practices, and logistics tied to the technological in a more far-reaching and radical fashion. Packages like SPSS have enabled, broadened, and standardized the use of computers and software in social research since the 1960s (Uprichard et al., 2008). The recent explosion in research employing data analysis techniques, often focussed on social media and other online phenomena, however, propels questions of toolmaking – software design, implementation, maintenance, etc. – into the center of methodological debates and practices.
A number of commentators (Boyd and Crawford, 2012; Rieder and Röhle, 2012;
Puschmann and Burgess, 2014) have called attention to the issues arising from the use
of software to study data extracted from (mostly proprietary) software platforms that
enable and orchestrate expressions and interactions of sometimes hundreds of millions
of users. These issues include the methodological, epistemological, logistical, legal,
ethical, and political dimensions of what is increasingly referred to as “big data”
research. While such critical interrogation is necessary and productive, in this paper
we take a different approach to some of the issues raised, by introducing and
discussing an open-source, freely available data capture, and analysis platform for
the Twitter micro blogging service, the Digital Methods Initiative Twitter Capture and
Analysis Toolset (DMI-TCAT)[1]. Although we do not envision this research software
to be a “solution” to the many questions at hand, it encapsulates a number of
propositions and commitments that are indeed programmatic beyond the mere
technicalities at hand. A presentation of such a tool cannot leave technical matters
aside, but in this paper we attempt to productively link them to some of the broader
repercussions of software-based research of social and cultural phenomena.
Although Cioffi-Revilla’s assessment that “computational social science is an
instrument-enabled scientific discipline, in this respect scientifically similar to
microbiology, radio astronomy, or nanoscience” (Cioffi-Revilla, 2010, p. 260) needs
to be nuanced, the argument that “it is the instrument of investigation that drives the
development of theory and understanding” (Cioffi-Revilla, 2010) is not easy to dismiss
when looking at research dealing with Twitter. The large number and wide variety of
computational approaches, their status as mostly experimental tools, their application
in disciplines often unaccustomed to computational principles, the pervasiveness of
social media, and the speed of technological change – all these elements require us to
pay much more attention to our instruments than we have been accustomed to. Having
built such an instrument, we feel obliged to go beyond the presentation of architecture
or results and account for the way we think – or hope – that our tool “drives” research
in a more substantial way than solely solving particular technical and logistical
problems. This desire possibly betrays our disciplinary affiliation. Media studies, in
particular in its humanities bent, has long focussed on analyzing technologies
as media, that is, as artifacts or institutions that do not merely transport information,
but, by affecting the scale, speed, form, in short, the character of expression and
interaction, contribute to how societies and cultures assemble, operate, and produce
knowledge. Just as Winner (1980) pointed out that tools have politics too, we consider
a research toolset such as DMI-TCAT to have epistemic orientations that have
repercussions for the production of academic knowledge. Rather than glossing over
them, we want to bring them to the front.
With these elements in mind, our paper proceeds in two distinct steps:
• We briefly summarize existing tools and approaches for Twitter analysis, discuss how they relate to academic research, and develop a set of guidelines or principles for our own contribution along the way.
• We present the design and architecture of DMI-TCAT, show how it addresses the concerns raised, and detail the analytical possibilities for Twitter research it provides. Through all of this, the relationship between toolmaking and methodology remains in focus.
2. Existing work
When highlighting the emerging antagonism between “Big Data rich” and “Big Data
poor,” Boyd and Crawford (2012) cite Twitter researcher Jimmy Lin as discouraging
“researchers from pursuing lines of inquiry that internal Twitter researchers could do
better.” This quote echoes – at least if taken out of context – Savage and Burrows’
(2007) diagnosis of a “coming crisis in empirical sociology”: a marginalization of
academic empirical work due to the ever increasing capacity and inclination
of “knowing capitalism” (Thrift, 2005) to collect large amounts of data and to deploy
a variety of methods to analyze them. However, instead of advocating retreat into the
realms of synthesizing theory, they call for “greater reflection on how sociologists can
best relate to the proliferation of social data gathered by others” (Savage and Burrows,
2007, p. 895) and for renewed involvement with the “politics of method” (p. 895) in both
academic and private research. Rather than leaving areas like social media research to
in-house scientists and marketers, we should be “critically engaging with the extensive
data sources which now exist, and not least, campaigning for access to such data where
they are currently private” (p. 896). We could not agree more with this assessment
and would like to emphasize that the crisis Savage and Burrows diagnose goes beyond
the question of access to data. The proliferation of actors involved in the analysis of
online data – private and academic, coming from a wide variety of disciplines – has
led to the formation of an epistemological battlefield where different paradigms,
methods, styles, and objectives struggle for interpretive agency, i.e. for the power to
produce (empirical) accounts of the ever expanding online domain. To be clear: the
various technical, legal, logistical, and even ethical stumbling blocks for data analysis,
and the ways in which the various actors are able or decide to react to them, have very
real consequences for the actual knowledge produced and circulated.
In order to situate our own contribution and to develop a number of guiding
principles we need to provide a short overview of existing strategies in Twitter
research and discuss their limitations. The rough groupings we make, which revolve
around logistical questions, precede concrete research designs and rather define
a particular methodological space in which such concrete designs are then formulated.
Twitter’s in-house research projects, or projects with cooperation agreements, have
direct access to the full Twitter archive and are in the luxurious position to not have to
worry about access to data, data completeness, or technical limitations – although legal
and ethical considerations linger. At the same time, they are utterly dependent on the
good will of the company. Their academic independence in terms of subject focus is
doubtful, the tools and techniques used are often proprietary and can thus not be
scrutinized, and only a few projects will actually be selected in the first place.
Projects acquiring data through resellers such as DataSift or Gnip also gain access
to the full archive of tweets and their metadata. Cost, however, is the main limiting
factor to this approach: the pricy subscription models to those services are out of reach
for small- and mid-sized research groups. As Twitter donated its entire archive to the
US Library of Congress, a viable and cheap alternative may become available in
the future. Any project working with data sourced from these archives, however,
will have to rely on custom programming for analysis.
This brings us to online analytics platforms, which provide simple interfaces for
both data acquisition and analysis, and are oriented toward either academic
(e.g. DiscoverText[2], Truthy[3]) or commercial research (e.g. Topsy[4], Twitonomy[5],
Hootsuite[6]). Those who interface with data resellers, such as DiscoverText, are again
costly, but services collecting data through Twitter’s Application Programming
Interfaces (APIs) are often available for free or at reduced cost. These platforms and
their dashboard-like interfaces can be very practical and useful, but are generally
limited in terms of their analytical capacities, cannot be easily extended, and allow for
little or no data export to stand-alone analytics software. Most problematic is the fact
that they blackbox a large part of the research chain and generally follow a particular
paradigmatic orientation. We do not think that commercial platforms should be
dismissed outright, but it is clear that they are mainly focussing on the requirements
of marketing professionals, emphasizing lists of “top” or “influential” users and content
items. More academic platforms equally subscribe to specific paradigmatic approaches
coming with prior assumptions about both data, e.g. Truthy considering spam as noise
(McKelvey and Menczer, 2013), and method, e.g. DiscoverText focussing on the
classification of tweets and Truthy on information diffusion. As such, researchers
are restricted to their premises and analytical techniques.
If all that is needed is a set of tweets matching certain keywords, e.g. all the tweets
containing a hashtag for a specific event, ad hoc or project-based custom capturing
tools such as ScraperWiki[7], Google spreadsheets, or streamR[8] are commonly used.
Just like custom programming, this approach affords flexibility, transparency,
and control, but results may be difficult to verify or reproduce, bugs can occur, and
significant technical skill needs to be acquired.
Two well-known examples of open-source capturing software, an approach that
retains transparency and (some) flexibility and control while reducing the need for
technical expertise, are 140kit[9] and TwapperKeeper. Both started out as public online
services to capture, export, and – in the case of 140kit – analyze tweets, but had to close
down when Twitter changed its terms of service in 2011. The source code for both
projects was published online and yourTwapperKeeper[10] (yTK) in particular has
been used by many humanities and social science scholars to capture tweets (see, e.g.
Bruns and Liang, 2012). To facilitate and standardize research with yTK, which comes
without built-in analytics, Bruns and Burgess (2012) published a set of useful GAWK
scripts and we therefore initially used yTK to capture tweets. But the less technically
inclined humanities and social science scholars we often work with found the GAWK
scripts too difficult to handle. Our attempt to build a simpler analytics platform on top
of yTK proved difficult: its database structure is not designed for fast analysis
and omits many fields returned by the API[11]; its codebase is not updated on a regular
basis; data is not stored as UTF-8 and languages using non-Latin character sets thus
cannot be analyzed. Finally, we not only wanted to capture and analyze keyword based
samples of tweets but also user timelines, 1 percent samples, follower networks, and
other types of data available through Twitter’s API.
Reviewing the possibilities and limitations of existing tools led us to the decision
to build our own capture and analysis platform from the ground up. It also allowed us
to develop a set of guiding principles that translate into a series of decisions or
commitments on three interrelated levels. Concerning logistics, we attempt to lower the
barrier of entry to Twitter research by providing a freely available platform built
on publicly available data which requires little or no custom programming and scales
to data sets of hundreds of millions of tweets using consumer hardware. Regarding
epistemology, our tool emphasizes epistemic plurality by staying close to the units
defined by the Twitter platform instead of storing aggregates, by allowing for a
number of different sampling techniques, by enabling a variety of analytical
approaches or paradigms, and by facilitating work at the micro, meso, and macro
levels. On the level of methodology, finally, we provide robust and reproducible
data capture and analysis, allow easy import and export of data, interlink with existing
analytics software, and guarantee methodological transparency by publishing the
source code.
In the next section, we provide a more detailed description of our system and show
how these guiding principles have been translated into concrete design decisions.
3. DMI-TCAT
In line with the general architecture of DMI-TCAT, this presentation is divided into
data capture, the way data are retrieved, enriched, and stored in a database, and data
analysis, which includes all analytical operations that can be performed on the stored
elements. While these two aspects have been developed in tandem, they are mostly
independent: it is possible to use the toolset to only capture data, e.g. as an alternative to
yTK, or to only analyze them, e.g. by importing a data set captured with yTK. Figure 1
provides a basic overview of the system.
DMI-TCAT is written in PHP and organized around a MySQL database
positioned between the capture and analysis parts of the system. Data are retrieved
by different modules controlled at regular intervals by a supervisor script (using the
cron scheduler present in all Unix-like operating systems), which checks whether
the capturing processes are running and, if necessary, restarts them. A separate script
translates shortened URLs. Database contents are analyzed in a two-stage process: the
selection of a subsample precedes the application of various analytical techniques.
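To make the supervisor logic concrete, the following is a minimal sketch of how a cron-invoked controller might check whether capture processes are running and restart them if necessary. It is written in Python for brevity (DMI-TCAT itself implements this in PHP) and the script names are hypothetical placeholders, not the toolset's actual file names.

```python
#!/usr/bin/env python3
# Hypothetical supervisor sketch; DMI-TCAT implements equivalent logic in PHP.
# A crontab entry such as the following could invoke it every minute:
#   * * * * * /usr/bin/python3 /path/to/supervisor.py
import subprocess

# Capture scripts to keep alive (illustrative placeholder names).
CAPTURE_SCRIPTS = ["capture_track.php", "capture_onepercent.php", "capture_follow.php"]

def is_running(script):
    """Return True if a process matching the script name is found."""
    result = subprocess.run(["pgrep", "-f", script], capture_output=True)
    return result.returncode == 0

def restart(script):
    """Start the capture script as a detached background process."""
    subprocess.Popen(["php", script])

if __name__ == "__main__":
    for script in CAPTURE_SCRIPTS:
        if not is_running(script):
            restart(script)
```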
In the following section, the various techniques for data capture are discussed in
more depth.
[Figure 1. Schema of the general architecture of DMI-TCAT: capture modules (1 percent sample, keyword filtering, following users, search, user timelines, import of tweets by id, import of user networks) run against the streaming and REST APIs under cron control; tweets, tweet entities, and enriched data (expanded URLs, Klout scores) are stored in query bins, which are then sub-sampled and analyzed (frequencies, tweets, users, networks).]

3.1 Data capture
Sampling, i.e. the selection of items from a population of cases or elements, is a central
concern when using online data. Because we are essentially dealing with data stored in
information systems, in this case Twitter’s database, the properties of these systems
determine, to a large extent, selection and retrieval possibilities. More fundamentally,
they imply that the sociality they enable is structured and formalized: platforms like
Twitter define basic entities (tweets, users, lists, hashtags, etc.), their characteristics
(a tweet is no longer than 140 characters, a user account is associated with an image,
etc.), and possible actions (a tweet can be retweeted, a user followed, etc.). The
information system therefore goes a long way in framing what Uprichard calls
“the ontology of the case” (Uprichard, 2013), simply by defining which entities can
appear as a case in the first place and subsequently become part of a sample. As noted
earlier, researchers in media studies have long recognized that media themselves affect the
character of expression and interaction passing through them. While, e.g. Facebook
provides a formal and functional definition of “group,” Twitter does not. That does not
mean that a research design cannot operationalize such a construct by other means,
e.g. by collecting accounts of a predefined group like members of a parliament,
but because the functional characteristics of the Twitter platform mold actual use
practices, the technical structuring of potential units of analysis is highly relevant.
Focussing on Web media, digital methods (Rogers, 2013) thus urge us to pay attention to
the way in which digital objects are defined by and processed through online devices.
As data can be captured and stored in different ways, the decisions made on this
level have repercussions for analytical possibilities further down the chain. Hence,
capturing tools already participate in the framing of the empirical as such. Attempting
to facilitate epistemological and methodological diversity, DMI-TCAT closely
follows Twitter’s specific information structures[12], leaving the “primary” material
untouched, while allowing for plasticity in sample design and easy ways to create
subsamples from captured data sets.
Apart from technical specifications, Twitter also defines and regulates the modes
and scope of access to any data (Puschmann and Burgess, 2014). Legal constraints, API
definitions, rate limits for query calls, whitelisting, and data sharing agreements
are among the many possibilities the company has to design the ways its data can
become part of a research project. Because APIs are designed to enhance Twitter’s
value as a commercial platform by allowing third-party developers to build
applications on top of it, the needs of researchers are not explicitly taken into
account. Toolsets like DMI-TCAT thus repurpose these technical interfaces for
research. The following section shows the different technical pathways through which
data enter the system.
3.1.1 Data acquisition. DMI-TCAT relies on Twitter’s APIs and is therefore bound to
their possibilities and limitations. While we do not require familiarity with these
technical interfaces, we notify users when problems occur, e.g. when rate limits
are exceeded. Our tool connects to Twitter using the tmhOAuth[13] library and
retrieves tweets via both the streaming API and the REST API[14]. We use the former
for three different sampling techniques. First, researchers can capture a “1 percent”
random sample[15] of all tweets passing through Twitter, which can then be used for
macro- and meso-level investigations and for baselining samples retrieved by other
means (Gerlitz and Rieder, 2013). Second, we use the statuses/filter endpoint[16] to
“track” tweets containing specific keywords in real-time, which is probably the most
common way to create a sample of tweets. To give researchers maximum flexibility
and specificity, a collection is defined in a so-called “query bin,” i.e. a list of tracking
criteria[17] consisting of single or multiple keyword queries, hashtags, and specific
phrases. For example, a bin like (globalwarming, “global warming,” #IPCC) would
retrieve all tweets containing one of these three query elements and combine them into
a single data set, stored as a group of related tables in the database. Third, our system
allows for following tweets from a specified set of up to 5,000 users. This is particularly
interesting when studying a set of manually selected accounts, such as members
of a parliament or other expert lists.
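As an illustration of such keyword tracking, the sketch below sends a query bin's criteria to the statuses/filter endpoint of the streaming API as it existed at the time of writing (v1.1, since retired). It is a Python illustration rather than DMI-TCAT's actual PHP capture module; the credentials and tracking criteria are placeholders.

```python
# Sketch of tracking a "query bin" via the (now retired) v1.1 streaming API.
# Credentials and tracking criteria are illustrative placeholders.
import json
import requests
from requests_oauthlib import OAuth1

auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
track = "globalwarming,global warming,#IPCC"  # comma-separated tracking criteria

response = requests.post(
    "https://stream.twitter.com/1.1/statuses/filter.json",
    data={"track": track},
    auth=auth,
    stream=True,
)

for line in response.iter_lines():
    if not line:          # skip keep-alive newlines
        continue
    tweet = json.loads(line)
    # DMI-TCAT would write the tweet and its entities to the query bin's
    # database tables at this point.
    print(tweet["id_str"], tweet["text"][:80])
```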
One of the main limits of the streaming API is that it cannot provide historical
tweets. The REST API[18], however, enables a search for tweets up to about a week old
and, although it explicitly omits an unknown percentage of tweets[19], a data set can
thus be started retroactively with up to a week of past tweets. While this is far from ideal, it
might be the only feasible way to record traces of an unanticipated event. In the same
spirit, we use search to fill gaps in data capture resulting from network outages or
other technical problems. Finally, the REST API allows for retrieving the last 3,200
tweets for each user in a set, providing a level of historicity for user samples, and the
retrieval of follower/followee networks.
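A sketch of how such a user timeline could be paged through with the REST API of the time (v1.1): up to 200 tweets are requested per call, moving backwards with max_id until the roughly 3,200-tweet ceiling is reached. The screen name and credentials are placeholders, and this is an illustration rather than DMI-TCAT's own implementation.

```python
# Sketch of retrieving up to ~3,200 recent tweets of one account via the
# (now retired) v1.1 REST API, paging backwards with max_id.
import requests
from requests_oauthlib import OAuth1

auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
url = "https://api.twitter.com/1.1/statuses/user_timeline.json"

tweets, max_id = [], None
while True:
    params = {"screen_name": "some_account", "count": 200, "include_rts": 1}
    if max_id:
        params["max_id"] = max_id
    batch = requests.get(url, params=params, auth=auth).json()
    if not batch:
        break
    tweets.extend(batch)
    max_id = min(t["id"] for t in batch) - 1  # continue before the oldest tweet seen

print(len(tweets), "tweets collected")
```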
Additionally, DMI-TCAT takes Twitter’s data sharing policy into account, which
allows for sharing of tweet ids but not of messages and metadata themselves[20].
Our tool is therefore able to reconstruct a data set from a list of ids and can export such
a list as well. Along the same lines, we provide import scripts for yTK databases or a set
of Twitter JSON files captured by other means (e.g. streamR).
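The reconstruction of a data set from shared tweet ids can be sketched as follows, using the statuses/lookup endpoint of the v1.1 REST API, which accepted up to 100 ids per request. Ids and credentials are placeholders; this is an illustration of the principle, not the toolset's import script.

```python
# Sketch of rebuilding tweets from shared ids via the (now retired)
# v1.1 statuses/lookup endpoint, 100 ids per request.
import requests
from requests_oauthlib import OAuth1

auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
url = "https://api.twitter.com/1.1/statuses/lookup.json"

tweet_ids = ["355555555555555555", "355555555555555556"]  # placeholder ids

recovered = []
for i in range(0, len(tweet_ids), 100):
    chunk = tweet_ids[i:i + 100]
    batch = requests.post(url, data={"id": ",".join(chunk)}, auth=auth).json()
    recovered.extend(batch)  # deleted or protected tweets are silently omitted

print(len(recovered), "tweets rehydrated")
```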
Taken together, these possibilities allow for a wide array of sampling techniques, in
line with the principle of methodological flexibility.
3.1.2 Data storage and performance. Following the arguments for paying close
attention to Twitter’s informational structures outlined above, our database layout
mimics the shape of the data returned by the API[21]. This means that tweets and their
metadata, hashtags, URLs, and mentions are stored in separate tables. DMI-TCAT
therefore does not need to extract those entities at analysis time, which makes querying
the database much faster.
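To illustrate this layout, the sketch below splits a single tweet object, as returned by the v1.1 API, into rows destined for separate tweet, hashtag, URL, and mention tables. The table and field names are ours and do not necessarily correspond to DMI-TCAT's actual schema.

```python
# Sketch of splitting one v1.1 tweet object into rows for separate tables,
# mirroring the entity structure of the API; field names are illustrative.
def tweet_to_rows(tweet):
    tweet_row = {
        "id": tweet["id_str"],
        "user": tweet["user"]["screen_name"],
        "created_at": tweet["created_at"],
        "text": tweet["text"],
        "retweet_count": tweet.get("retweet_count", 0),
    }
    entities = tweet.get("entities", {})
    hashtag_rows = [{"tweet_id": tweet["id_str"], "text": h["text"]}
                    for h in entities.get("hashtags", [])]
    url_rows = [{"tweet_id": tweet["id_str"], "url": u["expanded_url"]}
                for u in entities.get("urls", [])]
    mention_rows = [{"tweet_id": tweet["id_str"], "screen_name": m["screen_name"]}
                    for m in entities.get("user_mentions", [])]
    return tweet_row, hashtag_rows, url_rows, mention_rows
```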
While there are indeed limits to the amount of data one can store and analyze
without moving into the complicated and costly world of distributed computing,
a well-designed and indexed database structure, combined with optimized database
queries, means that off-the-shelf consumer hardware can handle much larger quantities
of tweets than sometimes argued (Bruns and Liang, 2012). We are currently running
DMI-TCAT on a cheap Linux machine with four processor cores, a 512GB SSD,
and 32GB of RAM, using the default LAMP[22] stack. At the time of writing, we have
captured over 700 million tweets and basic analyses for even the largest query
bins – over 50 million tweets in a single data set – generally complete in under a
minute, allowing for iterative approaches to analysis. More complex forms of analysis,
such as the creation of mention networks, can take several minutes to complete,
however. While we have not systematically evaluated how far our architecture can
scale, it seems safe to say that hundreds of millions of tweets in a single data set should
still be workable, but moving on to the next order of magnitude would certainly require
a fully distributed approach to tools and infrastructure that is beyond the scope of
our software.
3.1.3 Data enrichment. One area where our system strays from simply capturing
and storing data provided by the API is data enrichment. We currently follow two
directions: URL expansion and the addition of Klout scores. First, many URLs passing
through Twitter are shortened, and although Twitter provides the “final” URL for its
own shortening service, this is not the case for third party shortening services such
as bit.ly. However, in the tradition of other digital methods tools (Rogers, 2010),
URLs and, in particular, domain names are considered as crucial components of
a tweet’s message and a robust means for actor identification and content qualification.
DMI-TCAT therefore includes a script that follows all URLs to their endpoint,
adds the location to the URL table, and extracts the domain name. Second, we provide
the option to retrieve users’ Klout score[23], a proprietary metric set in the sociometric
tradition that produces an “influence” rating based on data from eight different
social media platforms. While caution is in order when using proprietary metrics,
Klout scores are commonly used and offer a glimpse into users’ activities beyond the
Twitter platform.
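The URL expansion step can be sketched as follows: each shortened URL is followed through its redirects and the final location and domain name are recorded for the URL table. This is a minimal illustration under our own assumptions rather than DMI-TCAT's actual expansion script.

```python
# Minimal sketch of URL expansion: follow redirects to the final location
# and extract the domain name.
from urllib.parse import urlparse
import requests

def expand_url(short_url):
    try:
        # HEAD keeps traffic low; some servers only answer a full GET.
        response = requests.head(short_url, allow_redirects=True, timeout=10)
        final_url = response.url
    except requests.RequestException:
        final_url = short_url  # keep the original if expansion fails
    return final_url, urlparse(final_url).netloc

print(expand_url("http://bit.ly/1dNVPAW"))  # placeholder shortened URL
```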
3.2 Data analysis
In contrast to yTK, DMI-TCAT is not limited to capturing data but also provides
analytical techniques to researchers, in a way that strikes a balance between ease
of use and analytical flexibility. We try to enable approaches spanning the “three
major areas of analysis” (Bruns and Liang, 2012) in Twitter research – tweet statistics
and activity metrics, network analysis, and content analysis – but also facilitate
geographical analysis, ethnographic research, and even textual hermeneutics.
Our tool can thus be used in a wide variety of projects, including studies of
everyday conversation, breaking news, crisis communication, political activism,
(citizen) journalism, second screen applications, lifestyle and brand communication,
information diffusion, social patterns, ideological frames, sentiment analysis,
prediction, and so forth.
Besides providing a variety of analytical pathways and facilitating the integration
of additional modules, we emphasize epistemic plurality – and lower the cost of
development – by embracing Marres’ (2012) assessment that “social research becomes
noticeably a distributed accomplishment: online platforms, users, devices, and
informational practices actively contribute to the performance of digital social
research” (Marres, 2012, p. 139). Pushing this “redistribution of method” further, we
privilege the export of derived data in standard formats, to be analyzed in software
packages chosen by researchers themselves, over interactive interfaces and
ready-made (visual) outputs. This makes the tool less convenient for users without
experience in data analysis, but the recent emergence of tools such as the Gephi[24]
graph visualization and manipulation software, which are at the same time easy to use
and much more powerful than any Web interface, justifies this compromise.
This points to a conundrum that any toolmaker faces: to what extent and in what
way does research software actually shape research practices? How to justify decisions
and how to organize the design process? While there are few firm guidelines, the above
describes a particular balance between methodological flexibility and ease of use that
is the outcome of constant interaction with students of a large MA class on digital
methods, with the participants of multiple workshops, and with several research
projects at the University of Amsterdam relying on DMI-TCAT. To accommodate
this interaction, we rely on an agile software development approach using rapid
prototyping, iterative updating, and a modular architecture. While decisions
necessarily have to be made, they can be shared in a flexible way because tool-making
and actual research remain tightly coupled. Most of the analytical outputs DMI-TCAT
provides were thus built in response to particular analytical requirements from either
ourselves or from researchers we have been working with. However, the collaborative
setting cannot fully alleviate the fact that toolmakers necessarily intervene deeply in the
epistemic process of methods development, by framing ideas in terms of feasibility, cost,
formalization, and so forth, but also by constantly translating and connecting social
science and humanities concepts to computational techniques.
In the following sections, we will detail the analytical techniques implemented
in DMI-TCAT, starting with sub-sampling from stored data and continuing with
summary presentations of different analytical outputs.
3.2.1 Sub-sampling. DMI-TCAT enables the flexible constitution of a subsample.
After selecting a data set, as defined by a query bin, different techniques to filter the
data set are available.
By sub-selecting from the data set, a user can zoom in on a specific time period or
on tweets matching certain criteria (Figure 2). She can choose to include only those
tweets matching a particular phrase such as a word, hashtag, or mention; she can
exclude tweets matching a specific phrase; finally, she can focus on tweets by
particular users or tweets mentioning a specific (part of a) URL. All input fields accept
multiple phrases or keywords to specify (AND) or expand (OR) the selection via
Boolean queries. After updating the overview, a summary of the selection is generated
(Figure 3). While these filters are far from exhaustive, they allow for both the
constitution of a subsample and what could be described as “interactive probing,” i.e.
the back and forth movement between query and overview that “progresses in an
iterative process of view creation, exploration, and refinement” (Heer and
Shneiderman, 2012). This also echoes Uprichard’s argument that social research
rarely proceeds in linear fashion, but that “cases are ‘made’, both conceptually
and empirically, by constantly and iteratively re-shaping and re-matching theory and
empirical evidence together” (Uprichard, 2013, p. 5).
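The filtering logic can be illustrated with a small sketch that operates on a list of tweet dictionaries rather than on the database: query phrases are combined conjunctively, exclusions and a date range are applied, and nothing is deleted. The function and field names are ours; the tool itself translates such filters into SQL queries on a query bin's tables.

```python
# Sketch of non-destructive sub-sampling over a list of tweet dicts.
from datetime import datetime, date

def in_subsample(tweet, query=None, exclude=None, from_user=None,
                 start=None, end=None):
    """Return True if the tweet passes all filters (nothing is ever deleted)."""
    text = tweet["text"].lower()
    if query and not all(q.lower() in text for q in query):  # AND; the interface also allows OR
        return False
    if exclude and any(x.lower() in text for x in exclude):
        return False
    if from_user and tweet["user"]["screen_name"].lower() != from_user.lower():
        return False
    created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    if start and created.date() < start:
        return False
    if end and created.date() > end:
        return False
    return True

# Example: the "snowden" selection from the prism bin for July 2013.
# subsample = [t for t in tweets
#              if in_subsample(t, query=["snowden"],
#                              start=date(2013, 7, 1), end=date(2013, 7, 31))]
```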
In addition, sub-sampling can be seen as a means for non-destructive data cleaning
in the sense that tweets matching specific criteria can be excluded without being
deleted. While this is currently implemented in rudimentary fashion only, the question
of data cleaning is crucial for both the reliability of the data and the question
of epistemic plurality, or as Gitelman (2013, p. 5) puts it “data [y]needtobe
understood [y] according to the uses which they are and can be put”; one researcher’s
noise is another one’s object of study.
The overview interface (Figure 3) lists the current selection criteria and shows the
number of tweets in the subset, the number of distinct users, and the proportion
of tweets containing URLs. Additionally, a line graph shows the frequency of tweets,
distinct users, distinct locations, and geo-tagged tweets per hour or per day, depending
on the scope of the selection. If a search query is specified, a second line graph indicates
the relationship between the subset and the full dataset. The example in Figure 3 thus
not only shows the shrinking absolute number of tweets mentioning (snowden) in our
“prism” dataset over the month of July 2013, but also the relative decline of tweets
mentioning the whistleblower’s name in relation to the whole data set.
Subsequent analytical techniques apply to the selected subsample, although the full
data set is, indeed, a possible selection. With the exception of several interactive
modules, all analyses are provided as exports in standard tabulated formats or as
network files. Filenames include filters and settings used in the interface so that
researchers know how data has been derived and which software version was used.
The following four sections describe the various types of techniques currently
implemented in our tool and mimic the sections of the actual interface.
[Figure 2. Screenshot of the “data selection” part of the DMI-TCAT interface: a query bin is selected (here “globalwarming”, with 22,211,422 tweets captured between 2012-11-23 and 2014-02-17) and filtered by query, exclusion, user, (part of a) URL, and start and end date.]

[Figure 3. Screenshot of the “overview” section of the DMI-TCAT interface: for the “prism” bin (nsa, palanteer, prism, spying, cyberwar, wiretap), filtered by the query “snowden” between 2013-07-01 and 2013-07-31, it lists 262,023 tweets from 117,802 distinct users and the proportion of tweets containing links (30.6 percent without links), and plots tweets, distinct users, locations, and geo-coded tweets per day as well as the query as a percentage of the full data set. Dates and times are in GMT.]

3.2.2 Tweet statistics and activity metrics. The first set of exports covers some of the
basic statistics of the sub-sample one may want to consult. To get a quick
characterization of the types of tweets in the sample a table is provided with the total
number of tweets, the number of tweets with URLs, hashtags, and mentions, as well as
the number of retweets and the number of unique users in the selection. To characterize
user activity and visibility, a table is provided which lists the minimum, maximum,
average, and median number of tweets sent, the number of users mentioned, the
number of followers and followees, and the number of URLs tweeted, all per user.
Furthermore, we follow emerging standards in Twitter research and allow for easy
analysis of basic platform elements over time (cf. Bruns and Burgess, 2012). This
includes counts of hashtags, user tweets and mentions, URLs and domain names, as
well as retweets[25].
Each of these outputs can be segmented into hourly, daily, weekly, monthly, and
yearly intervals; self-chosen intervals – e.g. to delineate distinct periods – are possible
as well and permit fine-grained temporal analysis. Although a full discussion
of the various metrics and their uses has to be deferred to a future publication, it is
important to mention that certain outputs provide deeper analytical perspectives: the
hashtag-user output, for example, not only provides a hashtag count per time interval,
but also the number of distinct users sending tweets containing the hashtag, the
number of distinct users mentioned in tweets with the hashtag, and the number of
tweets mentioning the hashtag.
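As a sketch of how such interval-based counts can be derived from stored tweets, the snippet below tallies, per day, the number of tweets containing each hashtag and the number of distinct users employing it. DMI-TCAT computes equivalent figures with SQL queries on its entity tables; the field names here follow the v1.1 tweet object.

```python
# Sketch of per-day hashtag counts with the number of distinct users.
from collections import Counter, defaultdict
from datetime import datetime

def hashtag_frequencies(tweets):
    counts = Counter()        # (day, hashtag) -> number of tweets
    users = defaultdict(set)  # (day, hashtag) -> distinct users
    for tweet in tweets:
        created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
        day = created.date().isoformat()
        for tag in tweet.get("entities", {}).get("hashtags", []):
            key = (day, tag["text"].lower())
            counts[key] += 1
            users[key].add(tweet["user"]["screen_name"])
    # Rows of (day, hashtag, tweet count, distinct users), ready for tabular export.
    return [(day, tag, n, len(users[(day, tag)]))
            for (day, tag), n in sorted(counts.items())]
```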
3.2.3 Tweet exports. This section of the interface regroups modules providing lists
of actual tweets for further analysis. A random set of a user-specified number of tweets
can facilitate content analysis by providing a sample of items to be (manually) coded
into categories or otherwise analyzed. It is also possible to simply export all tweets and
their metadata from the current selection or, alternatively, only those that have been
retweeted or come with geo-location data.
A statistical exploration of the data via the methods outlined in the previous section,
combined with different analyses of actual tweets enables powerful mixed methods
approaches (cf. Lewis et al., 2013). For example, researchers can export a chronological
list of the most retweeted messages, which is an interesting means to reconstruct and
narrate the timeline of an event (Rogers et al., 2009). From close reading to text mining,
the easy availability of actual tweets is crucial for both quantitative and qualitative
examinations of content.
3.2.4 Networks. The third set of outputs focusses on network perspectives and
produces outputs in either GEXF or GDF formats. The difference with statistical
approaches lies not so much in what is being looked at, but rather in how the data are
represented and analyzed. At the moment, the main focus lies on users, hashtags, and
URLs and the various relationships between these entities. Two outputs represent
interaction networks where users are connected either through mentions or through
direct replies. Because these files can be opened in different graph analysis tools, a wide
variety of social network analysis techniques can be applied. This affords perspectives
on interaction patterns that go beyond mere frequency and allows, for example,
identifying cliques or sub-conversations.
A co-hashtag network output allows for a type of content analysis that focusses on
relationships between these signal words: if two hashtags appear in the same tweet, a
link is established; the more often they co-occur, the stronger the link. By applying
network analysis techniques, one can get an overview of the subject variety in a set of
tweets and analyze relationships between subtopics.
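The co-hashtag procedure can be sketched as follows: hashtag pairs co-occurring in a tweet are counted and written to a weighted graph that can be opened in Gephi. The sketch uses the networkx library and GEXF output as stand-ins for DMI-TCAT's own GEXF/GDF export.

```python
# Sketch of a co-hashtag network: hashtags co-occurring in a tweet are linked,
# with co-occurrence counts as edge weights; exported as GEXF for Gephi.
from itertools import combinations
import networkx as nx

def co_hashtag_graph(tweets):
    graph = nx.Graph()
    for tweet in tweets:
        tags = sorted({h["text"].lower()
                       for h in tweet.get("entities", {}).get("hashtags", [])})
        for a, b in combinations(tags, 2):
            weight = graph[a][b]["weight"] + 1 if graph.has_edge(a, b) else 1
            graph.add_edge(a, b, weight=weight)
    return graph

# graph = co_hashtag_graph(tweets)          # 'tweets' is a list of tweet dicts
# nx.write_gexf(graph, "co_hashtags.gexf")  # open the resulting file in Gephi
```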
Finally, there are a number of bipartite graph outputs – networks containing
entities of two kinds, constituted through co-occurrence in a tweet – in particular
hashtag-user and hashtag-URL/domain networks. Figure 4 provides a short example
for the latter. These techniques allow for the structural analysis of relationships
between entities and are particularly useful for locating actors (users or domains) in
relation to issues (hashtags).
3.2.5 Experimental modules. Because DMI-TCAT is modular, it is easy to add new
analytical techniques. A series of experimental modules provide interactive interfaces
or dashboards rather than file exports. A detailed presentation is beyond the scope of
this paper, but our “cascade” module, a means to visually explore temporal structures
and retweet patterns, serves as an example.
This module (Figure 5) provides a ground-level view of tweet activity by either
charting every single tweet in the current selection or only those above a certain
retweet threshold. User accounts are distributed vertically; tweets – shown as
dots – are spread out horizontally over time. Lines indicate retweets. At the top we see
the typical activity pattern of a retweet bot (line of dark dots). This view requires
a large screen and is limited to small data selections, but because tweet text becomes
visible when hovering over a node, it allows for the close reading of a conversation or
debate and, in a sense, links to ethnographic observation.
[Figure 4. Gephi graph visualization of a bipartite graph from our “datascience” data set: hashtags are in dark gray and domain names in light gray.]

4. Conclusions
In this paper, we have described a tool to capture and analyze data from Twitter. We
have shown how particular design decisions can be related to wider considerations
concerning the role of software in academic research. Our proposition is not simply
a “solution” to a set of “problems.” Rather, it is an attempt to connect the question
of toolmaking for social and cultural research to debates dealing with the “politics of
method” (Savage and Burrows, 2007, p. 895) in ways that are not merely theoretical or
critical. Platforms like Twitter pose a number of fundamental challenges to scholars.
Beyond being attentive to these questions, we have to ask how the tools we use to
practice research can proactively take those challenges into account. The canonical
style of both research reporting and technical publication leaves little space to
connect to fundamental interrogations of methodology and its repercussions for the
production of knowledge. But the very nature of computational methods, which deeply
entangle research design with technical work, requires us to engage toolmaking from
different angles. When even small decisions in database design can lead to huge
differences in performance, potentially having profound effects on the way researchers
interact with the tools and data, we realize that even details in implementation can
have substantial epistemic effects. Are we missing a genre of academic text that
permits a combination of technical presentation and general methodological
discussion? The direct relationship between engineering questions and
methodological considerations is a subject that is often neglected and merits much
more critical debate.
This paper in no way claims to serve as a blueprint for such an endeavor, but in
order to engage with the enormous methodological challenges, we feel compelled to
experiment with forms of writing and academic expression that attempt to span
various disciplinary traditions, even if this might lead to disorientation and friction
with established conventions.
[Figure 5. Partial screenshot of the “cascade” module interface, showing a list of user accounts distributed vertically with their tweets and retweet lines plotted over time.]
The design of DMI-TCAT is inspired by exchanges and debates with scholars
from a variety of disciplines and our attempt to propose a flexible and extensible
tool that accommodates a wide array of methodological approaches is directly
motivated by the desire to keep computational work open for various epistemic
sensibilities.
Notes
1. Available at https://github.com/digitalmethodsinitiative/dmi-tcat (accessed February 19,
2013).
2. http://discovertext.com (accessed September 14, 2013).
3. http://truthy.indiana.edu (accessed September 14, 2013).
4. http://topsy.com (accessed September 14, 2013).
5. http://twitonomy.com (accessed September 14, 2013).
6. https://hootsuite.com (accessed September 14, 2013).
7. https://scraperwiki.com (accessed September 14, 2013).
8. http://cran.r-project.org/web/packages/streamR/ (accessed September 14, 2013).
9. https://github.com/WebEcologyProject/140kit (accessed September 14, 2013).
10. https://github.com/540co/yourTwapperKeeper (accessed September 14, 2013).
11. DMI-TCAT stores every field returned by Twitter (around 40 per tweet if the tweet
contains mentions, hashtags, and URLs) while yTK only stores 13 of the most basic fields
per tweet and excludes fields such as retweet_id, in_reply_to_status_id, entities, and many
fields related to the sender of the tweet.
12. The API documentation is thus an essential part of DMI-TCAT’s documentation. https://
dev.twitter.com/docs/platform-objects (accessed September 1, 2013) specifies Twitter’s
entities and possible actions.
13. tmhOAuth by Matt Harris is available at https://github.com/themattharris/tmhOAuth. It
implements all possible calls to the Twitter APIs in PHP (accessed September 1, 2013).
14. For an explanation of the differences, see https://dev.twitter.com/docs/streaming-apis
(accessed September 12, 2013).
15. https://dev.twitter.com/docs/api/1.1/get/statuses/sample (accessed September 12, 2013).
16. https://dev.twitter.com/docs/api/1.1/post/statuses/filter (accessed September 12, 2013).
17. Tracking criteria follow https://dev.twitter.com/docs/streaming-apis/parameters#track
(accessed September 12, 2013).
18. https://dev.twitter.com/docs/api/1.1 (accessed September 12, 2013).
19. “The Search API is not complete index [sic] of all Tweets, but instead an index of recent
Tweets. At the moment that index includes between 6-9 days of Tweets.” https://dev.twitter.
com/docs/using-search (accessed September 12, 2013).
20. According to I.4.A from https://dev.twitter.com/terms/api-terms (accessed September 10,
2013), “If you provide downloadable datasets of Twitter Content or an API that returns
Twitter Content, you may only return IDs (including tweet IDs and user IDs).”
21. We aim to always incorporate all fields returned by Twitter’s APIs. We store all data in
UTF-8 and are thus able to capture and analyze tweets in any language. See https://dev.
twitter.com/docs/counting-characters (accessed September 10, 2013) for more information.
22. http://en.wikipedia.org/wiki/LAMP_%28software_bundle%29 (accessed September 12,
2013). DMI-TCAT has been tested on Linux and OSX.
23. http://klout.com/corp/how-it-works (accessed September 7, 2013).
24. Available as free software on http://gephi.org (accessed September 12, 2013).
25. We identify retweets by using Twitter’s retweet_id API field as well as by grouping
“identical” tweets, thus also including possible manual retweets.
References
Boyd, D. and Crawford, K. (2012), “Critical questions for big data”, Information, Communication
& Society, Vol. 15 No. 5, pp. 662-679.
Bruns, A. and Burgess, J. (2012), “Researching news discussion on Twitter: new methodologies”,
Journalism Studies, Vol. 13 Nos 5-6, pp. 801-814.
Bruns, A. and Liang, Y.E. (2012), “Tools and methods for capturing Twitter data during natural
disasters”, First Monday, Vol. 17 No. 4, p. 5, available at: http://firstmonday.org/ojs/index.
php/fm/article/view/3937/3193 (accessed February 26, 2014).
Cioffi-Revilla, C. (2010), “Computational social science”, Wiley Interdisciplinary Reviews:
Computational Statistics, Vol. 2 No. 3, pp. 259-271.
Gerlitz, C. and Rieder, B. (2013), “Mining one percent of twitter: collections, baselines, sampling”,
M/C Journal, Vol. 16 No. 2, available at: www.journal.media-culture.org.au/index.php/
mcjournal/article/view/620 (accessed February 26, 2014).
Gitelman, L. (Ed.) (2013), “Raw Data” Is an Oxymoron, MIT Press, Cambridge, MA.
Heer, J. and Shneiderman, B. (2012), “Interactive dynamics for visual analysis”, Queue, Vol. 10
No. 2, pp. 30-55.
Kirschenbaum, M.G. (2010), “What is digital humanities and what’s it doing in English
departments?”, ADE Bulletin, Vol. 150 No. 7, pp. 55-61.
Latour, B. (2005), Reassembling the Social: An Introduction to Actor-Network-Theory, Oxford
University Press, New York, NY.
Lazer, D., Pentland, A.(S.), Adamic, L., Aral, S., Barabasi, A.L., Brewer, D., Christakis, N.,
Contractor, N., Fowler, J., Gutmann, M., Jebara, T., King, G., Macy, M., Roy, D. and
Van Alstyne, M. (2009), “Life in the network: the coming age of computational social science”,
Science, Vol. 323 No. 5915, pp. 721-723.
Lewis, S.C., Zamith, R. and Hermida, A. (2013), “Content analysis in an era of big data: a hybrid
approach to computational and manual methods”, Journal of Broadcasting & Electronic
Media, Vol. 57 No. 1, pp. 34-52.
McKelvey, K. and Menczer, F. (2013), “Design and prototyping of a social media observatory”,
Proceedings of the 22nd international conference on World Wide Web companion,
Rio de Janeiro, International World Wide Web Conferences Steering Committee, Geneva,
pp. 1351-1358.
Marres, N. (2012), “The redistribution of methods: on intervention in digital social research,
broadly conceived”, The Sociological Review, Vol. 60 No. S1, pp. 139-165.
Puschmann, C. and Burgess, J. (2014), “The politics of twitter data”, in Weller, K., et al. (Eds),
Twitter and Society, Peter Lang Publishing, New York, NY, pp. 43-54.
Rieder, B. and Röhle, T. (2012), “Digital methods: five challenges”, in Berry, D.M. (Ed.),
Understanding Digital Humanities, Palgrave Macmillan, Basingstoke, pp. 67-84.
Rogers, R. (2010), “Mapping public web space with the Issuecrawler”, in Brossard, C. and
Reber, B. (Eds), Digital Cognitive Technologies: Epistemology and Knowledge Society, Wiley,
London, pp. 115-126.
Rogers, R. (2013), Digital Methods, MIT Press, Cambridge, MA.
Rogers, R., Jansen, F., Stevenson, M. and Weltevrede, E. (2009), “Mapping democracy”, in
Finlay, A. (Ed.), Global Information Society Watch 2009, Association for Progressive
Communications and Hivos, Uruguay, pp. 47-57, available at: www.giswatch.org/fr/node/
158 (accessed February 26, 2014).
Savage, M. and Burrows, R. (2007), “The coming crisis of empirical sociology”, Sociology, Vol. 41
No. 5, pp. 885-899.
Thrift, N. (2005), Knowing Capitalism, Sage, London.
Uprichard, E. (2013), “Sampling: bridging probability and non-probability designs”, International
Journal of Social Research Methodology, Vol. 16 No. 1, pp. 1-11.
Uprichard, E., Burrows, R. and Byrne, D. (2008), “SPSS as an ‘inscription device’: from causality
to description?”, The Sociological Review, Vol. 56 No. 4, pp. 606-622.
Winner, L. (1980), “Do artifacts have politics?”, Daedalus, Vol. 109 No. 1, pp. 121-136.
About the authors
Erik Borra is a PhD Candidate and Lecturer at the University of Amsterdam’s MA Program in
New Media. His research concerns the Web as a source of data for social and cultural research,
paying particular attention to search engine queries, Wikipedia edit histories, and social
networks. Erik is also scientific programmer for the Digital Methods Initiative and is currently
involved in the European research project “Electronic Maps to Assist Public Science” (EMAPS).
Erik Borra is the corresponding author and can be contacted at: borra@uva.nl
Dr Bernhard Rieder is an Associate Professor of New Media at the University of Amsterdam.
Besides developing and theorizing digital methods, his research focusses on the history, theory,
and politics of software, particularly on the role of algorithms in social processes and the
production of knowledge. He has worked as a Web Programmer on various projects and is
currently writing a book that investigates the history and cultural significance of information
processing techniques.
... Several social science computational methods are used in this book, and this section attempts to briefly summarize them. In terms of data collection, a number of webometric tools have been employed in order to retrieve data from YouTube-Netvizz and Webometric Analyst 2.0 (Rieder, 2015;Thelwall, 2009); Facebook-NVivo's N-Capture and Netvizz; as well as Twitter-Crimson Hexagon and the Boston University Twitter Collection and Analysis Toolkit (BU-TCAT; Borra & Rieder, 2014;Groshek, 2014). In some cases, data collection has been done manually, such as in the case of some Instagram images. ...
... The desire to understand how ride itself is portrayed by participants and organizers to other community members and their social media followers overall necessitated a focus on images associated with the Ride. A lack of standardized, targeted data collection strategies for social media posts, with some scholars preferring a manual approach (Carrotte et al., 2017) while others rely on web code to sort and collect posts (Borra & Rieder, 2014;Peruta & Shields, 2017), necessitated the development of a sampling and analysis strategy. ...
Article
Full-text available
The annual Chief Big Foot Memorial Ride represents the longest continuous example of Lakota memorial and resistance rides in contemporary Lakota activism. First held in 1986, this commemoration of the journey of Chief Big Foot’s band of Lakotas and the subsequent Wounded Knee Massacre in 1890 now reaches beyond the confines of the ride itself through the use of social media profiles that serve to both publicize and document the ride. This article seeks to understand the way that photographs from the rides influence the types and amount of engagement it receives on social media. Using a qualitative and quantitative approach, 304 images and their associated engagements from the 2018 ride were analyzed using content analysis and a grounded theory approach. This revealed that certain characteristics gave rise to the construction of a counterpublic around this ride. Findings suggest that both the content of photos and types of authors for posts influenced the number and types of engagements received by certain photographs. Given the relative isolation of many Indigenous communities in the Americas, these findings suggest that certain strategies for social media posts by Indigenous social movements can overcome these barriers to spread their message to a wider audience through strategic use of imagery associated with these movements.
... To obtain these analytical data we monitored the flow of interactions on the social media platform Twitter, using the Digital Methods Initiative Twitter Capture and Analysis Toolset (DMI-TCAT), developed at the University of Amsterdam (Borra & Rieder, 2014). Based on the application of this tool, the study proposed a dedicated methodological design, developed ad hoc to examine the specific problem of this investigation. ...
Article
The use of Twitter as a tool for mobilisation has made digital social and political activism a growing area of interest in communication research. Scholars have underscored the effectiveness of Twitter in galvanising the opinion of broad sectors of the public and expressing the indignation of average citizens on issues of social concern (Bruns et al., 2015; Martínez, 2017). The rise of feminist social media activism has prompted a number of studies on the feminist movement’s use of hashtags to foster online conversations on specific issues (Jinsook, 2017; Turley & Fisher, 2018; etc.). This article examines the correlation between the degree of ideological commitment amongst social media users and the nature of their Twitter conversations on a given issue. The analysis focuses on Twitter conversations generated by feminists, influencers, journalists and politicians in reaction to the controversial sentencing of the Wolf Pack (La Manada) – a gang of men involved in a sexual assault perpetrated during the San Fermín festival in Pamplona. Big data techniques were used to explore the nature of messages containing four highly charged hashtags central to feminist discourse on this issue: #YoSiTeCreo (Yes, I believe you), #HermanaYoSíTeCreo (Yes, sister, I believe you), #Cuéntalo (Talk about it) and #NoEstásSola (You are not alone). Our findings indicate that the levels of ideological commitment of Twitter users participating in what was essentially a feminist conversation varied to an extent that impeded serious interaction amongst them, either online or offline. From the perspective of communication strategy, feminist hashtag activism would appear to be an intermediate step in a longer process of creating a higher consciousness regarding gender equality issues in Spain.
... When Turkey declared the suspension of the 2016 deal with the EU at the end of February 2020, we started collecting tweets using relevant keywords (e.g. Greece refugee(s), refugeesgr, Turkey asylum) through DMI-TCAT (Borra and Rieder, 2014), which provides a real-time stream of tweets, including retweets and replies. 31 March 2020 is the last day of our sampling period because we observed a local minimum in the 5-day rolling-window variance of the frequency distribution (see Figure 1). ...
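The cut-off heuristic mentioned in this excerpt, a local minimum in the 5-day rolling-window variance of daily tweet counts, can be illustrated with a short pandas sketch; the file and column names below are hypothetical and the snippet is not taken from the cited study:

import pandas as pd

# Daily tweet counts indexed by date; 'daily_counts.csv' is a hypothetical file
# with columns 'date' and 'tweets'.
counts = pd.read_csv("daily_counts.csv", parse_dates=["date"]).set_index("date")["tweets"]

# Variance of the frequency distribution over a 5-day rolling window.
rolling_var = counts.rolling(window=5).var()

# Days where the rolling variance reaches a local minimum can serve as candidate
# end points for the sampling period.
local_min_days = rolling_var[(rolling_var.shift(1) > rolling_var) & (rolling_var.shift(-1) > rolling_var)]
print(local_min_days)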
Article
Full-text available
This article explores how Europe’s border crises in the post-refugee-‘crisis’ years were discussed on the micro-blog Twitter, through an in-depth analysis of boundary making. Our focus is on the tweets of the top influencers of the hashtag #IStandWithGreece, who strategically promoted ideologies ranging from white supremacism to Greek nationalism, glued together by an antimigrant stance during a border ‘crisis’ at Europe’s periphery. This network of intolerance promoted a representation of migrants as ‘pawns’: seen like chess pieces, with no value in their own right, literally pushed towards Europe by Turkey, which elevated them into a sizable threat. Within this, Europe was represented as a paradoxical other, the fallen Self, for not rising to the opportunity to protect its sovereignty and identity through more securitization. Although diffused by extreme antimigrant Twitterers, these tweets, we argue, offer a more overtly racist expression of otherwise mainstream European (Union) discourses and politics on migration. Effectively, #IStandWithGreece’s influencers functioned as Europe’s alter-ego mouthpiece, saying the unsayable on social media, with its affordances contributing to the normalization of an oppressive and restrictive European border management.
Book
Disinformation and so-called fake news are contemporary phenomena with rich histories. Disinformation, or the willful introduction of false information for the purposes of causing harm, recalls infamous foreign interference operations in national media systems. Outcries over fake news, or dubious stories with the trappings of news, have coincided with the introduction of new media technologies that disrupt the publication, distribution and consumption of news -- from the so-called rumour-mongering broadsheets centuries ago to the blogosphere recently. Designating a news organization as fake, or der Lügenpresse , has a darker history, associated with authoritarian regimes or populist bombast diminishing the reputation of 'elite media' and the value of inconvenient truths. In a series of empirical studies, using digital methods and data journalism, the authors inquire into the extent to which social media have enabled the penetration of foreign disinformation operations, the widespread publication and spread of dubious content as well as extreme commentators with considerable followings attacking mainstream media as fake.
Article
The growing importance of Big Data in the food industry enables businesses to leverage information to gain a competitive advantage. This paper provides a systematic literature review (SLR) to offer insight into the state of the art of Big Data applications in the food industry. The SLR relies on the available literature to provide context and a theoretical construct, and to identify gaps. Based on the findings, we suggest recommendations, identify limitations, and suggest policy implications and future directions. Search databases were examined and 38 relevant studies were identified for retrospective analysis. The review shows that Big Data supports the food industry in ways that enable the use of Artificial Intelligence to manage restaurants and of mobile-based applications to support consumers with restaurant selection. This SLR opens new avenues for future research on the importance of Big Data in the food industry, which will help researchers and practitioners in the effective utilization of Big Data.
Chapter
Social media is noted for its usefulness and contribution to destination marketing and management. Social media data is particularly valued as a source to understand issues such as tourist behavior and destination marketing strategies. Among the social media platforms, Twitter is one of the most utilized in research. Its use raises two issues: the challenge of obtaining historical data and the importance of qualitative data analysis. Regarding these issues, the chapter argues that retrieving tweets using hashtags and keywords on the Twitter website provides a corpus of tweets that is valuable for research, especially for qualitative inquiries. In addition, the value of qualitative analysis of Twitter data is presented, demonstrating, among other things, how such an approach captures in-depth information, enables appreciation and inclusion of the nonconventional language used on social media, distinguishes between “noise” and useful information, and recognizes information as the sum of all parts in the data.
Chapter
Big data has become a significant topic of interest for the data engineering field and for academics in learning and research. Exponential data expansion is powered by the exponential growth of cyberspace and digital devices. Large amounts of data can now be stored and analyzed at low cost due to technological advancements. Big Data is a collection of real-time data from several sources that is structured, semi-structured, or unstructured. Predictive analysis offers techniques for extracting information from big data collections. Major players such as Google, Facebook, YouTube, and Amazon have recognized the potential to gain competitive advantage through Big Data and analytics. These methods provide several possibilities, such as finding patterns or improved algorithms for optimization. Big data management and analysis also pose several difficulties, including data size, quality, dependability, and comprehensiveness. This article offers a thorough overview of the Big Data and predictive analysis literature. It provides specifics on basic ideas in this developing area. We conclude by presenting the findings of our research and highlighting future research opportunities in this domain. Keywords: analytics and big data; Hadoop; predictive analysis; intelligence; pattern insights; social media
Article
This article presents the implementation of the Video Assistant Referee (VAR) as an example of the increasingly layered mediatization of sports. We argue that, while integrated into the established broadcasting protocols, VAR becomes an object of explicit reflection and popular debate—and increasingly so, when football and its TV coverage are discussed on “technologies of engagement” like Twitter. Combining the concept of mediatization with insights from Science and Technology Studies, this article discusses how and why sports systematically contribute to what we call “mediatized engagements with technologies.” The combination of football’s “media manifold” comprising epistemic technologies, television, and social media with its knowledgeable and emotionally invested audience inevitably limits the “black-boxing” of a refereeing technology. Our case study analyses how fans, journalists, and others evaluate VAR in action on Twitter during the men’s 2018 FIFA World Cup. Based on a multilingual dataset, we show, among other examples, how the media event displays the technology as a historical innovation and analyze why even the allegedly “clear and obvious” cases of its application create controversies. In conclusion, the article discusses how the layered mediatization of sports, its partisanship, and ambivalent relationship with technologies stimulate engagement far beyond the fair refereeing issue.
Book
Full-text available
In Digital Methods, Richard Rogers proposes a methodological outlook for social and cultural scholarly research on the Web that seeks to move Internet research beyond the study of online culture. It is not a toolkit for Internet research, or operating instructions for a software package; it deals with broader questions. How can we study social media to learn something about society rather than about social media use? Rogers proposes repurposing Web-native techniques for research into cultural change and societal conditions. We can learn to reapply such “methods of the medium” as crawling and crowd sourcing, PageRank and similar algorithms, tag clouds and other visualizations; we can learn how they handle hits, likes, tags, date stamps, and other Web-native objects. By “thinking along” with devices and the objects they handle, digital research methods can follow the evolving methods of the medium. Rogers uses this new methodological outlook to examine such topics as the findings of inquiries into 9/11 search results, the recognition of climate change skeptics by climate-change-related Web sites, and the censorship of the Iranian Web. With Digital Methods, Rogers introduces a new vision and method for Internet research and at the same time applies them to the Web's objects of study, from tiny particles (hyperlinks) to large masses (social media).
Article
Full-text available
Massive datasets of communication are challenging traditional, human-driven approaches to content analysis. Computational methods present enticing solutions to these problems but in many cases are insufficient on their own. We argue that an approach blending computational and manual methods throughout the content analysis process may yield more fruitful results, and draw on a case study of news sourcing on Twitter to illustrate this hybrid approach in action. Careful combinations of computational and manual techniques can preserve the strengths of traditional content analysis, with its systematic rigor and contextual sensitivity, while also maximizing the large-scale capacity of Big Data and the algorithmic accuracy of computational methods.
Article
A taxonomy of tools that support the fluent and flexible use of visualizations.
Article
Introduction Social media platforms present numerous challenges to empirical research, making it different from researching cases in offline environments, but also different from studying the “open” Web. Because of the limited access possibilities and the sheer size of platforms like Facebook or Twitter, the question of delimitation, i.e. the selection of subsets to analyse, is particularly relevant. Whilst sampling techniques have been thoroughly discussed in the context of social science research (Uprichard; Noy; Bryman; Gilbert; Gorard), sampling procedures in the context of social media analysis are far from being fully understood. Even for Twitter, a platform having received considerable attention from empirical researchers due to its relative openness to data collection, methodology is largely emergent. In particular the question of how smaller collections relate to the entirety of activities of the platform is quite unclear. Recent work comparing case based studies to gain a broader picture (Bruns and Stieglitz) and the development of graph theoretical methods for sampling (Papagelis, Das, and Koudas) are certainly steps in the right direction, but it seems that truly large-scale Twitter studies are limited to computer science departments (e.g. Cha et al.; Hong, Convertino, and Chi), where epistemic orientation can differ considerably from work done in the humanities and social sciences. The objective of the paper is to reflect on the affordances of different techniques for making Twitter collections and to suggest the use of a random sampling technique, made possible by Twitter’s Streaming API (Application Programming Interface), for baselining, scoping, and contextualising practices and issues. We discuss this technique by analysing a one percent sample of all tweets posted during a 24-hour period and introduce a number of analytical directions that we consider useful for qualifying some of the core elements of the platform, in particular hashtags. To situate our proposal, we first discuss how platforms propose particular affordances but leave considerable margins for the emergence of a wide variety of practices. This argument is then related to the question of how medium and sampling technique are intrinsically connected. Indeterminacy of Platforms A variety of new media research has started to explore the material-technical conditions of platforms (Rogers`; Gillespie; Hayles), drawing attention to the performative capacities of platform protocols to enable and structure specific activities; in the case of Twitter that refers to elements such as tweets, retweets, @replies, favourites, follows, and lists. Such features and conventions have been both a subject and a starting point for researching platforms, for instance by using hashtags to demarcate topical conversations (Bruns and Stieglitz), @replies to trace interactions, or following relations to establish social networks (Paßmann, Boeschoten, and Schäfer). The emergence of platform studies (Gillespie; Montfort and Bogost; Langlois et al.) has drawn attention to platforms as interfacing infrastructures that offer blueprints for user activities through technical and interface affordances that are pre-defined yet underdetermined, fostering sociality in the front end whilst mining for data in the back end (Stalder). Doing so, they cater to a variety of actors, including users, developers, advertisers, and third-party services, and allow for a variety of distinct use practices to emerge. 
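The recall and precision questions raised above via the fishing metaphor can be made concrete with a toy example; the tweet identifiers below are invented purely for illustration:

# Hypothetical example: tweets judged relevant to an issue vs. tweets actually
# captured by a hashtag query.
relevant = {"t1", "t2", "t3", "t4", "t5"}
captured = {"t2", "t3", "t4", "t9"}

true_positives = relevant & captured
recall = len(true_positives) / len(relevant)      # how many of the relevant fish did I get?
precision = len(true_positives) / len(captured)   # how many fish caught are relevant?

print(f"recall={recall:.2f}, precision={precision:.2f}")  # recall=0.60, precision=0.75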
The use practices of platform features on Twitter are, however, not solely produced by users themselves, but crystallise in relation to wider ecologies of platforms, users, other media, and third party services (Burgess and Bruns), allowing for sometimes unanticipated vectors of development. This becomes apparent in the case of the retweet function, which was initially introduced by users as verbatim operation, adding “retweet” and later “RT” in front of copied content, before Twitter officially offered a retweet button in 2009 (boyd, Golder, and Lotan). Now, retweeting is deployed for a series of objectives, including information dissemination, promotion of opinions, but also ironic commentary. Gillespie argues that the capacity to interface and create relevance for a variety of actors and use practices is, in fact, the central characteristic of platforms (Gillespie). Previous research for instance addresses Twitter as medium for public participation in specific societal issues (Burgess and Bruns; boyd, Golder, and Lotan), for personal conversations (Marwick and boyd; boyd, Golder, and Lotan), and as facilitator of platform-specific communities (Paßmann, Boeschoten, and Schäfer). These case-based studies approach and demarcate their objects of study by focussing on particular hashtags or use practices such as favoriting and retweeting. But using these elements as basis for building a collection of tweets, users, etc. to be analysed has significant epistemic weight: these sampling methods come with specific notions of use scenarios built into them or, as Uprichard suggests, there are certain “a priori philosophical assumptions intrinsic to any sample design and the subsequent validity of the sample criteria themselves” (Uprichard 2). Building collections by gathering tweets containing specific hashtags, for example, assumes that a) the conversation is held together by hashtags and b) the chosen hashtags are indeed the most relevant ones. Such assumptions go beyond the statistical question of sampling bias and concern the fundamental problem of how to go fishing in a pond that is big, opaque, and full of quickly evolving populations of fish. The classic information retrieval concepts of recall (How many of the relevant fish did I get?) and precision (How many fish caught are relevant?) fully apply in this context. In a next step, we turn more directly to the question of sampling Twitter, outlining which methods allow for accessing which practices – or not – and what the role of medium-specific features is. Sampling Twitter Sampling, the selection of subsets from a larger set of elements (the population), has received wide attention especially in the context of empirical sociology (Uprichard; Noy; Bryman; Gilbert; Gorard; Krishnaiah and Rao). Whilst there is considerable overlap in sampling practices between quantitative sociology and social media research, some key differences have to be outlined: first, social media data, such as tweets, generally pre-exist their collection rather than having to be produced through surveys; secondly, they come in formats specific to platforms, with analytical features, such as counts, already built into them (Marres and Weltevrede); and third, social media assemble very large populations, yet selections are rarely related to full datasets or grounded in baseline data as most approaches follow a case study design (Rieder). There is a long history to sampling in the social sciences (Krishnaiah and Rao), dating back to at least the 19th century. 
Put briefly, modern sampling approaches can be distinguished into probability techniques, emphasising the representative relation between the entire population and the selected sample, and non-probability techniques, where inference on the full population is problematic (Gilbert). In the first group, samples can either be based on a fully random selection of cases or be stratified or cluster-based, where units are randomly selected from a proportional grid of known subgroups of a population. Non-probability samples, on the contrary, can be representative of the larger population, but rarely are. Techniques include accidental or convenience sampling (Gorard), based on ease of access to certain cases. Purposive non-probability sampling however, draws on expert sample demarcation, on quota, case-based or snowball sampling techniques – determining the sample via a priori knowledge of the population rather than strict representational relations. Whilst the relation between sample and population, as well as access to such populations (Gorard) is central to all social research, social media platforms bring to the reflection of how samples can function as “knowable objects of knowledge” (Uprichard 2) the role of medium-specific features, such as built-in markers or particular forms of data access. Ideally, when researching Twitter, we would have access to a full sample, the subject and phantasy of many big data debates (boyd and Crawford; Savage and Burrows), which in practice is often limited to platform owners. Also, growing amounts of daily tweets, currently figuring around 450 million (Farber), require specific logistic efforts, as a project by Cha et al. indicates: to access the tweets of 55 million user accounts, 58 servers to collect a total amount of 1.7 billion tweets (Cha et al.). Full samples are particularly interesting in the case of exploratory data analysis (Tukey) where research questions are not set before sampling occurs, but emerge in engagement with the data. The majority of sampling approaches on Twitter, however, follow a non-probabilistic, non-representative route, delineating their samples based on features specific to the platform. The most common Twitter sampling technique is topic-based sampling that selects tweets via hashtags or search queries, collected through API calls (Bruns and Stieglitz, Burgees and Bruns; Huang, Thornton, and Efthimiadis) Such sampling techniques rest on the idea that content will group around the shared use of hashtags or topical words. Here, hashtags are studied with an interest in the emergence and evolution of topical concerns (Burgees and Bruns), to explore brand communication (Stieglitz and Krüger), during public unrest and events (Vis), but also to account for the multiplicity of hashtag use practices (Bruns and Stieglitz). The approach lends itself to address issue emergence and composition, but also draws attention to medium-specific use practices of hashtags. Snowball sampling, an extension of topic-based sampling, builds on predefined lists of user accounts as starting points (Rieder), often defined by experts, manual collections or existing lists, which are then extended through “snowballing” or triangulation, often via medium-specific relations such as following. Snowball sampling is used to explore national spheres (Rieder), topic- or activity-based user groups (Paßmann, Boeschoten, and Schäfer), cultural specificity (Garcia-Gavilanes, Quercia, and Jaimes) or dissemination of content (Krishnamurthy, Gill, and Arlitt). 
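As a rough illustration of the snowball logic sketched above, the following Python fragment expands a seed list of accounts through following relations; get_following() is a hypothetical callable standing in for whatever API client or local database is available, and the fragment is a sketch rather than a prescription:

from collections import deque

def snowball(seeds, get_following, max_depth=2):
    """Breadth-first expansion of a seed list via following relations.

    get_following(user) is a hypothetical callable returning the accounts
    a given user follows (e.g. backed by an API client or a local database).
    """
    seen = set(seeds)
    queue = deque((user, 0) for user in seeds)
    while queue:
        user, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for followed in get_following(user):
            if followed not in seen:
                seen.add(followed)
                queue.append((followed, depth + 1))
    return seen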
Recent attempts to combine random sampling and graph techniques (Papagelis, Das, and Koudas) to throw wider nets while containing technical requirements are promising, but conceptually daunting. Marker-based sampling uses medium-specific metadata to create collections based on shared language, location, Twitter client, nationality or other elements provided in user profiles (Rieder). This sampling method can be deployed to study the language or location specific use of Twitter. However, an increasing amount of studies develop their own techniques to detect languages (Hong, Convertino, and Chi). Non-probability selection techniques, topic-, marker-, and basic graph-based sampling struggle with representativeness (Are my results generalisable?), exhaustiveness (Did I capture all the relevant units?), cleanness (How many irrelevant units did I capture?), and scoping (How “big” is my set compared to others?), which does – of course – not invalidate results. It does, however, raise questions about the generality of derived claims, as case-based approaches only allow for sense-making from inside the sample and not in relation to the entire population of tweets. Each of these techniques also implies commitments to a priori conceptualisations of Twitter practices: snowball sampling presupposes coherent network topologies, marker-based sampling has to place a lot of faith in Twitter’s capacity to identify language or location, and topic-based samples consider words or hashtags to be sufficient identifiers for issues. Further, specific sampling techniques allow for studying issue or medium dynamics, and provide insights to the negotiation of topical concerns versus the specific use practices and medium operations on the platform. Following our interest in relations between sample, population and medium-specificity, we therefore turn to random sampling, and ask whether it allows to engage Twitter without commitments – or maybe different commitments? – to particular a priori conceptualisations of practices. Rather than framing the relation between this and other sampling techniques in oppositional terms, we explore in what way it might serve as baseline foil, investigating the possibilities for relating non-probability samples to the entire population, thereby embedding them in a “big picture” view that provides context and a potential for inductive reasoning and exploration. As we ground our arguments in the analysis of a concrete random sample, our approach can be considered experimental. Random Sampling with the Streaming API While much of the developer API features Twitter provides are “standard fare”, enabling third party applications to offer different interfaces to the platform, the so-called Streaming API is unconventional in at least two ways. First, instead of using the common query-response logic that characterises most REST-type implementations, the Streaming API requires a persistent connection with Twitter’s server, where tweets are then pushed in near real-time to the connecting client. Second, in addition to being able to “listen” to specific keywords or usernames, the logic of the stream allows Twitter to offer a form of data access that is circumscribed in quantitative terms rather than focussed on particular entities. The so called statuses/firehose endpoint provides the full stream of tweets to selected clients; the statuses/sample endpoint, however, “returns a small random sample of all public statuses” with a size of one percent of the full stream. 
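A minimal sketch of what listening to the statuses/sample endpoint looked like in practice is given below; it assumes the (since retired) v1.1 Streaming API, the requests and requests_oauthlib libraries, and placeholder credentials, and is illustrative of the persistent-connection logic rather than a reproduction of any particular capture tool:

import json
import requests
from requests_oauthlib import OAuth1

# Placeholder credentials; the v1.1 Streaming API described here has since been
# retired, so this sketch only illustrates the logic of a persistent connection.
auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")

# statuses/sample returned roughly one percent of all public tweets as a continuous stream.
url = "https://stream.twitter.com/1.1/statuses/sample.json"

with requests.get(url, auth=auth, stream=True) as response:
    for line in response.iter_lines():
        if not line:
            continue  # skip keep-alive newlines
        tweet = json.loads(line)
        # store or process the tweet here, e.g. write it to a database
        print(tweet.get("id_str"), tweet.get("text", "")[:80])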
(In a forum post, Twitter’s senior partner engineer, Taylor Singletary, states: “The sample stream is a random sample of 1% of the tweets being issues [sic] publicly.”) If we estimate a daily tweet volume of 450 million tweets (Farber), this would mean that, in terms of standard sampling theory, the 1% endpoint would provide a representative and high resolution sample with a maximum margin of error of 0.06 at a confidence level of 99%, making the study of even relatively small subpopulations within that sample a realistic option. While we share the general prudence of boyd and Crawford when it comes to the validity of this sample stream, a technical analysis of the Streaming API indicates that some of their caveats are unfounded: because tweets appear in near real-time in the queue (our tests show that tweets are delivered via the API approx. 2 seconds after they are sent), it is clear that the system does not pull only “the first few thousand tweets per hour” (boyd and Crawford 669); because the sample is most likely a simple filter on the statuses/firehose endpoint, it would be technically impractical to include only “tweets from a particular segment of the network graph” (ibid.). Yet, without access to the complete stream, it is difficult to fully assess the selection bias of the different APIs (González-Bailón, Wang, and Rivero). A series of tests in which we compared the sample to the full output of high volume bot accounts can serve as an indicator: in particular, we looked into the activity of SportsAB, Favstar_Bot, and TwBirthday, the three most active accounts in our sample (respectively 38, 28, and 27 tweets captured). Although Twitter communicates a limit of 1000 tweets per day and account, we found that these bots consistently post over 2500 messages in a 24 hour period. SportsAB attempts to post 757 tweets every three hours, but runs into some limit every now and then. For every successful peak, we captured between five and eight messages, which indicates a pattern consistent with a random selection procedure. While more testing is needed, various elements indicate that the statuses/sample endpoint provides data that are indeed representative of all public tweets. Using the soon to be open-sourced Digital Methods Initiative Twitter Capture and Analysis Toolset (DMI-TCAT) we set out to test the method and the insights that could be derived from it by capturing 24 hours of Twitter activity, starting on 23 Jan. 2013 at 7 p.m. (GMT). We captured 4,376,230 tweets, sent from 3,370,796 accounts, at an average rate of 50.65 tweets per second, leading to about 1.3GB of uncompressed and unindexed MySQL tables. While a truly robust approach would require a longer period of data capture, our main goal – to investigate how the Streaming API can function as a “big picture” view of Twitter and as baseline for other sampling methods – led us to limit ourselves to a manageable corpus. We do not propose our 24-hour dataset to function as a baseline in itself, but to open up reflections about representative metrics and the possibilities of baseline sampling in general. By making our scripts public, we hope to facilitate the creation of (background) samples for other research projects. (DMI-TCAT is developed by Erik Borra and Bernhard Rieder. The stream capture scripts are already available at https://github.com/bernorieder/twitterstreamcapture.) 
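The sampling-theory figure quoted above can be checked with a few lines of arithmetic; the sketch below assumes the worst-case proportion p = 0.5 and reads the 0.06 as percentage points:

import math

daily_tweets = 450_000_000          # estimated population (tweets per day)
sample_size = daily_tweets * 0.01   # one percent sample
z = 2.576                           # z-score for a 99% confidence level
p = 0.5                             # worst-case proportion

margin = z * math.sqrt(p * (1 - p) / sample_size)
print(f"margin of error: {margin:.4%}")  # roughly 0.06 percentage points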
A Day of Twitter

Exploring how the Twitter one percent sample can provide us with a contrast foil against other collection techniques, we suggest that it might allow to create relations between entire populations, samples and medium-specific features in different ways; as illustration, we explore four of them.

a) Tweet Practices Baseline: Figure 1 shows the temporal baseline, giving indications for the pace and intensity of activity during the day. The temporal pattern features a substantial dip in activity, which corresponds with the fact that around 60% of all tweets have English language settings, which might indicate sleeping time for English-speaking users.

Figure 1: temporal patterns

Exploring the composition of users, the sample shows how “communicative” Twitter is; the 3,370,796 unique users we captured mentioned (all “@username” variants) 2,034,688 user accounts. Compared to the random sample of tweets retrieved by boyd et al. in 2009, our sample shows differences in use practices (boyd, Golder, and Lotan): while the number of tweets with hashtags is significantly higher (yet small in relation to all tweets), the frequency of URL use is lower. While these averages gloss over significant variations in use patterns between subgroups and languages (Poblete et al.), they do provide a baseline to relate to when working with a case-based collection.

Tweets containing              boyd et al. 2010    our findings
a hashtag                      5%                  13.18%
a URL                          22%                 11.7%
an @user mention               36%                 57.2%
tweets beginning with @user    86%                 46.8%

Table 1: Comparison between boyd et al. and our findings

b) Hashtag Qualification: Hashtags have been a focus of Twitter research, but reports on their use vary. In our sample, 576,628 tweets (13.18%) contained 844,602 occurrences of 227,029 unique hashtags. Following the typical power law distribution, only 25.8% appeared more than once and only 0.7% (1,684) more than 50 times. These numbers are interesting for characterising Twitter as a platform, but can also be useful for situating individual cases against a quantitative baseline. In their hashtag metrics, Bruns and Stieglitz suggest a categorisation derived from a priori discussions of specific use cases and case comparison in literature (Bruns and Stieglitz). The random sample, however, allows for alternative, a posteriori qualifying metrics, based on emergent topic clusters, co-appearance and proximity measures. Beyond purely statistical approaches, co-word analysis (Callon et al.) opens up a series of perspectives for characterising hashtags in terms of how they appear together with others. Based on the basic principle that hashtags mentioned in the same tweet can be considered connected, networks of hashtags can be established via graph analysis and visualisation techniques – in our case with the help of Gephi. Our sample shows a high level of connectivity between hashtags: 33.8% of all unique hashtags are connected in a giant component with an average degree (number of connections) of 6.9, a diameter (longest distance between nodes) of 15, and an average path length between nodes of 12.7. When considering the 10,197 hashtags that are connected to at least 10 others, the network becomes much denser, though: the diameter shrinks to 9 and the average path length of 3.2 indicates a “small world” of closely related topic spaces.
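The co-occurrence logic described above (hashtags appearing in the same tweet are connected) can also be sketched outside Gephi; the following example uses networkx on a hypothetical list of tweets and computes the kind of giant-component and average-degree figures reported in the text:

import itertools
import networkx as nx

# Hypothetical input: each tweet represented by the set of hashtags it contains.
tweets = [
    {"love", "me"},
    {"teamfollowback", "rt", "followback"},
    {"love", "instagram"},
]

graph = nx.Graph()
for hashtags in tweets:
    for a, b in itertools.combinations(sorted(hashtags), 2):
        # Hashtags mentioned in the same tweet are considered connected;
        # edge weights count how often a pair co-occurs.
        if graph.has_edge(a, b):
            graph[a][b]["weight"] += 1
        else:
            graph.add_edge(a, b, weight=1)

# Size of the largest connected component and average degree.
giant = max(nx.connected_components(graph), key=len)
avg_degree = sum(dict(graph.degree()).values()) / graph.number_of_nodes()
print(len(giant), avg_degree)
# nx.write_gexf(graph, "hashtags.gexf")  # export for further exploration in Gephi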
Looking at how hashtags relate to this connected component, we detect that out of the 1,684 hashtags with a frequency higher than 50, 96.6% are part of it, while the remaining 3.4% are spam hashtags that are deployed by a single account only. In what follows, we focus on the 1,627 hashtags that are part of the giant component. Figure 2: Co-occurrence map of hashtags (spatialisation: Force Atlas 2; size: frequency of occurrence; colour: communities detected by modularity) As shown in Figure 2, the resulting network allows us to identify topic clusters with the help of “community” detection techniques such as the Gephi modularity algorithm. While there are clearly identifiable topic clusters, such as a dense, high frequency cluster dedicated to following in turquoise (#teamfollowback, #rt, #followback and #sougofollow), a cluster concerning Arab countries in brown or a pornography cluster in bright red, there is a large, diffuse zone in green that one could perhaps most fittingly describe as “everyday life” on Twitter, where food, birthdays, funny images, rants, and passion can coexist. This zone – the term cluster suggesting too much coherence – is pierced by celebrity excitement (#arianarikkumacontest) or moments of social banter (#thingsidowhenigetbored, #calloutsomeonebeautiful) leading to high tweet volumes. Figures 3 and 4 attempt to show how one can use network metrics to qualify – or even classify – hashtags based on how they connect to others. A simple metric such as a node’s degree, i.e. its number of connections, allows us to distinguish between “combination” hashtags that are not topic-bound (#love, #me, #lol, #instagram, the various “follow” hashtags) and more specific topic markers (#arianarikkumacontest, #thingsidowhenigetbored, #calloutsomeonebeautiful, #sosargentinosi). Figure 3: Co-occurrence map of hashtags (spatialisation: Force Atlas 2; size: frequency of occurrence; colour (from blue to yellow to red): degree)Figure 4: Hashtag co-occurrence in relation to frequency Another metric, which we call “user diversity”, can be derived by dividing the number of unique users of a hashtag by the number of tweets it appears in, normalised to a percentage value. A score of 100 means that no user has used the hashtag twice, while a score of 1 indicates that the hashtag in question has been used by a single account. As Figures 5 and 6 show, this allows us to distinguish hashtags that have a “shoutout” character (#thingsidowhenigetbored, #calloutsomeonebeautiful, #love) from terms that become more “insisting”, moving closer to becoming spam. Figure 5: Co-occurrence map of hashtags (spatialisation: Force Atlas 2; size: frequency of occurrence; colour (from blue to yellow to red): user diversity) Figure 6: Hashtag user diversity in relation to frequency All of these techniques, beyond leading to findings in themselves, can be considered as a useful backdrop for other sampling methods. Keyword- or hashtag-based sampling is often marred by the question of whether the “right” queries have been chosen; here, co-hashtag analysis can easily find further related terms – the same analysis is possible for keywords also, albeit with a much higher cost in computational resources. c) Linked Sources: Only 11% of all tweets contained URLs, and our findings show a power-law distribution of linked sources. The highly shared domains indicate that Twitter is indeed a predominantly “social” space, with a high presence of major social media, photo-sharing (Instagram and Twitpic) and Q&A platforms (ask.fm). 
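The “user diversity” metric introduced above, unique users divided by the number of tweets a hashtag appears in and expressed as a percentage, is simple to compute from per-occurrence records; a minimal sketch with invented input:

from collections import defaultdict

# Hypothetical input: (user, hashtag) pairs, one per hashtag occurrence in a tweet.
occurrences = [
    ("alice", "love"), ("bob", "love"), ("alice", "love"),
    ("spambot", "win"), ("spambot", "win"), ("spambot", "win"),
]

tweet_counts = defaultdict(int)
users = defaultdict(set)
for user, hashtag in occurrences:
    tweet_counts[hashtag] += 1
    users[hashtag].add(user)

# User diversity: unique users divided by number of tweets, as a percentage.
# 100 means no account used the hashtag twice; low values indicate a hashtag
# driven by few accounts posting repeatedly, moving closer to spam.
for hashtag in tweet_counts:
    diversity = 100 * len(users[hashtag]) / tweet_counts[hashtag]
    print(hashtag, round(diversity, 1))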
News sources, indicated in red in figure 7, come with little presence – although we acknowledge that this might be subject to daily variation. Figure 7: Most mentioned URLs by domain, news organisations in red d) Access Points: Previously, the increase of daily tweets has been linked to the growing importance of mobile devices (Farber), and relatedly, the sample shows a proliferation of access points. They follow a long-tail distribution: while there are 18,248 unique sources (including tweet buttons), 85.7% of all tweets are sent by the 15 dominant applications. Figure 8 shows that the Web is still the most common access point, closely followed by the iPhone. About 51.7% of all tweets were sent from four mobile platforms (iPhone, Android, Blackberry, and Twitter’s mobile Web page), confirming the importance of mobile devices. This finding also highlights the variety and complexity of the contexts that Twitter practices are embedded in. Figure 8: Twitter access points Conclusion Engaging with the one percent Twitter sample allows us to draw three conclusions for social media mining. First, thinking of sampling as the making of “knowable objects of knowledge” (Uprichard 2), it entails bringing data points into different relations with each other. Just as Mackenzie contends in relation to databases that it is not the individual data points that matter but the relations that can be created between them (Mackenzie), sampling involves such bringing into relation of medium-specific objects and activities. Small data collection techniques based on queries, hashtags, users or markers, however, do not relate to the whole population, but are defined by internal and comparative relations, whilst random samples are based on the relation between the sample and the full dataset. Second, thinking sampling as assembly, as relation-making between parts, wholes and the medium thus allows research to adjust its focus on either issue or medium dynamics. Small sample research, we suggested, comes with an investment into specific use scenarios and the subsequent validity of how the collection criteria themselves are grounded in medium specificity. The properties of a “relevant” collection strategy can be found in the extent to which use practices align with and can be utilised to create the collection. Conversely, a mismatch between medium-specific use practices and sample purposes may result in skewed findings. We thus suggest that sampling should not only attend to the internal relations between data points within collections, but also to the relation between the collection and a baseline. Third, in the absence of access to a full sample, we propose that the random sample provided through the Streaming API can serve as baseline for case approaches in principle. The experimental study discussed in our paper enabled the establishment of a starting point for future long-term data collection from which such baselines can be developed. It would allow to ground a priori assumptions intrinsic to small data collection design in medium-specificity and user practices, determining the relative importance of hashtags, URLs, @user mentions. Although requiring more detailed specification, such accounts of internal composition, co-occurrence or proximity of hashtags and keywords may provide foundations to situate case-samples, to adjust and specify queries or to approach hashtags as parts of wider issue ecologies. To facilitate this process logistically, we have made our scripts freely available. 
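The domain and access-point tallies reported in sections (c) and (d) above can be derived from tweet metadata with standard-library tools; the sketch below assumes tweets in the historical v1.1 JSON format and simplifies the source field, which in raw data is an HTML snippet that needs stripping:

from collections import Counter
from urllib.parse import urlparse

# Hypothetical list of tweets in the (historical) v1.1 JSON format; the source
# field is simplified here for readability.
tweets = [
    {"source": "Twitter for iPhone", "entities": {"urls": [{"expanded_url": "https://instagram.com/p/abc"}]}},
    {"source": "web", "entities": {"urls": []}},
]

domains = Counter()
sources = Counter()
for tweet in tweets:
    sources[tweet.get("source", "unknown")] += 1
    for url in tweet.get("entities", {}).get("urls", []):
        domains[urlparse(url["expanded_url"]).netloc] += 1

print(domains.most_common(10))
print(sources.most_common(10))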
We thus suggest that sampling should not only attend to the internal or comparative relations, but, if possible, to the entire population – captured in the baseline – so that medium-specificity is reflected both in specific sampling techniques and the relative relevance of practices within the platform itself. Acknowledgements This project has been initiated in a Digital Methods Winter School project called “One Percent of Twitter” and we would like to thank our project members Esther Weltevrede, Julian Ausserhofer, Liliana Bounegru, Guilio Fagolini, Nicholas Makhortykh, and Lonneke van der Velden. Further gratitude goes to Erik Borra for his useful feedback and work on the DMI-TCAT. Finally, we would like to thank our reviewers for their constructive comments. References boyd, danah, and Kate Crawford. “Critical Questions for Big Data.” Information, Communication & Society 15.5 (2012): 662–679. ———, Scott Golder, and Gilad Lotan. “Tweet, Tweet, Retweet: Conversational Aspects of Retweeting on Twitter.” 2010 43rd Hawaii International Conference on System Sciences. IEEE, (2010). 1–10. Bruns, Axel, and Stefan Stieglitz. “Quantitative Approaches to Comparing Communication Patterns on Twitter.” Journal of Technology in Human Services 30.3-4 (2012): 160–185. Bryman, Alan. Social Research Methods. Oxford University Press, (2012). Burgess, Jean, and Axel Bruns. “Twitter Archives and the Challenges of ‘Big Social Data’ for Media and Communication Research.” M/C Journal 15.5 (2012). 21 Apr. 2013 ‹http://journal.media-culture.org.au/index.php/mcjournal/article/viewArticle/561›. Callon, Michel, et al. “From Translations to Problematic Networks: An Introduction to Co-word Analysis.” Social Science Information 22.2 (1983): 191–235. Cha, Meeyoung, et al. “Measuring User Influence in Twitter: The Million Follower Fallacy.” ICWSM ’10: Proceedings of the International AAAI Conference on Weblogs and Social Media. (2010). Farber, Dan. “Twitter Hits 400 Million Tweets per Day, Mostly Mobile.” cnet. (2012). 25 Feb. 2013 ‹http://news.cnet.com/8301-1023_3-57448388-93/twitter-hits-400-million-tweets-per-day-mostly-mobile/›. Garcia-Gavilanes, Ruth, Daniele Quercia, and Alejandro Jaimes. “Cultural Dimensions in Twitter: Time, Individualism and Power.” (2006). 25 Feb. 2013 ‹http://www.ruthygarcia.com/papers/cikm2011.pdf›. Gilbert, Nigel. Researching Social Life. Sage, 2008. Gillespie, Tarleton. “The Politics of ‘Platforms’.” New Media & Society 12.3 (2010): 347–364. González-Bailón, Sandra, Ning Wang, and Alejandro Rivero. “Assessing the Bias in Communication Networks Sampled from Twitter.” 2012. 3 Mar. 2013 ‹http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2185134›. Gorard, Stephan. Quantitative Methods in Social Science. London: Continuum, 2003. Hayles, N. Katherine. My Mother Was a Computer: Digital Subjects and Literary Texts. Chicago: University of Chicago Press, 2005. Hong, Lichan, Gregorio Convertino, and Ed H Chi. “Language Matters in Twitter : A Large Scale Study Characterizing the Top Languages in Twitter Characterizing Differences Across Languages Including URLs and Hashtags.” Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (2011): 518–521. Huang, Jeff, Katherine M. Thornton, and Efthimis N. Efthimiadis. “Conversational Tagging in Twitter.” Proceedings of the 21st ACM Conference on Hypertext and Hypermedia – HT ’10 (2010): 173. Krishnamurthy, Balachander, Phillipa Gill, and Martin Arlitt. 
“A Few Chirps about Twitter.” Proceedings of the First Workshop on Online Social Networks – WOSP ’08. New York: ACM Press, 2008. 19. Krishnaiah, P R, and C.R. Rao. Handbook of Statistics. Amsterdam: Elsevier Science Publishers, 1987. Langlois, Ganaele, et al. “Mapping Commercial Web 2 . 0 Worlds: Towards a New Critical Ontogenesis.” Fibreculture 14 (2009): 1–14. Mackenzie, Adrian. “More Parts than Elements: How Databases Multiply.” Environment and Planning D: Society and Space 30.2 (2012): 335 – 350. Marres, Noortje, and Esther Weltevrede. “Scraping the Social? Issues in Real-time Social Research.” Journal of Cultural Economy (2012): 1–52. Marwick, Alice, and danah boyd. “To See and Be Seen: Celebrity Practice on Twitter.” Convergence: The International Journal of Research into New Media Technologies 17.2 (2011): 139–158. Montfort, Nick, and Ian Bogost. Racing the Beam: The Atari Video Computer System. MIT Press, 2009. Noy, Chaim. “Sampling Knowledge: The Hermeneutics of Snowball Sampling in Qualitative Research.” International Journal of Social Research Methodology 11.4 (2008): 327–344. Papagelis, Manos, Gautam Das, and Nick Koudas. “Sampling Online Social Networks.” IEEE Transactions on Knowledge and Data Engineering 25.3 (2013): 662–676. Paßmann, Johannes, Thomas Boeschoten, and Mirko Tobias Schäfer. “The Gift of the Gab. Retweet Cartels and Gift Economies on Twitter.” Twitter and Society. Eds. Katrin Weller et al. New York: Peter Lang, 2013. Poblete, Barbara, et al. “Do All Birds Tweet the Same? Characterizing Twitter around the World Categories and Subject Descriptors.” 20th ACM Conference on Information and Knowledge Management, CIKM 2011, ACM, Glasgow, United Kingdom. 2011. 1025–1030. Rieder, Bernhard. “The Refraction Chamber: Twitter as Sphere and Network.” First Monday 11 (5 Nov. 2012). Rogers, Richard. The End of the Virtual – Digital Methods. Amsterdam: Amsterdam University Press, 2009. Savage, Mike, and Roger Burrows. “The Coming Crisis of Empirical Sociology.” Sociology 41.5 (2007): 885–899. Stalder, Felix. “Between Democracy and Spectacle: The Front-End and Back-End of the Social Web.” The Social Media Reader. Ed. Michael Mandiberg. New York: New York University Press, 2012. 242–256. Stieglitz, Stefan, and Nina Krüger. “Analysis of Sentiments in Corporate Twitter Communication – A Case Study on an Issue of Toyota.” ACIS 2011 Proceedings. (2011). Paper 29. Tumasjan, A., et al. “Election Forecasts with Twitter: How 140 Characters Reflect the Political Landscape.” Social Science Computer Review 29.4 (2010): 402–418. Tukey, John Wilder. Exploratory Data Analysis. New York: Addison-Wesley, 1977. Uprichard, Emma. “Sampling: Bridging Probability and Non-Probability Designs.” International Journal of Social Research Methodology 16.1 (2011): 1–11.
Article
During the course of several natural disasters in recent years, Twitter has been found to play an important role as an additional medium for many-to-many crisis communication. Emergency services are successfully using Twitter to inform the public about current developments, and are increasingly also attempting to source first-hand situational information from Twitter feeds (such as relevant hashtags). The further study of the uses of Twitter during natural disasters relies on the development of flexible and reliable research infrastructure for tracking and analysing Twitter feeds at scale and in close to real time, however. This article outlines two approaches to the development of such infrastructure: one which builds on the readily available open source platform yourTwapperkeeper to provide a low-cost, simple, and basic solution; and, one which establishes a more powerful and flexible framework by drawing on highly scaleable, state-of-the-art technology.
Article
This paper contributes to debates about the implications of digital technology for social research by proposing the concept of the redistribution of methods. In the context of digitization, I argue, social research becomes noticeably a distributed accomplishment: online platforms, users, devices and informational practices actively contribute to the performance of digital social research. This also applies more specifically to social research methods, and this paper explores the phenomenon in relation to two specific digital methods, online network and textual analysis, arguing that sociological research stands much to gain from engaging with their distribution, both normatively and analytically speaking. I distinguish four predominant views on the redistribution of digital social methods: methods-as-usual, big methods, virtual methods and digital methods. Taking up this last notion, I propose that a redistributive understanding of social research opens up a new approach to the re-mediation of social methods in digital environments. I develop this argument through a discussion of two particular online research platforms: the Issue Crawler, a web-based platform for hyperlink analysis, and the Co-Word Machine, an online tool of textual analysis currently under development. Both these tools re-mediate existing social methods, and both, I argue, involve the attempt to render specific methodology critiques effective in the online realm, namely critiques of the authority effects implicit in citation analysis. As such, these methods offer ways for social research to intervene critically in digital social research, and more specifically, to endorse and actively pursue the re-distribution of social methods online.