Programmed method: developing a toolset for capturing and analyzing tweets
Erik Borra and Bernhard Rieder
Department of Media Studies, University of Amsterdam,
Amsterdam, The Netherlands
Purpose – The purpose of this paper is to introduce the Digital Methods Initiative Twitter Capture
and Analysis Toolset (DMI-TCAT), a toolset for capturing and analyzing Twitter data. Instead of just presenting a
technical paper detailing the system, however, the authors argue that the type of data used for, as well
as the methods encoded in, computational systems have epistemological repercussions for research.
The authors thus aim at situating the development of the toolset in relation to methodological debates
in the social sciences and humanities.
Design/methodology/approach – The authors review the possibilities and limitations of existing
approaches to capture and analyze Twitter data in order to address the various ways in which
computational systems frame research. The authors then introduce the open-source toolset and put
forward an approach that embraces methodological diversity and epistemological plurality.
Findings – The authors find that design decisions and more general methodological reasoning can
and should go hand in hand when building tools for computational social science or digital humanities.
Practical implications – Besides methodological transparency, the software provides robust and
reproducible data capture and analysis, and interlinks with existing analytical software. Epistemic
plurality is emphasized by taking into account how Twitter structures information, by allowing for a
number of different sampling techniques, by enabling a variety of analytical approaches or paradigms,
and by facilitating work at the micro, meso, and macro levels.
Originality/value – The paper opens up critical debate by connecting tool design to fundamental
interrogations of methodology and its repercussions for the production of knowledge. The design of
the software is inspired by exchanges and debates with scholars from a variety of disciplines and
the attempt to propose a flexible and extensible tool that accommodates a wide array of
methodological approaches is directly motivated by the desire to keep computational work open for
various epistemic sensibilities.
Keywords Twitter, Computational social science, Data collection, Analysis, Digital humanities
Paper type Conceptual paper
1. Introduction
The relatively recent flourishing of computer-supported approaches to the study of
social and cultural phenomena – digital methods (Rogers, 2013), computational social
science (Lazer et al., 2009), digital humanities (Kirschenbaum, 2010), each with their
set of significant precursors – has led to an encounter between technology and
methodology that deeply affects the status and practice of research in the social
sciences and humanities. Statistics, modeling, and other formal methods introduced
strong elements of technicality long ago. But the study of very large sets
of highly dynamic data, which, unlike, e.g. surveys, are not explicitly produced for
scientific study, institutes computing as a methodological mediator (Latour, 2005), and
brings along ideas, artifacts, practices, and logistics tied to the technological in a more
far-reaching and radical fashion. Packages like SPSS have enabled, broadened, and
standardized the use of computers and software in social research since the 1960s
(Uprichard et al., 2008). The recent explosion in research employing data analysis
techniques, often focussed on social media and other online phenomena, however,
propels questions of toolmaking – software design, implementation, maintenance,
etc. – into the center of methodological debates and practices.
A number of commentators (Boyd and Crawford, 2012; Rieder and Röhle, 2012;
Puschmann and Burgess, 2014) have called attention to the issues arising from the use
of software to study data extracted from (mostly proprietary) software platforms that
enable and orchestrate expressions and interactions of sometimes hundreds of millions
of users. These issues include the methodological, epistemological, logistical, legal,
ethical, and political dimensions of what is increasingly referred to as “big data”
research. While such critical interrogation is necessary and productive, in this paper
we take a different approach to some of the issues raised, by introducing and
discussing an open-source, freely available data capture, and analysis platform for
the Twitter microblogging service, the Digital Methods Initiative Twitter Capture and
Analysis Toolset (DMI-TCAT). Although we do not envision this research software
to be a “solution” to the many questions at hand, it encapsulates a number of
propositions and commitments that are indeed programmatic beyond the mere
technicalities at hand. A presentation of such a tool cannot leave technical matters
aside, but in this paper we attempt to productively link them to some of the broader
repercussions of software-based research of social and cultural phenomena.
Although Cioffi-Revilla’s assessment that “computational social science is an
instrument-enabled scientific discipline, in this respect scientifically similar to
microbiology, radio astronomy, or nanoscience” (Cioffi-Revilla, 2010, p. 260) needs
to be nuanced, the argument that “it is the instrument of investigation that drives the
development of theory and understanding” (Cioffi-Revilla, 2010) is not easy to dismiss
when looking at research dealing with Twitter. The large number and wide variety of
computational approaches, their status as mostly experimental tools, their application
in disciplines often unaccustomed to computational principles, the pervasiveness of
social media, and the speed of technological change – all these elements require us to
pay much more attention to our instruments than we have been accustomed to. Having
built such an instrument, we feel obliged to go beyond the presentation of architecture
or results and account for the way we think – or hope – that our tool “drives” research
in a more substantial way than solely solving particular technical and logistical
problems. This desire possibly betrays our disciplinary affiliation. Media studies, in
particular in its humanities bent, has long focussed on analyzing technologies
as media, that is, as artifacts or institutions that do not merely transport information,
but, by affecting the scale, speed, form, in short, the character of expression and
interaction, contribute to how societies and cultures assemble, operate, and produce
knowledge. Just as Winner (1980) pointed out that tools have politics too, we consider
a research toolset such as DMI-TCAT to have epistemic orientations that have
repercussions for the production of academic knowledge. Rather than glossing over
them, we want to bring them to the front.
With these elements in mind, our paper proceeds in two distinct steps:
(1) We briefly summarize existing tools and approaches for Twitter analysis, discuss how they relate to academic research, and develop a set of guidelines or principles for our own contribution along the way.
(2) We present the design and architecture of DMI-TCAT, show how it addresses the concerns raised, and detail the analytical possibilities for Twitter research it provides.
Through all of this, the relationship between toolmaking and methodology remains in focus.
2. Existing work
When highlighting the emerging antagonism between “Big Data rich” and “Big Data
poor,” Boyd and Crawford (2012) cite Twitter researcher Jimmy Lin as discouraging
“researchers from pursuing lines of inquiry that internal Twitter researchers could do
better.” This quote echoes – at least if taken out of context – Savage and Burrows’
(2007) diagnosis of a “coming crisis in empirical sociology”: a marginalization of
academic empirical work due to the ever increasing capacity and inclination
of “knowing capitalism” (Thrift, 2005) to collect large amounts of data and to deploy
a variety of methods to analyze them. However, instead of advocating retreat into the
realms of synthesizing theory, they call for “greater reflection on how sociologists can
best relate to the proliferation of social data gathered by others” (Savage and Burrows,
2007, p. 895) and for renewed involvement with the “politics of method” (p. 895) in both
academic and private research. Rather than leaving areas like social media research to
in-house scientists and marketers, we should be “critically engaging with the extensive
data sources which now exist, and not least, campaigning for access to such data where
they are currently private” (p. 896). We could not agree more with this assessment
and would like to emphasize that the crisis Savage and Burrows diagnose goes beyond
the question of access to data. The proliferation of actors involved in the analysis of
online data – private and academic, coming from a wide variety of disciplines – has
led to the formation of an epistemological battlefield where different paradigms,
methods, styles, and objectives struggle for interpretive agency, i.e. for the power to
produce (empirical) accounts of the ever expanding online domain. To be clear: the
various technical, legal, logistical, and even ethical stumbling blocks for data analysis,
and the ways in which the various actors are able or decide to react to them, have very
real consequences for the actual knowledge produced and circulated.
In order to situate our own contribution and to develop a number of guiding
principles, we need to provide a short overview of existing strategies in Twitter
research and discuss their limitations. The rough groupings we make, which revolve
around logistical questions, precede concrete research designs and instead define
a particular methodological space in which such concrete designs are then formulated.
Twitter’s in-house research projects, or projects with cooperation agreements, have
direct access to the full Twitter archive and are in the luxurious position to not have to
worry about access to data, data completeness, or technical limitations – although legal
and ethical considerations linger. At the same time, they are utterly dependent on the
good will of the company. Their academic independence in terms of subject focus is
doubtful, the tools and techniques used are often proprietary and can thus not be
scrutinized, and only few projects will actually be selected in the first place.
Projects acquiring data through resellers such as DataSift or Gnip also gain access
to the full archive of tweets and their metadata. Cost, however, is the main limiting
factor to this approach: the pricey subscription models of those services are out of reach
for small- and mid-sized research groups. As Twitter donated its entire archive to the
US Library of Congress, a viable and cheap alternative may become available in
the future. Any project working with data sourced from these archives, however,
will have to rely on custom programming for analysis.
This brings us to online analytics platforms, which provide simple interfaces for
both data acquisition and analysis, and are oriented toward either academic
(e.g. DiscoverText, Truthy) or commercial research (e.g. Topsy, Twitonomy,
Hootsuite). Those that interface with data resellers, such as DiscoverText, are again
costly, but services collecting data through Twitter’s Application Programming
Interfaces (APIs) are often available for free or at reduced cost. These platforms and
their dashboard-like interfaces can be very practical and useful, but are generally
limited in terms of their analytical capacities, cannot be easily extended, and allow for
little or no data export to stand-alone analytics software. Most problematic is the fact
that they blackbox a large part of the research chain and generally follow a particular
paradigmatic orientation. We do not think that commercial platforms should be
dismissed outright, but it is clear that they are mainly focussing on the requirements
of marketing professionals, emphasizing lists of “top” or “influential” users and content
items. More academic platforms equally subscribe to specific paradigmatic approaches
coming with prior assumptions about both data, e.g. Truthy considering spam as noise
(McKelvey and Menczer, 2013), and method, e.g. DiscoverText focussing on the
classification of tweets and Truthy on information diffusion. As such, researchers
are restricted to their premises and analytical techniques.
If all that is needed is a set of tweets matching certain keywords, e.g. all the tweets
containing a hashtag for a specific event, ad hoc or project-based custom capturing
tools such as ScraperWiki, Google spreadsheets, or streamR are commonly used.
Just like custom programming, this approach affords flexibility, transparency,
and control, but results may be difficult to verify or reproduce, bugs can occur, and
significant technical skill needs to be acquired.
Two well-known examples of open-source capturing software, an approach that
retains transparency and (some) flexibility and control while reducing the need for
technical expertise, are 140kit and TwapperKeeper. Both started out as public online
services to capture, export, and – in the case of 140kit – analyze tweets, but had to close
down when Twitter changed its terms of service in 2011. The source code for both
projects was published online and yourTwapperKeeper (yTK) in particular has
been used by many humanities and social science scholars to capture tweets (see, e.g.
Bruns and Liang, 2012). To facilitate and standardize research with yTK, which comes
without built-in analytics, Bruns and Burgess (2012) published a set of useful GAWK
scripts and we therefore initially used yTK to capture tweets. But the less technically
inclined humanities and social science scholars we often work with found the GAWK
scripts too difficult to handle. Our attempt to build a simpler analytics platform on top
of yTK proved difficult: its database structure is not designed for fast analysis
and omits many fields returned by the API; its codebase is not updated on a regular
basis; data are not stored as UTF-8 and languages using non-Latin character sets thus
cannot be analyzed. Finally, we not only wanted to capture and analyze keyword based
samples of tweets but also user timelines, 1 percent samples, follower networks, and
other types of data available through Twitter’s API.
Reviewing the possibilities and limitations of existing tools led us to the decision
to build our own capture and analysis platform from the ground up. It also allowed us
to develop a set of guiding principles that translate into a series of decisions or
commitments on three interrelated levels. Concerning logistics, we attempt to lower the
barrier of entry to Twitter research by providing a freely available platform built
on publicly available data which requires little or no custom programming and scales
to data sets of hundreds of millions of tweets using consumer hardware. Regarding
epistemology, our tool emphasizes epistemic plurality by staying close to the units
defined by the Twitter platform instead of storing aggregates, by allowing for a
number of different sampling techniques, by enabling a variety of analytical
approaches or paradigms, and by facilitating work at the micro, meso, and macro
levels. On the level of methodology, finally, we provide robust and reproducible
data capture and analysis, allow easy import and export of data, interlink with existing
analytics software, and guarantee methodological transparency by publishing the source code.
In the next section, we provide a more detailed description of our system and show
how these guiding principles have been translated into concrete design decisions.
3. DMI-TCAT
In line with the general architecture of DMI-TCAT, this presentation is divided into
data capture, the way data are retrieved, enriched, and stored in a database, and data
analysis, which includes all analytical operations that can be performed on the stored
elements. While these two aspects have been developed in tandem, they are mostly
independent: it is possible to use the toolset to only capture data, e.g. as alternative to
yTK, or to only analyze them, e.g. by importing a data set captured with yTK. Figure 1
provides a basic overview of the system.
[Figure 1. Schema of the general architecture of DMI-TCAT, including capture/import modules such as "import user timeline," "import tweets by id," and "import user network"]
DMI-TCAT is written in PHP and organized around a MySQL database
positioned between the capture and analysis parts of the system. Data are retrieved
by different modules controlled at regular intervals by a supervisor script (using the
cron scheduler present in all Unix-like operating systems), which checks whether
the capturing processes are running and, if necessary, restarts them. A separate script
translates shortened URLs. Database contents are analyzed in a two-stage process: the
selection of a subsample precedes the application of various analytical techniques.
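To illustrate the supervisor pattern just described, consider the following minimal sketch of a cron-driven check; the file names, paths, and the track.php script are hypothetical and do not reproduce DMI-TCAT's actual code.

<?php
// Hypothetical supervisor sketch. A crontab entry runs it every minute:
//   * * * * * php /var/www/dmi-tcat/capture/supervisor.php

$pidfile = '/tmp/tcat-track.pid';

// A PID file plus signal 0 checks whether the capture process is still alive.
$running = file_exists($pidfile) &&
           posix_kill((int) file_get_contents($pidfile), 0);

if (!$running) {
    // Relaunch the tracking script in the background and remember its PID.
    $pid = (int) shell_exec('php track.php > /dev/null 2>&1 & echo $!');
    file_put_contents($pidfile, $pid);
}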
In the following section, the various techniques for data capture are discussed in more detail.
3.1 Data capture
Sampling, i.e. the selection of items from a population of cases or elements, is a central
concern when using online data. Because we are essentially dealing with data stored in
information systems, in this case Twitter’s database, the properties of these systems
determine, to a large extent, selection and retrieval possibilities. More fundamentally,
they imply that the sociality they enable is structured and formalized: platforms like
Twitter define basic entities (tweets, users, lists, hashtags, etc.), their characteristics
(a tweet is no longer than 140 characters, a user account is associated with an image,
etc.), and possible actions (a tweet can be retweeted, a user followed, etc.). The
information system therefore goes a long way in framing what Uprichard calls
“the ontology of the case” (Uprichard, 2013), simply by defining which entities can
appear as a case in the first place and subsequently become part of a sample. As noted
earlier, researchers in media studies have long recognized that media themselves affect
the character of expression and interaction passing through them. While, e.g. Facebook
provides a formal and functional definition of “group,” Twitter does not. That does not
mean that a research design cannot operationalize such a construct by other means,
e.g. by collecting accounts of a predefined group like members of a parliament,
but because the functional characteristics of the Twitter platform mold actual use
practices, the technical structuring of potential units of analysis is highly relevant.
Focussing on Web media, digital methods (Rogers, 2013) thus urge researchers to pay attention to
the way in which digital objects are defined by and processed through online devices.
As data can be captured and stored in different ways, the decisions made on this
level have repercussions for analytical possibilities further down the chain. Hence,
capturing tools already participate in the framing of the empirical as such. Attempting
to facilitate epistemological and methodological diversity, DMI-TCAT closely
follows Twitter’s specific information structures, leaving the “primary” material
untouched, while allowing for plasticity in sample design and easy ways to create
subsamples from captured data sets.
Apart from technical specifications, Twitter also defines and regulates the modes
and scope of access to any data (Puschmann and Burgess, 2014). Legal constraints, API
definitions, rate limits for query calls, whitelisting, and data sharing agreements
are among the many possibilities the company has to design the ways its data can
become part of a research project. Because APIs are designed to enhance Twitter’s
value as a commercial platform by allowing third-party developers to build
applications on top of it, the needs of researchers are not explicitly taken into
account. Toolsets like DMI-TCAT thus repurpose these technical interfaces for
research. The following section shows by which different technical pathways data
enter into the system.
3.1.1 Data acquisition. DMI-TCAT relies on Twitter’s APIs and is therefore bound to
their possibilities and limitations. While we do not require familiarity with these
technical interfaces, we notify users when problems occur, e.g. when rate limits
are exceeded. Our tool connects to Twitter using the tmhOAuth library and
retrieves tweets via both the streaming API and the REST API. We use the former
for three different sampling techniques. First, researchers can capture a “1 percent”
random sample of all tweets passing through Twitter, which can then be used for
macro- and meso-level investigations and for baselining samples retrieved by other
means (Gerlitz and Rieder, 2013). Second, we use the statuses/filter endpoint to
“track” tweets containing specific keywords in real-time, which is probably the most
common way to create a sample of tweets. To give researchers maximum flexibility
and specificity, a collection is defined in a so-called “query bin,” i.e. a list of tracking
criteria consisting of single or multiple keyword queries, hashtags, and specific
phrases. For example, a bin like (globalwarming, “global warming,” #IPCC) would
retrieve all tweets containing one of these three query elements and combine them into
a single data set, stored as a group of related tables in the database. Third, our system
allows for following tweets from a specified set of up to 5,000 users. This is particularly
interesting when studying a set of manually selected accounts, such as members
of a parliament or other expert lists.
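To make the mechanics of such a "query bin" concrete, a minimal capture sketch using the tmhOAuth library and the statuses/filter endpoint could look as follows; the keys, the callback, and the database step are placeholders rather than DMI-TCAT's actual implementation.

<?php
require 'tmhOAuth.php'; // https://github.com/themattharris/tmhOAuth

$tmhOAuth = new tmhOAuth(array(
    'consumer_key'    => 'YOUR_CONSUMER_KEY',
    'consumer_secret' => 'YOUR_CONSUMER_SECRET',
    'user_token'      => 'YOUR_ACCESS_TOKEN',
    'user_secret'     => 'YOUR_ACCESS_SECRET',
));

// A "query bin": any tweet matching one of these criteria joins the data set.
// In the track parameter, commas mean OR while spaces within a phrase mean AND.
$bin = array('globalwarming', 'global warming', '#IPCC');

function handle_tweet($data, $length, $metrics) {
    $tweet = json_decode($data, true);
    if (isset($tweet['text'])) {
        // DMI-TCAT would insert the tweet and its entities into MySQL here.
        echo $tweet['id_str'] . "\t" . $tweet['text'] . "\n";
    }
    return false; // returning anything else closes the stream
}

$tmhOAuth->streaming_request(
    'POST',
    $tmhOAuth->url('1.1/statuses/filter'),
    array('track' => implode(',', $bin)),
    'handle_tweet'
);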
One of the main limits of the streaming API is that it cannot provide historical
tweets. The REST API, however, enables a search for tweets up to about a week old
and, although it explicitly omits an unknown percentage of tweets, allows a data set
to be started retroactively. While this is far from ideal, it
might be the only feasible way to record traces of an unanticipated event. In the same
spirit, we use search to fill gaps in data capture resulting from network outages or
other technical problems. Finally, the REST API allows for retrieving the last 3,200
tweets for each user in a set, providing a level of historicity for user samples, and the
retrieval of follower/followee networks.
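A schematic way to collect that per-user history is to page backwards through statuses/user_timeline using the max_id parameter. The sketch below assumes the $tmhOAuth object from the previous example and an invented account name; it is not DMI-TCAT's exact routine.

<?php
// Hypothetical sketch: walk back through a user's timeline, 200 tweets at a
// time, until the API's ~3,200-tweet horizon is reached.
$tweets = array();
$max_id = null;

do {
    $params = array('screen_name' => 'example_user', 'count' => 200);
    if ($max_id !== null) {
        $params['max_id'] = $max_id;
    }
    $code = $tmhOAuth->request('GET',
        $tmhOAuth->url('1.1/statuses/user_timeline'), $params);
    if ($code != 200) break; // rate limit hit or protected account

    $batch = json_decode($tmhOAuth->response['response'], true);
    if (empty($batch)) break; // no older tweets left

    $tweets = array_merge($tweets, $batch);
    // max_id is inclusive, so subtract one to avoid re-fetching the last tweet
    $last = end($batch);
    $max_id = bcsub($last['id_str'], '1');
} while (count($tweets) < 3200);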
Additionally, DMI-TCAT takes Twitter’s data sharing policy into account, which
allows for sharing of tweet ids but not of messages and metadata themselves.
Our tool is therefore able to reconstruct a data set from a list of ids and can export such
a list as well. Along the same line, we provide import scripts for yTK databases or a set
of Twitter JSON files captured by other means (e.g. streamR).
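A terms-compliant sharing workflow along these lines can be sketched as follows, assuming a PDO connection $db, the $tmhOAuth object from above, and illustrative table and file names.

<?php
// Export: only tweet ids may be shared under Twitter's terms (cf. note 20).
$ids = $db->query('SELECT id FROM globalwarming_tweets')
          ->fetchAll(PDO::FETCH_COLUMN);
file_put_contents('globalwarming_ids.txt', implode("\n", $ids));

// Import ("rehydration"): fetch every tweet again through the REST API.
foreach (file('globalwarming_ids.txt', FILE_IGNORE_NEW_LINES) as $id) {
    $code = $tmhOAuth->request('GET',
        $tmhOAuth->url('1.1/statuses/show/' . $id), array());
    if ($code == 200) {
        $tweet = json_decode($tmhOAuth->response['response'], true);
        // store $tweet; deleted or protected tweets return 404/403 and drop out
    }
}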
Taken together, these possibilities allow for a wide array of sampling techniques, in
line with the principle of methodological flexibility.
3.1.2 Data storage and performance. Following the arguments for paying close
attention to Twitter’s informational structures outlined above, our database layout
mimics the shape of the data returned by the API. This means that tweets and their
metadata, hashtags, URLs, and mentions are stored in separate tables. DMI-TCAT
therefore does not need to extract those entities at analysis time, which makes
querying the database much faster.
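A heavily reduced, hypothetical rendering of such a per-bin layout might look like this; the real tables hold many more fields (cf. note 11), and all names here are illustrative.

<?php
// Entities are split out at capture time, so analysis needs no text parsing.
$db = new PDO('mysql:host=localhost;dbname=tcat;charset=utf8', 'user', 'pass');

$db->exec('CREATE TABLE bin_tweets (
    id BIGINT PRIMARY KEY, created_at DATETIME, from_user_name VARCHAR(255),
    text VARCHAR(255), retweet_id BIGINT, in_reply_to_status_id BIGINT,
    INDEX(created_at), INDEX(from_user_name)
) DEFAULT CHARSET=utf8');

$db->exec('CREATE TABLE bin_hashtags (
    tweet_id BIGINT, text VARCHAR(255), INDEX(tweet_id), INDEX(text)
) DEFAULT CHARSET=utf8');

$db->exec('CREATE TABLE bin_urls (
    tweet_id BIGINT, url TEXT, url_expanded TEXT, domain VARCHAR(255),
    INDEX(tweet_id), INDEX(domain)
) DEFAULT CHARSET=utf8');

$db->exec('CREATE TABLE bin_mentions (
    tweet_id BIGINT, to_user VARCHAR(255), INDEX(tweet_id), INDEX(to_user)
) DEFAULT CHARSET=utf8');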
While there are indeed limits to the amount of data one can store and analyze
without moving into the complicated and costly world of distributed computing,
a well-designed and indexed database structure, combined with optimized database
queries, means that off-the-shelf consumer hardware can handle much larger quantities
of tweets than sometimes argued (Bruns and Liang, 2012). We are currently running
DMI-TCAT on a cheap Linux machine with four processor cores, a 512GB SSD,
and 32GB of RAM, using the default LAMP stack. At the time of writing, we have
captured over 700 million tweets and basic analyses for even the largest query
bins – over 50 million tweets in a single data set – generally complete in under a
minute, allowing for iterative approaches to analysis. More complex forms of analysis,
such as the creation of mention networks, can take several minutes to complete,
however. While we have not systematically evaluated how far our architecture can
scale, it seems safe to say that hundreds of millions of tweets in a single data set should
still be workable, but moving on to the next order of magnitude would certainly require
a fully distributed approach to tools and infrastructure that is beyond the scope of this paper.
3.1.3 Data enrichment. One area where our system strays from simply capturing
and storing data provided by the API is data enrichment. We currently follow two
directions: URL expansion and the addition of Klout scores. First, many URLs passing
through Twitter are shortened, and although Twitter provides the “final” URL for its
own shortening service, this is not the case for third party shortening services such
as bit.ly. However, in keeping with other digital methods tools (Rogers, 2010),
URLs and, in particular, domain names are considered as crucial components of
a tweet’s message and a robust means for actor identification and content qualification.
DMI-TCAT therefore includes a script that follows all URLs to their endpoint,
adds the location to the URL table, and extracts the domain name. Second, we provide
the option to retrieve users’ Klout score, a proprietary metric set in the sociometric
tradition that produces an “influence” rating based on data from eight different
social media platforms. While caution is in order when using proprietary metrics,
Klout scores are commonly used and offer a glimpse into users' activities beyond the Twitter platform itself.
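Returning to the first direction, the URL expansion step itself can be sketched with PHP's cURL bindings; the function name is illustrative and DMI-TCAT's actual script handles more edge cases.

<?php
// Minimal sketch: follow redirects to the final location, extract the domain.
function expand_url($short_url) {
    $ch = curl_init($short_url);
    curl_setopt_array($ch, array(
        CURLOPT_NOBODY         => true,  // HEAD request: we only need headers
        CURLOPT_FOLLOWLOCATION => true,  // chase the redirect chain
        CURLOPT_MAXREDIRS      => 10,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 10,
    ));
    curl_exec($ch);
    $final = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL); // URL after redirects
    curl_close($ch);
    return array('url' => $final, 'domain' => parse_url($final, PHP_URL_HOST));
}

// e.g. expand_url('http://bit.ly/1aBcDeF') might yield
// array('url' => 'http://example.org/article', 'domain' => 'example.org')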
3.2 Data analysis
In contrast to yTK, DMI-TCAT is not limited to capturing data but also provides
analytical techniques to researchers, in a way that strikes a balance between ease
of use and analytical flexibility. We try to enable approaches spanning the “three
major areas of analysis” (Bruns and Liang, 2012) in Twitter research – tweet statistics
and activity metrics, network analysis, and content analysis – but also facilitate
geographical analysis, ethnographic research, and even textual hermeneutics.
Our tool can thus be used in a wide variety of projects, including studies of
everyday conversation, breaking news, crisis communication, political activism
(citizen), journalism, second screen applications, lifestyle and brand communication,
information diffusion, social patterns, ideological frames, sentiment analysis,
prediction, and so forth.
Besides providing a variety of analytical pathways and facilitating the integration
of additional modules, we emphasize epistemic plurality – and lower the cost of
development – by embracing Marres’ (2012) assessment that “social research becomes
noticeably a distributed accomplishment: online platforms, users, devices, and
informational practices actively contribute to the performance of digital social
research” (Marres, 2012, p. 139). Pushing this “redistribution of method” further, we
favor the export of derived data in standard formats, to be analyzed in software
packages chosen by researchers themselves, over interactive interfaces and
ready-made (visual) outputs. This makes the tool less convenient for users without
experience in data analysis, but the recent emergence of tools such as the Gephi
graph visualization and manipulation software, which are at the same time easy to use
and much more powerful than any Web interface, justifies this compromise.
This points to a conundrum that any toolmaker faces: to what extent and in what
way does research software actually shape research practices? How to justify decisions
and how to organize the design process? While there are few firm guidelines, the above
describes a particular balance between methodological flexibility and ease of use that
is the outcome of constant interaction with students of a large MA class on digital
methods, with the participants of multiple workshops, and with several research
projects at the University of Amsterdam relying on DMI-TCAT. To accommodate
this interaction, we rely on an agile software development approach using rapid
prototyping, iterative updating, and a modular architecture. While decisions
necessarily have to be made, they can be shared in a flexible way because toolmaking
and actual research remain tightly coupled. Most of the analytical outputs DMI-TCAT
provides were thus built in response to particular analytical requirements from either
ourselves or from researchers we have been working with. However, the collaborative
setting cannot fully alleviate the fact that toolmakers necessarily intervene deeply in the
epistemic process of methods development, by framing ideas in terms of feasibility, cost,
formalization, and so forth, but also by constantly translating and connecting social
science and humanities concepts to computational techniques.
In the following sections, we will detail the analytical techniques implemented
in DMI-TCAT, starting with sub-sampling from stored data and continuing with
summary presentations of different analytical outputs.
3.2.1 Sub-sampling. DMI-TCAT enables the flexible constitution of a subsample.
After selecting a data set, as defined by a query bin, different techniques to filter the
data set are available.
By sub-selecting from the data set, a user can zoom in on a specific time period or
on tweets matching certain criteria (Figure 2). She can choose to include only those
tweets matching a particular phrase such as a word, hashtag, or mention; she can
exclude tweets matching a specific phrase; finally, she can focus on tweets by
particular users or tweets mentioning a specific (part of a) URL. All input fields accept
multiple phrases or keywords to specify (AND) or expand (OR) the selection via
Boolean queries. After updating the overview, a summary of the selection is generated
(Figure 3). While these filters are far from exhaustive, they allow for both the
constitution of a subsample and what could be described as “interactive probing,” i.e.
the back and forth movement between query and overview that “progresses in an
iterative process of view creation, exploration, and refinement” (Heer and
Shneiderman, 2012). This also echoes Uprichard’s argument that social research
rarely proceeds in linear fashion, but that “cases are ‘made’, both conceptually
and empirically, by constantly and iteratively re-shaping and re-matching theory and
empirical evidence together” (Uprichard, 2013, p. 5).
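The translation of these filter fields into a database query can be sketched as follows; table and column names are illustrative, and production code should prefer prepared statements over string escaping.

<?php
// Hypothetical sketch of the filter logic: comma-separated phrases widen the
// selection (OR), separate fields and exclusions narrow it (AND).
function subsample_sql($contains, $exclude, $from_user, $start, $end) {
    $where = array("created_at BETWEEN '$start' AND '$end'");
    if ($contains !== '') {
        $or = array();
        foreach (explode(',', $contains) as $phrase) {
            $or[] = "text LIKE '%" . addslashes(trim($phrase)) . "%'";
        }
        $where[] = '(' . implode(' OR ', $or) . ')';
    }
    if ($exclude !== '') {
        $where[] = "text NOT LIKE '%" . addslashes(trim($exclude)) . "%'";
    }
    if ($from_user !== '') {
        $where[] = "from_user_name = '" . addslashes($from_user) . "'";
    }
    return 'SELECT * FROM bin_tweets WHERE ' . implode(' AND ', $where);
}

// e.g. all tweets mentioning snowden during July 2013:
echo subsample_sql('snowden', '', '', '2013-07-01 00:00:00', '2013-07-31 23:59:59');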
In addition, sub-sampling can be seen as a means for non-destructive data cleaning
in the sense that tweets matching specific criteria can be excluded without being
deleted. While this is currently implemented in rudimentary fashion only, the question
of data cleaning is crucial for both the reliability of the data and the question
of epistemic plurality, or as Gitelman (2013, p. 5) puts it: "data [...] need to be
understood [...] according to the uses to which they are and can be put"; one researcher's
noise is another one's object of study.
The overview interface (Figure 3) lists the current selection criteria and shows the
number of tweets in the subset, the number of distinct users, and the proportion
of tweets containing URLs. Additionally, a line graph shows the frequency of tweets,
distinct users, distinct locations, and geo-tagged tweets per hour or per day, depending
on the scope of the selection. If a search query is specified, a second line graph indicates
the relationship between the subset and the full dataset. The example in Figure 3 thus
not only shows the shrinking absolute number of tweets mentioning (snowden) in our
“prism” dataset over the month of July 2013, but also the relative decline of tweets
mentioning the whistleblower’s name in relation to the whole data set.
Subsequent analytical techniques apply to the selected subsample, although the full
data set is, indeed, a possible selection. With the exception of several interactive
modules, all analyses are provided as exports in standard tabulated formats or as
network files. Filenames include filters and settings used in the interface so that
researchers know how data have been derived and which software version was used.
The following four sections describe the various types of techniques currently
implemented in our tool and mimic the sections of the actual interface.
[Figure 2. Screenshot of the "data selection" part of the DMI-TCAT interface, here showing the "globalwarming" query bin: 22,211,422 tweets from 2012-11-23 15:53:44 to 2014-02-17 10:08:51, with 732,995,484 tweets archived overall]
[Figure 3. Screenshot of the selection overview of the DMI-TCAT interface for the "prism" query bin (nsa, palanteer, prism, spying, cyberwar, wiretap); date and time are in GMT (London)]
3.2.2 Tweet statistics and activity metrics. The first set of exports covers some of the
basic statistics of the sub-sample one may want to consult. To get a quick
characterization of the types of tweets in the sample, a table is provided with the total
number of tweets, the number of tweets with URLs, hashtags, and mentions, as well as
the number of retweets and the number of unique users in the selection. To characterize
user activity and visibility, a table is provided which lists the minimum, maximum,
average, and median number of tweets sent, the number of users mentioned, the
number of followers and followees, and the number of URLs tweeted, all per user.
Furthermore, we follow emerging standards in Twitter research and allow for easy
analysis of basic platform elements over time (cf. Bruns and Burgess, 2012). This
includes counts of hashtags, user tweets and mentions, URLs and domain names, as
well as retweets.
Each of these outputs can be segmented into hourly, daily, weekly, monthly, and
yearly intervals; self-chosen intervals – e.g. to delineate distinct periods – are possible
as well and permit fine-grained temporal analysis. Although a full discussion
of the various metrics and their uses has to be deferred to a future publication, it is
important to mention that certain outputs provide deeper analytical perspectives: the
hashtag-user output, for example, not only provides a hashtag count per time interval,
but also the number of distinct users sending tweets containing the hashtag, the
number of distinct users mentioned in tweets with the hashtag, and the number of
tweets mentioning the hashtag.
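As an illustration of such a time-series export, the following hypothetical query counts tweets and distinct users per hashtag per day, assuming the illustrative schema and PDO connection $db sketched earlier; DMI-TCAT's actual queries differ in detail.

<?php
$sql = "
  SELECT DATE(t.created_at)               AS day,
         h.text                           AS hashtag,
         COUNT(*)                         AS tweets,
         COUNT(DISTINCT t.from_user_name) AS distinct_users
  FROM bin_tweets t JOIN bin_hashtags h ON h.tweet_id = t.id
  GROUP BY day, hashtag
  ORDER BY day, tweets DESC";

// Write the result as a standard tabulated export.
$out = fopen('hashtag_frequency.csv', 'w');
fputcsv($out, array('day', 'hashtag', 'tweets', 'distinct_users'));
foreach ($db->query($sql) as $row) {
    fputcsv($out, array($row['day'], $row['hashtag'],
                        $row['tweets'], $row['distinct_users']));
}
fclose($out);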
3.2.3 Tweet exports. This section of the interface groups modules providing lists
of actual tweets for further analysis. A random set of a user-specified number of tweets
can facilitate content analysis by providing a sample of items to be (manually) coded
into categories or otherwise analyzed. It is also possible to simply export all tweets and
their metadata from the current selection or, alternatively, only those that have been
retweeted or come with geo-location data.
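The random-sample export can be sketched in a few lines, again with illustrative names; ORDER BY RAND() suffices at moderate scale, though DMI-TCAT's exact sampling method may differ.

<?php
// Hypothetical sketch: draw 500 random tweets for manual coding.
$n = 500;
$sample = $db->query('SELECT id, created_at, from_user_name, text
                      FROM bin_tweets
                      ORDER BY RAND()
                      LIMIT ' . (int) $n)
             ->fetchAll(PDO::FETCH_ASSOC);

$out = fopen('random_sample.csv', 'w');
fputcsv($out, array('id', 'created_at', 'from_user_name', 'text'));
foreach ($sample as $row) {
    fputcsv($out, $row);
}
fclose($out);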
A statistical exploration of the data via the methods outlined in the previous section,
combined with different analyses of actual tweets enables powerful mixed methods
approaches (cf. Lewis et al., 2013). For example, researchers can export a chronological
list of the most retweeted messages, which is an interesting means to reconstruct and
narrate the timeline of an event (Rogers et al., 2009). From close reading to text mining,
the easy availability of actual tweets is crucial for both quantitative and qualitative
examinations of content.
3.2.4 Networks. The third set of outputs focusses on network perspectives and
produces outputs in either GEXF or GDF format. The difference from statistical
approaches lies not so much in what is being looked at, but rather in how the data are
represented and analyzed. At the moment, the main focus lies on users, hashtags, and
URLs – and the various relationships between these entities. Two outputs represent
interaction networks where users are connected either through mentions or through
direct replies. Because these files can be opened in different graph analysis tools, a wide
variety of social network analysis techniques can be applied. This affords perspectives
on interaction patterns that go beyond mere frequency and allows, for example,
identifying cliques or sub-conversations.
A co-hashtag network output allows for a type of content analysis that focusses on
relationships between these signal words: if two hashtags appear in the same tweet, a
link is established; the more often they co-occur, the stronger the link. By applying
network analysis techniques, one can get an overview of the subject variety in a set of
tweets and analyze relationships between subtopics.
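The underlying logic can be sketched as follows: count hashtag co-occurrences per tweet and write a weighted GDF file that Gephi can open. Names are again illustrative, and DMI-TCAT's real export includes further node and edge attributes.

<?php
// Group hashtags by tweet, assuming the illustrative schema from above.
$by_tweet = array();
$sql = 'SELECT tweet_id, LOWER(text) AS tag FROM bin_hashtags';
foreach ($db->query($sql) as $row) {
    $by_tweet[$row['tweet_id']][] = $row['tag'];
}

// Every pair of hashtags in one tweet yields a link; repetition adds weight.
$nodes = array();
$edges = array();
foreach ($by_tweet as $tags) {
    $tags = array_unique($tags);
    sort($tags);
    for ($i = 0; $i < count($tags); $i++) {
        $nodes[$tags[$i]] = true;
        for ($j = $i + 1; $j < count($tags); $j++) {
            $key = $tags[$i] . ',' . $tags[$j]; // hashtags contain no commas
            $edges[$key] = isset($edges[$key]) ? $edges[$key] + 1 : 1;
        }
    }
}

// Write the GDF file (node list, then weighted edge list).
$gdf  = "nodedef>name VARCHAR\n" . implode("\n", array_keys($nodes)) . "\n";
$gdf .= "edgedef>node1 VARCHAR,node2 VARCHAR,weight DOUBLE\n";
foreach ($edges as $pair => $weight) {
    $gdf .= $pair . ',' . $weight . "\n";
}
file_put_contents('cohashtag.gdf', $gdf);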
Finally, there are a number of bipartite graph outputs – networks containing
entities of two kinds constituted through co-occurrence in a tweet – in particular
hashtag-user and hashtag-URL/domain networks. Figure 4 provides a short example
for the latter. These techniques allow for the structural analysis of relationships
between entities and are particularly useful for locating actors (users or domains) in
relation to issues (hashtags).
[Figure 4. Gephi graph visualization of a bipartite graph from our "datascience" data set: hashtags are in dark gray and domain names in light gray]
3.2.5 Experimental modules. Because DMI-TCAT is modular, it is easy to add new
analytical techniques. A series of experimental modules provide interactive interfaces
or dashboards rather than file exports. A detailed presentation is beyond the scope of
this paper, but our “cascade” module, a means to visually explore temporal structures
and retweet patterns, serves as an example.
This module (Figure 5) provides a ground-level view of tweet activity by either
charting every single tweet in the current selection or only those above a certain
retweet threshold. User accounts are distributed vertically; tweets – shown as
dots – are spread out horizontally over time. Lines indicate retweets. At the top we see
the typical activity pattern of a retweet bot (line of dark dots). This view requires
a large screen and is limited to small data selections, but because tweet text becomes
visible when hovering over a node, it allows for the close reading of a conversation or
debate and, in a sense, links to ethnographic observation.
[Figure 5. Screenshot of the "cascade" module]
4. Conclusions
In this paper, we have described a tool to capture and analyze data from Twitter. We
have shown how particular design decisions can be related to wider considerations
concerning the role of software in academic research. Our proposition is not simply
a “solution” to a set of “problems.” Rather, it is an attempt to connect the question
of toolmaking for social and cultural research to debates dealing with the “politics of
method” (Savage and Burrows, 2007, p. 895) in ways that are not merely theoretical or
critical. Platforms like Twitter pose a number of fundamental challenges to scholars.
Beyond being attentive to these questions, we have to ask how the tools we use to
practice research can proactively take those challenges into account. The canonical
style of both research reporting and technical publication leaves little space to
connect to fundamental interrogations of methodology and its repercussions for the
production of knowledge. But the very nature of computational methods, which deeply
entangle research design with technical work, requires us to engage toolmaking from
different angles. When even small decisions in database design can lead to huge
differences in performance, potentially having profound effects on the way researchers
interact with the tools and data, we realize that even details in implementation can
have substantial epistemic effects. Are we missing a genre of academic text that
permits a combination of technical presentation and general methodological
discussion? The direct relationship between engineering questions and
methodological considerations is a subject that is often neglected and merits much
more critical debate.
This paper does not claim to serve as a blueprint for such an endeavor, but in
order to engage with the enormous methodological challenges, we feel compelled to
experiment with forms of writing and academic expression that attempt to span
various disciplinary traditions, even if this might lead to disorientation and friction
with established conventions.
The design of DMI-TCAT is inspired by exchanges and debates with scholars
from a variety of disciplines and our attempt to propose a flexible and extensible
tool that accommodates a wide array of methodological approaches is directly
motivated by the desire to keep computational work open for various epistemic sensibilities.
Acknowledgements
The authors would like to thank the special issue editors and the anonymous reviewers, as well as Noortje Marres, David Moats, Richard Rogers, Natalia Sanchez Querubin, Emma Uprichard, Lonneke van der Velden, and Esther Weltevrede for their useful comments, and Emile den Tex for his technical contributions. This project has benefited from a grant by the ESRC Digital Social Research Programme and the Centre for the Study of Invention and Social Process (Goldsmiths, University of London).

Notes
1. Available at https://github.com/digitalmethodsinitiative/dmi-tcat (accessed February 19, 2014).
2. http://discovertext.com (accessed September 14, 2013).
3. http://truthy.indiana.edu (accessed September 14, 2013).
4. http://topsy.com (accessed September 14, 2013).
5. http://twitonomy.com (accessed September 14, 2013).
6. https://hootsuite.com (accessed September 14, 2013).
7. https://scraperwiki.com (accessed September 14, 2013).
8. http://cran.r-project.org/web/packages/streamR/ (accessed September 14, 2013).
9. https://github.com/WebEcologyProject/140kit (accessed September 14, 2013).
10. https://github.com/540co/yourTwapperKeeper (accessed September 14, 2013).
11. DMI-TCAT stores every field returned by Twitter (around 40 per tweet – if the tweet
contains mentions, hashtags, and URLs) while yTK only stores 13 of the most basic fields
per tweet and excludes fields such as retweet_id, in_reply_to_status_id, entities, and many
fields related to the sender of the tweet.
12. The API documentation is thus an essential part of DMI-TCAT’s documentation. https://
dev.twitter.com/docs/platform-objects (accessed September 1, 2013) specifies Twitter’s
entities and possible actions.
13. tmhOAuth by Matt Harris is available at https://github.com/themattharris/tmhOAuth. It
implements all possible calls to the Twitter APIs in PHP (accessed September 1, 2013).
14. For an explanation of the differences, see https://dev.twitter.com/docs/streaming-apis
(accessed September 12, 2013).
15. https://dev.twitter.com/docs/api/1.1/get/statuses/sample (accessed September 12, 2013).
16. https://dev.twitter.com/docs/api/1.1/post/statuses/filter (accessed September 12, 2013).
17. Tracking criteria follow https://dev.twitter.com/docs/streaming-apis/parameters#track
(accessed September 12, 2013).
18. https://dev.twitter.com/docs/api/1.1 (accessed September 12, 2013).
19. “The Search API is not complete index [sic] of all Tweets, but instead an index of recent
Tweets. At the moment that index includes between 6-9 days of Tweets.” https://dev.twitter.
com/docs/using-search (accessed September 12, 2013).
20. According to I.4.A from https://dev.twitter.com/terms/api-terms (accessed September 10,
2013), “If you provide downloadable datasets of Twitter Content or an API that returns
Twitter Content, you may only return IDs (including tweet IDs and user IDs).”
21. We aim to always incorporate all fields returned by Twitter’s APIs. We store all data in
UTF-8 and are thus able to capture and analyze tweets in any language. See https://dev.
twitter.com/docs/counting-characters (accessed September 10, 2013) for more information.
22. http://en.wikipedia.org/wiki/LAMP_%28software_bundle%29 (accessed September 12,
2013). DMI-TCAT has been tested on Linux and OSX.
23. http://klout.com/corp/how-it-works (accessed September 7, 2013).
24. Available as free software on http://gephi.org (accessed September 12, 2013).
25. We identify retweets by using Twitter’s retweet_id API field as well as by grouping
“identical” tweets, thus also including possible manual retweets.
References
Boyd, D. and Crawford, K. (2012), “Critical questions for big data”, Information, Communication
& Society, Vol. 15 No. 5, pp. 662-679.
Bruns, A. and Burgess, J. (2012), “Researching news discussion on Twitter: new methodologies”,
Journalism Studies, Vol. 13 Nos 5-6, pp. 801-814.
Bruns, A. and Liang, Y.E. (2012), “Tools and methods for capturing Twitter data during natural
disasters”, First Monday, Vol. 17 No. 4, p. 5, available at: http://firstmonday.org/ojs/index.
php/fm/article/view/3937/3193 (accessed February 26, 2014).
Cioffi-Revilla, C. (2010), “Computational social science”, Wiley Interdisciplinary Reviews:
Computational Statistics, Vol. 2 No. 3, pp. 259-271.
Gerlitz, C. and Rieder, B. (2013), “Mining one percent of twitter: collections, baselines, sampling”,
M/C Journal, Vol. 16 No. 2, available at: www.journal.media-culture.org.au/index.php/
mcjournal/article/view/620 (accessed February 26, 2014).
Gitelman, L. (Ed.) (2013), 'Raw Data' Is an Oxymoron, MIT Press, Cambridge, MA.
Heer, J. and Shneiderman, B. (2012), “Interactive dynamics for visual analysis”, Queue, Vol. 10
No. 2, pp. 30-55.
Kirschenbaum, M.G. (2010), “What is digital humanities and what’s it doing in English
departments?”, ADE Bulletin, Vol. 150 No. 7, pp. 55-61.
Latour, B. (2005), Reassembling the Social: An Introduction to Actor-Network-Theory, Oxford
University Press, New York, NY.
Lazer, D., Pentland, A.(S.), Adamic, L., Aral, S., Barabasi, A.L., Brewer, D., Christakis, N.,
Contractor, N., Fowler, J., Gutmann, M., Jebara, T., King, G., Macy, M., Roy, D. and
Van Alstyne, M. (2009), “Life in the network: the coming age of computational social science”,
Science, Vol. 323 No. 5915, pp. 721-723.
Lewis, S.C., Zamith, R. and Hermida, A. (2013), “Content analysis in an era of big data: a hybrid
approach to computational and manual methods”, Journal of Broadcasting & Electronic
Media, Vol. 57 No. 1, pp. 34-52.
McKelvey, K. and Menczer, F. (2013), “Design and prototyping of a social media observatory”,
Proceedings of the 22nd International Conference on World Wide Web Companion,
Rio de Janeiro, International World Wide Web Conferences Steering Committee, Geneva.
Marres, N. (2012), “The redistribution of methods: on intervention in digital social research,
broadly conceived”, The Sociological Review, Vol. 60 No. S1, pp. 139-165.
Puschmann, C. and Burgess, J. (2014), “The politics of twitter data”, in Weller, K., et al. (Eds),
Twitter and Society, Peter Lang Publishing, New York, NY, pp. 43-54.
Rieder, B. and Röhle, T. (2012), “Digital methods: five challenges”, in Berry, D.M. (Ed.),
Understanding Digital Humanities, Palgrave Macmillan, Basingstoke, pp. 67-84.
Rogers, R. (2010), “Mapping public web space with the Issuecrawler”, in Brossard, C. and
Reber, B. (Eds), Digital Cognitive Technologies: Epistemology and Knowledge Society, Wiley,
London, pp. 115-126.
Rogers, R. (2013), Digital Methods, MIT Press, Cambridge, MA.
Rogers, R., Jansen, F., Stevenson, M. and Weltevrede, E. (2009), “Mapping democracy”, in
Finlay, A. (Ed.), Global Information Society Watch 2009, Association for Progressive
Communications and Hivos, Uruguay, pp. 47-57, available at: www.giswatch.org/fr/node/
158 (accessed February 26, 2014).
Savage, M. and Burrows, R. (2007), “The coming crisis of empirical sociology”, Sociology, Vol. 41
No. 5, pp. 885-899.
Thrift, N. (2005), Knowing Capitalism, Sage, London.
Uprichard, E. (2013), “Sampling: bridging probability and non-probability designs”, International
Journal of Social Research Methodology, Vol. 16 No. 1, pp. 1-11.
Uprichard, E., Burrows, R. and Byrne, D. (2008), “SPSS as an “inscription device”: from causality
to description?”, The Sociological Review, Vol. 56 No. 4, pp. 606-622.
Winner, L. (1980), “Do artifacts have politics?”, Daedalus, Vol. 109 No. 1, pp. 121-136.
About the authors
Erik Borra is a PhD Candidate and Lecturer at the University of Amsterdam’s MA Program in
New Media. His research concerns the Web as a source of data for social and cultural research,
paying particular attention to search engine queries, Wikipedia edit histories, and social
networks. Erik is also a scientific programmer for the Digital Methods Initiative and is currently
involved in the European research project “Electronic Maps to Assist Public Science” (EMAPS).
Erik Borra is the corresponding author and can be contacted at: firstname.lastname@example.org
Dr Bernhard Rieder is an Associate Professor of New Media at the University of Amsterdam.
Besides developing and theorizing digital methods, his research focusses on the history, theory,
and politics of software, particularly on the role of algorithms in social processes and the
production of knowledge. He has worked as a Web Programmer on various projects and is
currently writing a book that investigates the history and cultural significance of information