Have We Even Solved the First ‘Big Data Challenge?’ Practical Issues Concerning Data Collection and Visual Representation for Social Media Analytics

PRE-PRINT VERSION
Published: Phillip Brooker, Julie Barnett, Timothy Cribbin & Sanjay Sharma (2015) ‘Have we
even solved the first 'big data challenge?' Practical issues concerning data collection & visual
representation for social media analytics’, in H. Snee et al. (Eds) Digital Methods for Social
Science: An Interdisciplinary Guide to Research Innovation. London: Palgrave Macmillan.
ISBN: 9781137453655.
Abstract
The present chapter explores the technical and computational processes
through which social media data is shaped into research findings. The authors
make this argument by depicting the effects of two practical issues - API rate
limiting in Twitter data collection and the use of spatial mapping algorithms in
visualising those data - on resulting analyses. Such issues are not problematic
to social media analytic research; rather, they can be used as resources for
helping to characterise and understand the data at hand. Hence, the authors
work to demonstrate the value in incorporating these reflexive analyses of
technical and computational processes into our accounts; to advocate
thinking in assemblages as a requirement for making analytic claims with 'big'
social media data.
Introduction
Thanks to an influx of data collection and analytic software, harvesting and visualising 'big'
social media data[1] is becoming increasingly feasible as a method for social science
researchers. Yet whilst there is an emerging body of work utilising social media as a data
resource, there are a number of computational issues affecting data collection. These issues
may problematise any conclusions we draw from our research work, yet for the most part,
they remain hidden from the researcher's view. We contribute towards the burgeoning
literature which critically addresses various fundamental concerns with Big Data (see boyd
and Crawford, 2012; Murthy, 2013; and Rogers, 2013). However, rather than focus on
epistemological, political or theoretical issues – these areas are very ably accounted for by
the authors listed above, and others – we engage with a different concern: how technical
aspects of computational tools for capturing and handling social media data may impact our
readings of it. This chapter outlines and explores two such technical issues as they occur for
data taken from Twitter.
Throughout the chapter, we demonstrate a perspective consistent with Procter et al.'s
(2013) view that social researchers wishing to make sense of 'big' social media data should
have sufficient knowledge of the underlying concepts of the computational methods and
tools they are using, so as to be able to decide when and where to appropriately apply
them. Furthermore, we take heed of boyd and Crawford's suggestion that 'When
researchers approach a data set, they need to understand and publicly account for not
only the limits of the data set, but also the limits of which questions they can ask of a data
set and what interpretations are appropriate' (2012: 669-70). To this end, we highlight how
certain technical characteristics and constraints pertaining to the collection and processing
of Twitter data can impact on research and how an understanding of these factors might
lead to more robust accounts of such data.
Our aim is to demonstrate the mainstream relevance of a commonplace methodological
procedure in the social sciences, namely the self-critical reflexive analysis of our methods in
terms of their impact on our accounts of the subjects we study. Our goal here is to show the
importance of understanding the effects that technical processes may have on our readings
of data for all social scientists, not just for those with a background in computer science.
Without this understanding it is impossible to make sense of the data at hand. Hence, we
promote the idea of thinking of the investigative process as an 'assemblage' (Langlois, 2011;
Sharma, 2013) that draws together various social and technical (and other) factors into a
unified research process. Here, we refer to the ways in which the research process comes to
feature not only conceptual theoretical knowledge and inductive empirical findings, but also
how technical issues (such as API rate limiting and spatial mapping algorithms) contribute
towards the production of knowledge in multifarious complex ways. How such an
assemblage might operate will become clearer as we present our selected two technical
issues, and in the discussion that follows.
Reviewing the Field
The State of Social Media Analytics
With the field of social media analytics still in relative infancy there are few methodological
practices taken as standard. The tendency thus far has been to fit digital data to existing
'offline' ways of working. As O'Connor et al. note:
Often the online researcher has little in the way of research precedent to
use as a guide to practice online research and, as a result, online researchers
frequently turn to established offline practices (O'Connor et al., 2008: 276).
Working from this position, several authors characterise social media data as a special kind
of 'offline' social science data. For example, Hine argues that a key concern of social media
analytics is to avoid a loss of quality in data; 'Face-to-face interaction here becomes the gold
standard against which the performance of computer-mediated interaction is judged' (Hine,
2006: 4). This quality problem of social media data is a concern shared by many, with
comments being levelled at 'the lack of uniformity in how users fill in forms, fields, boxes
and other text entry spaces' (Rogers, 2013: 205); the representativeness and validity of the
data more generally (Tufekci, 2014); and the fact that the production of data is not
controlled and led by researchers but appears untamed 'in the wild' (Kitchin, 2013). Kitchin
notes:
The challenge of analysing Big Data is coping with abundance, exhaustivity
and variety, timeliness and dynamism, messiness and uncertainty, high
relationality, and the fact that much of what is generated has no specific
question in mind or is a by-product of another activity (Kitchin, 2014: 2).
Given the uncertain relationship between digital and 'offline' methods, it becomes
important to explore possible ways of rendering visible the characteristics of digital methods
to see how and where they fit into existing methodological practices. Our proposed
treatment of such data embraces the 'digitality' of researching in this area by advocating a
greater working familiarity with computational tools and methods. Emphatically, this is not
to say that the work of social media analytics can be reduced to the rote operating of
software (see Keegan, 2013). Kitchin summarises this tension:
For many...the digital humanities is fostering weak, surface analysis, rather
than deep, penetrating insight. It is overly reductionist and crude in its
techniques, sacrificing complexity, specificity, context, depth and critique for
scale, breadth, automation, descriptive patterns and the impression that
interpretation does not require deep contextual knowledge. (Kitchin, 2014:
8)
Yet we do not believe this is necessarily how social media analytics has to operate. On the
contrary, we advocate a mode of reading data that allows computational methods to pick
out areas of potential interest which then might be explored more intimately through closer
readings. To do this requires an understanding of how social media data can be affected by
technical processes as part of a wider assemblage. As Lewis et al. note, 'As scholars
increasingly take up such datasets, they may be better served by interweaving
computational and manual approaches throughout the overall process in order to maximise
the precision of algorithms and the context-sensitive evaluations of human coders' (2013:
49). By outlining how this kind of research process works 'on the shop floor', we hope to
foster a way of thinking about such technical issues which might facilitate the mainstream
usage of digital research methods generally.
We take steps towards respecifying 'big' social media work in this way by concentrating on
two issues. Firstly we demonstrate the effects of a data collection issue – the rate limiting of
Twitter's Application Programming Interfaces (APIs), which is a built-in restriction on the
flow of Twitter data that may interfere with analyses in significant ways. Secondly, we
remark upon the ways in which computational models (and the visualisations that represent
them) might shape our analytic readings of data. Those already working in the field are well
aware of these concerns, yet they do not routinely feature in published accounts of relevant
work. Consequently, such issues may stand as a barrier to entry by steepening an already
steep learning curve. Hence, we openly discuss two such issues, not necessarily as
problematic to social media analytics but as presenting an opportunity to make better use
of new and powerful data resources.
Addressing the First 'Big Data Challenge'
Before doing so it is useful to describe what we are referring to in the title as the First 'Big
Data Challenge'. One much vaunted promise of 'Big Data' is that we all now have the means
to access data from sources like Twitter, and to engage in analytic work on large data
corpora through processing that data into easily-digestible visualisations. Moreover, this
work does not necessarily require any special skill with computer science or programming –
there is a wealth of freely-available third-party software tools to do the 'behind the scenes'
technical work for us[2]. In this sense, the First 'Big Data Challenge' refers to a) having easy
access to big data, and b) the availability of tools that facilitate its analysis. Our question in
the title – whether or not we are yet in a position to close the book on this first challenge –
demonstrates our intention to probe such matters further: can we tap vast data resources
unfiltered? Is it really as simple as employing visual models to show us the results? Such
concerns are worked out through the process of doing social media analytics and acquiring
necessary relevant skills along the way. Our approach here is to more accountably explore
this process of doing social media analytics to help figure out what might count as
appropriate methods and methodologies.
But why is it important to render transparent what might be argued to be mundane
computational issues? Doing social media analytics with Twitter data necessitates an
interfacing with the mechanisms governing how users access Twitter data: the Twitter
Application Programming Interfaces (APIs). These APIs allow users to request access to
certain slices of Twitter data, according to various search strategies (e.g. by keyword, by
bibliographic/demographic information, by random selection, and so on). Moreover, once
investigators have personal copies of these data, they may then subject them to further
algorithmic processes to make their 'Big Data' analysable, e.g. in the rendering of statistical
information or in the production of visualisations and so on. In this way, computational
processes come to feature as essential elements in the production and construction of our
data and analyses. As Marres notes, this bringing together of different disciplinary ideas can
be equally productive and constricting:
[Digital] social research is noticeably marked by informational practices and
devices not of its own making, from the analytic measures built into online
platforms (e.g. the number of links, number of mentionings, follower counts),
to the visual forms embedded in visualization modules (the tag cloud). Online
social research is visibly a distributed accomplishment (Marres, 2012: 160).
This intertwining of technical issues and research methods is foregrounded for the social
sciences by Fielding and Lee, who argue that 'Social science has demonstrated how
technology both shapes and is shaped by social action. Research methods are no exception'
(2008: 505). As such, since there are computational processes governing data collection and
analysis, we may find ourselves better-armed to undertake research in the area if we
understand some of the finer points about how these tools and processes work. How
exactly do they restrict the data we can harvest? And how exactly do they shape the
statistics and visualisations and other analytic outputs which we use to understand that
data?
To this end, we now turn to a more pointed examination of two such issues – the possible
effects of API rate limiting on data collected through Twitter, and the possible effects of
spatial mapping algorithms on data visualisations – as they occur through the usage of a
bespoke social media analytics software suite, Chorus[3]. This serves to demonstrate the kinds
of issues investigators may find themselves contending with, as well as helping figure out
ways of handling them methodologically. Our reflexive focus on the research process itself is
very much a mainstream methodological practice of social scientists[4] – we seek to take a
self-critical view on the (opaque) process of undertaking research involving data collection
through the Twitter APIs and data visualisation using spatial mapping algorithms. However,
our approach sees such limitations not as obstacles to research to be overcome. Rather, we
discuss these issues as an exercise in learning the tools of the trade of social media analytics
and understanding how they construct the data we analyse, so as to be better able to deal
with them as part of our work.
Two Practical Issues
API Rate Limiting in Twitter Data Collection
Twitter data collection is a process mediated through Twitter's various APIs. For the
purposes of social media analytics, the APIs are the tools by which users can make requests
for specific types of data. This process of using Twitter's APIs to access data necessitates
that users write requests as a RESTful statement to return responses in a data interchange
format called JSON[5], or that users take advantage of a third-party client which facilitates the
task for non-coders. However, though comprehensive collections of data are available for
purchase through data archiving services such as Gnip or DataSift, the Twitter APIs are not
completely openly available to users and developers. In fact, several restrictions on their
usage are in place; ostensibly this is to prevent misuse of the service by individual users.
One such restriction is Twitter's 'rate limiting' of an individual account's API usage, or the
rate at which a user can poll the API with requests for data matching a search query. Each
API has different rate limits and different software tools will handle those limits in different
ways[6]. Providing concrete and quantifiable definitions of these limits is, however, a hopeless
task. The 'Search API' (that our example below draws upon) is fairly well-documented in
terms of its limits, as we go on to discuss. However, the usage restrictions of all Twitter APIs
are variably dependent on contextual information; chiefly the volume of data any queries
will yield. Hence, for APIs other than Search, Twitter does not provide information as to the
exact limits they will impose. All of this makes understanding and navigating through the
Twitter APIs a labyrinthine process. We use data collected via the Search API – which
handles data based on keywords appearing in tweets and is the most commonly-used
Twitter API – to demonstrate how the API itself inevitably comes to feature in a burgeoning
assemblage built up by the research as it is undertaken.
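To make the request-and-response cycle described above more concrete, the following is a minimal sketch in Python. It assumes the v1.1 Search endpoint and its `q` and `count` parameters as they stood around the time of writing; the response body is a hypothetical, heavily trimmed sample, and the helper names `build_search_url` and `parse_statuses` are ours, for illustration only. No request is actually sent, since live requests additionally require authentication and count against the rate limit.

```python
import json
from urllib.parse import urlencode

# Assumption: the v1.1 Search endpoint, as available around the time of writing.
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"


def build_search_url(term, count=100):
    """Build a RESTful request URL asking for tweets that contain `term`."""
    return SEARCH_URL + "?" + urlencode({"q": term, "count": count})


# Hypothetical, heavily trimmed response body (real responses carry many more fields).
SAMPLE_RESPONSE = """
{"statuses": [
  {"id_str": "1", "created_at": "Wed Jul 23 14:40:12 +0000 2014", "text": "example tweet one"},
  {"id_str": "2", "created_at": "Wed Jul 23 14:41:03 +0000 2014", "text": "example tweet two"}
]}
"""


def parse_statuses(body):
    """Decode a JSON response body into (id, timestamp, text) tuples."""
    payload = json.loads(body)
    return [
        (status["id_str"], status["created_at"], status["text"])
        for status in payload.get("statuses", [])
    ]


request_url = build_search_url("black")
tweets = parse_statuses(SAMPLE_RESPONSE)
print(request_url)
print(len(tweets), "tweets decoded")
```

A third-party client such as TCD performs steps of this kind on the researcher's behalf, which is precisely why the limits attached to each request can remain out of view.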
At the time of writing, Twitter's Search API allows Chorus' data collection programme,
TweetCatcher Desktop (TCD), to make 450 requests every fifteen minutes. Each such
request has the potential to capture a maximum of 100 tweets. The Search API allows users
to retrieve historical data up to seven days prior to the initial request. On 23 July 2014, using
TCD we ran a very general search query for all usages of the term 'black', as a way of
exploring the topics and sub-topics of racialised tweet content. To capture the data, we
refreshed the query at various points over an approximately four-hour period so as to
ensure as comprehensive a dataset as the API would allow. This resulted in a dataset of
28,317 tweets featuring the term 'black'. Plotting the data over time in half-hour intervals
within Chorus' visual analytic programme TweetVis, it was clear that there were gaps in the
chronology of the conversation:
Figure 1: Chart to show volume of tweets mentioning “black” across time (half-hour
intervals)
What we see in Figure 1 is a striking reminder that Twitter's APIs are restricted, and that this
may have significant effects on the data we wish to capture through them. For high-volume
queries it is easy to come up against Twitter's API rate limits, such as during searches for
general terms like 'black', as well as for trending terms (e.g. 'Obama' in the run-up to the
2012 US presidential election). In this example, we were simply unable to keep up with the
pace of people's tweets; we were able to capture an average of 118 tweets per minute over
the four-hour period, whereas the actual conversation skipped along at between 450 and 550
tweets per minute. Naturally, this left a sizeable chunk of data missing from our dataset (see
Figure 1). However, what is less immediately obvious in this rendering of the data is the
presence of other breaks in the flow of the conversation we captured, which become more
apparent when viewing at a finer level of granularity:
Figure 2: Chart to show volume of tweets mentioning “black” across time (two-minute
intervals)
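The capture figures reported above can be checked with some simple arithmetic. The sketch below uses only numbers given in the text (28,317 tweets; a conversation running at 450 to 550 tweets per minute) and makes one assumption of our own: that the "approximately four-hour" collection period is treated as exactly 240 minutes. The derived shares are therefore indicative rather than exact.

```python
# Figures taken from the text above.
CAPTURED_TWEETS = 28317
CONVERSATION_LOW = 450    # tweets per minute (lower estimate)
CONVERSATION_HIGH = 550   # tweets per minute (upper estimate)

# Assumption: "approximately four hours" is treated as exactly 240 minutes.
COLLECTION_MINUTES = 4 * 60

captured_per_minute = CAPTURED_TWEETS / COLLECTION_MINUTES

# Share of the conversation captured under each estimate of its true pace.
share_if_low = captured_per_minute / CONVERSATION_LOW
share_if_high = captured_per_minute / CONVERSATION_HIGH

print(f"captured per minute: {captured_per_minute:.0f}")
print(f"share captured: {share_if_high:.0%} to {share_if_low:.0%}")
```

Under these assumptions the dataset holds roughly a fifth to a quarter of the tweets posted during the collection period.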
In Figure 2, we see the same data grouped into intervals of two minutes. Here we can
identify the same gap in the data as in Figure 1, but also an earlier gap which was previously
obscured when viewed with our earlier half-hour intervals. Hence, we now can detect a
probable disruption to the flow of data between 14:44:59 and 14:58:59 where only a
consistent chronology was visible (or at least presumed) before. It may simply be that
people tweeted fewer times during these minutes, though it is equally possible that it is at
this point we were reaching the limits of what the API would allow us to see. It is in fact
impossible to figure out what has happened from the data or visualisations themselves.
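The effect of granularity described here can be reproduced on a small scale. The following sketch is not how TweetVis works internally; it simply counts timestamps per interval at two different widths and lists the intervals that are empty. The data are hypothetical (one tweet every 30 seconds with a fourteen-minute hole), and the function names `bin_counts` and `empty_bins` are our own.

```python
from collections import Counter
from datetime import datetime, timedelta


def bin_counts(timestamps, width):
    """Count timestamps per interval of `width`, measured from the earliest one."""
    origin = min(timestamps)
    return Counter((ts - origin) // width for ts in timestamps)


def empty_bins(timestamps, width):
    """Return the indices of intervals that contain no timestamps at all."""
    counts = bin_counts(timestamps, width)
    return [i for i in range(max(counts) + 1) if counts[i] == 0]


# Hypothetical data: one tweet every 30 seconds from 14:00:00 to 15:29:30,
# with everything from 14:45:00 up to (not including) 14:59:00 removed.
start = datetime(2014, 7, 23, 14, 0, 0)
gap_from = datetime(2014, 7, 23, 14, 45, 0)
gap_to = datetime(2014, 7, 23, 14, 59, 0)
timestamps = [
    ts
    for ts in (start + timedelta(seconds=30 * k) for k in range(180))
    if not (gap_from <= ts < gap_to)
]

coarse_gaps = empty_bins(timestamps, timedelta(minutes=30))
fine_gaps = empty_bins(timestamps, timedelta(minutes=2))
print("empty half-hour intervals:", coarse_gaps)
print("empty two-minute intervals:", fine_gaps)
```

With half-hour intervals no interval is empty, so the hole is invisible; with two-minute intervals six consecutive empty intervals appear. As noted above, an empty interval cannot by itself show whether rate limiting or a genuine lull in tweeting was responsible.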
What does this ambiguity mean for social media analytics and social research involving
Twitter data? A key insight to draw from this demonstration is that comprehensive
collections of Twitter data are not freely available to researchers. Even where we may
assume we are capturing the entirety of a conversation, drilling down into finer levels of
granularity may show us otherwise. Furthermore, failing to recognise when these rate
limiting issues have occurred may be detrimental to the analyses we draw from our data.
Without due care and attention, we may find ourselves using falsely-derived
conceptualisations of data as chronologically consistent[7].
It is important here to acknowledge that access to social media data is a highly politicised
issue largely driven by commercial concerns (boyd and Crawford, 2012; Rogers, 2013). In
this sense, it is a fallacy to believe that any data which is collected through Twitter's APIs
(rather than purchased) is complete: incompleteness and unrepresentativeness are
fundamental features purposefully built into the APIs to protect the primacy of Twitter's
approved data providers. Recognising, understanding and accounting for this is a key step in
acknowledging the research process as an unfurling assemblage of interconnected socio-
technical entities (of which the API is one, alongside any software and hardware used in the
undertaking of the work, the social media users whose posts make up the data, any social
theories we use to interpret the resulting findings, and so on). However, the incompleteness
and unrepresentativeness of social media data do not prevent us from accessing
meaningful insights. It is worth questioning our fetishising of data in this respect – what do
we need a chronologically complete dataset for? And what can we do without one? Rather
than bemoan the purposes for which our data cannot be used, it may be more productive to
explore what it can do. Though the methodological and analytic possibilities are impossible
to encapsulate fully in the present chapter (in that they will depend largely on the questions
being addressed), one such approach is advanced in the following section. However, the
salient point remains that perhaps the best way to make sense of data is to attain a deep
understanding of how a dataset has been constructed, and use that understanding as a
resource for designing appropriate analytic approaches for dealing with it.
Spatial Mapping Algorithms in Twitter Data Visualisation
Clearly there are issues concerning data collection of which researchers in social media
analytics would do well to be aware. However, our endeavours in the field have also
revealed similar technical issues in data visualisation, where collected data is given an
analytic relevance through algorithmic processing. There are already multitudinous tools for
visualising social media data – Gephi, NodeXL and Chorus for instance. Some of these utilise
spatial mapping algorithms – computational processes through which entities such as
individual words or connected arrays of tweeters (or indeed any other kind of 'node') are
located on a 2D visual plane in relation to each other. Though each software package
operates uniquely, a unifying feature of these algorithms is their use of mathematical
reasoning as a way of representing distinctions between nodes. For example, the Chorus
visual analytic suite – TweetVis – features (amongst other visualisations) a topical 'cluster
map' which uses a spatial mapping algorithm to plot individual words contained within
tweets according to how frequently each word co-occurs with other words in the
dataset. In this map, words which commonly co-occur in tweets cluster together, thereby
forming distinct topical clusters through which users can navigate and explore. Here, the
algorithm is what constructs and constrains the map, and for users trying to read the
visualisation, the constructions and constraints of the map become an integral part of the
resulting analysis. We demonstrate the possible effects of this algorithmic constructing and
constraining on analyses, showing how an understanding of the technical goings-on of a
data visualisation is a necessary requirement for those wishing to view it sensibly through
the lens of an assemblage.
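The raw material for a map of this kind is a count of how often pairs of words appear in the same tweet. The sketch below shows only that counting step, under simplifying assumptions of our own (lower-casing and whitespace tokenisation, hypothetical example tweets). It is not Chorus' algorithm, and it does not attempt the spatial layout that a tool would then derive from such counts.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tweets):
    """Count, for each pair of words, the number of tweets containing both."""
    counts = Counter()
    for text in tweets:
        # Each word counts once per tweet; sorting gives every pair a fixed order.
        words = sorted(set(text.lower().split()))
        counts.update(combinations(words, 2))
    return counts


# Hypothetical tweets, for illustration only.
example_tweets = [
    "that joke was #funny #lol",
    "#lol so #funny",
    "just a #fact",
]

pairs = cooccurrence_counts(example_tweets)
print(pairs[("#funny", "#lol")])   # pair appearing in two tweets
print(pairs[("#fact", "#funny")])  # pair never appearing together
```

Pairs with high counts are the ones a cluster map would be expected to draw together, while pairs that never co-occur give the layout nothing to pull on.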
To exemplify what the effects of a spatial mapping algorithm might look like in the
undertaking of a social media analytics project, we draw on previous work[8] on 'racialised
hashtags' (in particular, the hashtag #notracist). With a dataset collecting all usages of the
term #notracist over an eight-month period (resulting in 24,853 tweets), we plotted a
cluster map of hashtags, to see which hashtags featured together more commonly:
Figure 3: Hashtag map of #notracist (labels given to hashtags featuring in ≥1% of tweets)
What this map showed us is that there are two distinct types of hashtag being used across
the #notracist dataset: firstly, a collection of 'comedy' hashtags (including #funny and #lol
and others) located in a tightly-organised cluster near the centre of the map; and secondly,
an array of 'truth' hashtags (including #justsayin, #justsaying, #iswear, #truth, #fact and
others) appearing on the fringes of topical branches around the outskirts of the
visualisation[9]. Given that these different themes are located in different areas of the map –
'comedy' in a tight central cluster and 'truth' out towards the fringes of branches – we
developed an interest in finding out what exactly this difference might be. To drill into the
data further, we filtered the master #notracist dataset into two sub-sets: one containing
'comedy' hashtags and one containing 'truth' hashtags. We then plotted cluster maps for all
terms contained within either dataset:
Figure 4: 'Comedy' term map (left) and 'truth' term map (right) (labels given to terms
featuring in ≥4% of tweets)
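The filtering step that produced the two sub-sets can be sketched as follows. The hashtag lists contain only the examples named in the text (the full lists included others), the tweets are hypothetical, and the helper names are ours; the sketch illustrates the principle rather than the procedure actually followed within Chorus.

```python
import re

# Only the hashtags named in the text; the lists actually used included others.
COMEDY_TAGS = {"#funny", "#lol"}
TRUTH_TAGS = {"#justsayin", "#justsaying", "#iswear", "#truth", "#fact"}

HASHTAG_PATTERN = re.compile(r"#\w+")


def hashtags(text):
    """Return the set of lower-cased hashtags found in a tweet."""
    return {tag.lower() for tag in HASHTAG_PATTERN.findall(text)}


def subset(tweets, tags):
    """Keep the tweets that use at least one of the given hashtags."""
    return [tweet for tweet in tweets if hashtags(tweet) & tags]


# Hypothetical tweets, for illustration only.
example_tweets = [
    "only joking #notracist #lol",
    "it is what it is #notracist #truth",
    "#notracist #funny but #justsayin",
    "plain #notracist",
]

comedy_tweets = subset(example_tweets, COMEDY_TAGS)
truth_tweets = subset(example_tweets, TRUTH_TAGS)
print(len(comedy_tweets), "comedy tweets;", len(truth_tweets), "truth tweets")
```

Note that under this rule a tweet using hashtags of both kinds falls into both sub-sets, which is one of the decisions an analyst would need to account for when comparing the resulting maps.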
It is clear that these two visualisations are very different from one another – the 'comedy'
map is a messy aggregation of highly interconnected terms, whereas terms in the 'truth'
map are densely populated around the outer fringes of connecting branches. In order to
interpret what the two contrasting maps were telling us, we relied on an understanding of
what the algorithm had done with the data points. For the 'comedy' map, the terms used in
tweets were highly related to each other, showing that #notracist comedy was a practice of
tweeting done with lots of different terms and hashtags being used in similar ways (i.e.
there are a number of 'stock' formats through which tweeters could accountably claim to be
'doing a joke'). However, for the 'truth' map, our understanding of what we could see in the
visualisation relied upon an understanding of the functions of Chorus' spatial mapping
algorithms. In this map, terms are chiefly located on the outer edges, as far apart from each
other as the algorithm will allow. This is demonstrated by the tree-like appearances of
topical branches, with virtually no connecting terms in the centre but a high density of
terms pushed out towards the edges of the 2D plane. This visible pushing of the boundaries
of the algorithm tells us a lot about the data we were working with. The terms used in the
'truth' tweets we identified are typically disconnected from each other and not used
together, and we can begin to characterise #notracist 'truth' tweets as evidence of a topic
that is not implicitly agreed-upon and which reflects a diverse array of strategies for
justifying a #notracist claim as a statement.
Again, we ask: what does this mean for social media analytics? Visualisations such as those
discussed here are designed to fit data into a model which serves to constrain our data and
analytic materials as well as give them visible structure[10]. Hence, our aim in describing the
processes through which such models construct and constrain analyses is to set out a
requirement for social media analytics that it explicitly account for these processes. For
example, in the #notracist data described above, it was fundamental to our understanding
of the data that we could recognise that 'truth' terms were pushing the Chorus cluster map
algorithm to its limits, and consequently use that information as a way of figuring out what
the 'truth' conversation was about. Crucially, the fact of our data being processed through
the algorithm in a certain way is precisely what helped us get to grips with what lay at the
heart of that data. Without this we would have been lost. Consequently we reiterate that
these intrusions of computational, technical and mathematical processes into our analyses
are not something to be resisted. Rather, they are necessary and productive elements of
social media analytics without which we would be unable to characterise the assemblage
through which the analysis had been shaped, and ultimately, unable to make adequate
sense of the data at hand.
Concluding Remarks
In this chapter, we have outlined how social media analytics incorporates data collection
and analytic work in ways which are thoroughly reliant on computational and technical
processes. Despite the provocative manner in which the question in the title of this piece was
posed, we believe investigators in the field are in a position to sensibly account for their
work in terms of these processes. Yet these issues are not something we have seen
discussed explicitly in the accounts produced thus far. Hence, though our points may be
frustratingly mundane to our peers (who may question why researchers would want to
write about the inner workings of APIs and algorithms) we nonetheless think it valuable to
discuss such things transparently as a means of promoting healthy and robust
methodologies for the emerging field.
To that end, rather than depict software tools in general and in abstract, we have
exemplified our ideas with reference to two specific issues arising out of the use of just one
software package. This, we hope, gives a flavour of the kinds of issues of which researchers
must be aware when working with digital data and associated software tools. It is our hope
that our accounts of two specific examples can demonstrate just how these kinds of issues
intersect with research work in a very direct way. Working in this way, we have shown how
technical and computational processes become a 'necessary evil' of the work. Only there is
nothing 'evil' about them. Rather, these same processes can be used as resources for
conducting (and figuring out how to conduct) analytic work in appropriate ways. However,
the use of these processes as analytic resources relies on our having a deep understanding
of what they are doing with our data, else we risk wrong-footing our analyses before we
even begin. As Manovich notes:
...you must have the right background to make use of this data. The
[analytic] software itself is free and readily available...but you need the
right training...and prior practical experience to get meaningful results
(Manovich, 2012: 472).
It is clear that it now becomes our job as researchers to equip ourselves with these
understandings of the technical processes on which our work relies, however much this may
take us outside of typical disciplinary boundaries.
All of this may make the analysis of social media data an infinitely more complex issue, in
that we are no longer really analysing only the data, but an assemblage (Langlois, 2011;
Sharma, 2013) of technical and social processes which coalesce to form the datasets and
visualisations we find before us. Concerning data collection, we have used the idea of an
assemblage to outline how the technical aspects of API rate limiting become built into social
media analytics research from the very beginning of the research process. Concerning data
visualisation and analysis, our described assemblage relied upon our conceptualisation of
computational processes as having (by necessity) a commitment to numerical
'understandings' of data and how those 'understandings' are translated into images to be
read by human researchers. We have no doubt that Chorus' way of doing things is only one
amongst many, and other such issues will invariably arise in a multitude of different ways
when doing social media analytics with other tools. As with any software tool, Chorus is not
'just a tool' – it engenders a particular way of thinking about social media data which
constructs and constrains analyses in equal amounts. As such, the modest goal of this
chapter has been to encourage readers to consider their research work not only in terms of
the results and findings to be drawn, but in relation to the myriad processes through which
those findings are mediated throughout the endeavour. We advocate thinking in
assemblages as a requirement for social media analytics generally. Furthermore, we have
shown the relevance of assemblages for mainstream purposes, and how the specific
properties of an assemblage might be uncovered through the deployment of a key
methodological principle, reflexivity, which has informed the present chapter from start
to finish. In using the idea of assemblages as a frame for undertaking investigative work,
analytic findings would be explicitly situated alongside deconstructions of the processes by
which tools are governed by big data and the processes by which those same tools govern
the generation of empirical findings. In this sense, we may find ourselves in the business of
handling data processes rather than data, and of reading visualisation processes rather than
visualisations. The final result of these processes – the compiled dataset or the visual
representation – is not an object in and of itself, but is better thought of as a way of
demonstrating how an unfolding combination of human and computational research
processes has resulted in a selection of valid and defensible findings.
Bibliography
boyd, d. and K. Crawford (2012) 'Critical Questions for Big Data', Information,
Communication & Society, 15(5), 662-79.
Fielding, N. and R. Lee (2008) 'Qualitative e-Social Science/Cyber-Research' in N. Fielding, R.
Lee and G. Blank (eds.) The Sage Handbook of Online Research Methods. London:
Sage, 491-506.
Hine, C. (2006) 'Virtual Methods and the Sociology of Cyber-Social-Scientific Knowledge' in
C. Hine (ed.) Virtual Methods: Issues in Social Research on the Internet. Oxford: Berg,
1-13.
Keegan, B. (2013) 'C-Level Executives Cry Out for Data Scientists', ComputerWeekly.com,
[ONLINE] Available at:
http://www.computerweekly.com/news/2240205984/C-level-executives-cry-out-for-data-scientists
[Accessed 13 November 2014].
Kitchin, R. (2014) 'Big Data, New Epistemologies and Paradigm Shifts', Big Data and Society,
1, 1-12.
Langlois, G. (2011) 'Meaning, Semiotechnologies and Participatory Media', Culture Machine,
12, 1-27.
Lewis, S. C., R. Zamith and A. Hermida (2013) 'Content Analysis in an Era of Big Data: A Hybrid
Approach to Computational and Manual Methods', Journal of Broadcasting &
Electronic Media, 57(1), 34-52.
Lynch, M. (2000) 'Against Reflexivity as an Academic Virtue and Source of Privileged
Knowledge', Theory, Culture & Society, 17(3), 26-54.
Manovich, L. (2012) 'Trending: The Promises and Challenges of Big Social Data', in M. K. Gold
(ed.) Debates in the Digital Humanities. London: University of Minnesota Press, 460-75.
Marres, N. (2012) 'The Redistribution of Methods: On Intervention in Digital Social Research,
Broadly Conceived', The Sociological Review, 60(S1), 139-65.
Murthy, D. (2013) Twitter. Cambridge: Polity Press.
O'Connor, H., C. Madge, R. Shaw and J. Wellens (2008) 'Internet-Based Interviewing' in N.
Fielding, R. Lee and G. Blank (eds.) The Sage Handbook of Online Research Methods.
London: Sage, 271-89.
Procter, R., F. Vis and A. Voss (2013) 'Reading the Riots on Twitter: Methodological
Innovation for the Analysis of Big Data', International Journal of Social Research
Methodology, 16(3), 197-214.
Rogers, R. (2013) Digital Methods. London: The MIT Press.
Sharma, S. (2013) 'Black Twitter? Racial Hashtags, Networks and Contagion', New
Formations, 78, 46-64.
Tufekci, Z. (2014) 'Big Questions for Social Media Big Data: Representativeness, Validity and
Other Methodological Pitfalls', Proceedings of the Eighth International AAAI
Conference on Weblogs and Social Media, AAAI Publications, 505-14.
1
We take 'big' social media data to refer to volumes of data too large to handle without computational
processing and which are derived from people's everyday usages of social media platforms such as Twitter.
2
The idea that social media analytics requires no specific skill in its practitioners is contestable – for instance,
Keegan (2013) notes that the information technology industry believes itself to be suffering from a lack of
trained data scientists. However, the point remains that there are lots of freely available social media analytics
tools with which investigators from any discipline can explore data, and it is no longer a steadfast requirement
for practitioners to have any significant skills in programming, data mining, data visualisation, and so on.
3
Chorus (see www.chorusanalytics.co.uk for further details) is a free-to-download data collection and visual
analytic software suite dealing with Twitter data for social media analytics. Chorus was developed at Brunel
University by a team including several of the authors of the present chapter (Dr. Tim Cribbin, Dr. Phillip
Brooker and Prof. Julie Barnett). The development of Chorus was supported in part through the MATCH
Programme (UK Engineering and Physical Sciences Research Council grants numbers GR/S29874/01,
EP/F063822/1 and EP/G012393/1).
4
Lynch (2000), however, questions the utility of sociology's concern with self-analysis, arguing that it need only
be applied when something particularly interesting will result (as is the case with his own reflexive approach to
reflexivity and, we hope, the present chapter).
5
A RESTful statement is one which is written in adherence with REST (or Representational State Transfer)
principles, REST being the ubiquitous architectural style that standardises and underlies the world wide web. In
regard specifically to handling the Twitter APIs, RESTful statements are the commands by which API users can
speak to Twitter's servers to request specific slices of data, which are returned in JSON format. Readers
wishing to find out more about using the Twitter APIs and writing API requests manually should start with
Twitter’s own developer documentation, available at: https://dev.twitter.com/docs [accessed: 29 July 2014].
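By way of illustration, a RESTful request of the kind described in this note can be assembled as a plain URL.
The following is a minimal sketch and not Chorus' own code. It assumes the v1.1 Search endpoint as it was
documented around the time of writing (`search/tweets.json`, with `q`, `count` and `result_type` parameters);
the helper name `build_search_request` and the example query are hypothetical, and the authentication that any
real request would need is omitted.

```python
from urllib.parse import urlencode

# Assumed endpoint: the v1.1 Search API as documented around 2014.
SEARCH_ENDPOINT = "https://api.twitter.com/1.1/search/tweets.json"


def build_search_request(query, count=100, result_type="recent"):
    """Return the URL of a RESTful search request (hypothetical helper).

    No request is sent; a real call would also need OAuth credentials.
    """
    params = {"q": query, "count": count, "result_type": result_type}
    # urlencode percent-escapes reserved characters such as '#'
    return SEARCH_ENDPOINT + "?" + urlencode(params)


# Illustrative query only
print(build_search_request("#notracist"))
```

The point of the sketch is that the 'slice' of data requested is fully specified by the parameters in the URL,
so the choices made there are already analytic choices.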
6
See https://dev.twitter.com/rest/public/rate-limiting [accessed: 17 November 2014] for a more detailed
account of the rate limiting Twitter applies to its APIs.
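The practical consequence of rate limiting can be made concrete with some simple arithmetic. The sketch below
assumes illustrative limits of 180 requests per 15-minute window and 100 tweets per request; these figures are
assumptions for the sake of the example and should be checked against the documentation above, and the function
names are hypothetical. It computes the most tweets a collector could retrieve in a given period, and hence the
best possible coverage of a conversation of a given size.

```python
def max_tweets_collected(duration_minutes, requests_per_window=180,
                         window_minutes=15, tweets_per_request=100):
    """Upper bound on tweets a rate-limited collector can retrieve.

    Default limits are illustrative assumptions, not authoritative values.
    """
    # A partially elapsed window still grants its full request allowance,
    # so round the number of windows up (ceiling division on integers).
    windows = -(-duration_minutes // window_minutes)
    return windows * requests_per_window * tweets_per_request


def coverage(tweets_posted, duration_minutes, **limits):
    """Best-case fraction of posted tweets that could be captured."""
    if tweets_posted == 0:
        return 1.0
    ceiling = max_tweets_collected(duration_minutes, **limits)
    return min(1.0, ceiling / tweets_posted)


print(max_tweets_collected(60))   # ceiling for one hour of collection
print(coverage(144_000, 60))      # a busy topic, one hour
```

On these assumed figures, a topic generating 144,000 tweets in an hour could be at most half captured, however
carefully the query is specified.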
7
Other APIs may provide something more like a chronologically complete timeline; for instance, the Twitter
Streaming API pushes ‘real-time’ data matching a query’s criteria. However, the Streaming API only provides a
percentage sample of tweets requested, where the actual percentage is unknowable and dependent on the
volume of tweets requested by the query and concurrent Twitter traffic. Hence, the only way to ensure
the comprehensiveness of a dataset without running into sampling issues is to purchase Twitter ‘Firehose’ data – this
alone politicises access to data to the extent that few can afford to ever see a comprehensive dataset.
8
See http://www.darkmatter101.org/wiki/notracist_twitter [accessed: 21 January 2015] for an informal
account of this project.
9
Though this chapter is not intended as an empirical study of these tweets, interested readers might wish to
note that the 'comedy' hashtags we identified were tweets designed by tweeters to be humorous, whereas
'truth' hashtags were designed to enforce a point that a tweet was 'just a fact' or 'just an observation' and so
on. Our analytic work explored the different practices through which users attempted to justify their claims
that a tweet was not racist by virtue of it being a joke or a statement of fact.
10
This might be likened to filling a glass with water. As with water taking the shape of the glass it is poured
into, the process of collecting and visualising social media data serves to furnish amorphous data with a
structure. However, as much as these technical processes construct data such that they become amenable to
analysis, it can be said that the same processes constrain data into singular readings – a circular glass gives
only a circular shape to the water, but what if other shapes would prove more interesting?
... SNA success can be attributed to the exponential growth and popularity OSN faced [4], with major OSN like Facebook and Instagram (IG) having billions of users [35,58]. Researchers developed a variety of tools for SNA [56]; however, elaborating the quintillion bytes of data generated every day [30] is far from trivial [9]. The computational limitations compel scientists to conduct studies on sub-samples of the population, often introducing bias and reducing the quality of the results [8]. ...
... This was especially necessary for AI-generated content, whose low quality might have invalidated a fair comparison with non-AI content. 9 Similarly, for the engagement plan, two honeypots adopt PLAN 0, two PLAN 1, and three PLAN 2. Table 1 summarizes the 21 honeypots settings. ...
... Passive followers only follow the page, but they do not engage further.9 The effort for the honeypot manager is limited to a quick approval, which could not be necessary with more advanced state-of-the-art models, e.g., DALL-E 2[1] or ChatGPT [52]. ...
Chapter
Social Honeypots are tools deployed in Online Social Networks (OSN) to attract malevolent activities performed by spammers and bots. To this end, their content is designed to be of maximum interest to malicious users. However, by choosing an appropriate content topic, this attractive mechanism could be extended to any OSN users, rather than only luring malicious actors. As a result, honeypots can be used to attract individuals interested in a wide range of topics, from sports and hobbies to more sensitive subjects like political views and conspiracies. With all these individuals gathered in one place, honeypot owners can conduct many analyses, from social to marketing studies.In this work, we introduce a novel concept of social honeypot for attracting OSN users interested in a generic target topic. We propose a framework based on fully-automated content generation strategies and engagement plans to mimic legit Instagram pages. To validate our framework, we created 21 self-managed social honeypots (i.e., pages) on Instagram, covering three topics, four content generation strategies, and three engaging plans. In nine weeks, our honeypots gathered a total of 753 followers, 5387 comments, and 15739 likes. These results demonstrate the validity of our approach, and through statistical analysis, we examine the characteristics of effective social honeypots.KeywordsSocial NetworksSocial HoneypotsInstagramUser ProfilingArtificial IntelligencePrivacy
... However, others have warned that perceiving social media data 'as a reflection of an underlying "social" reinforces the ideology of dataism or the myth of big data' [5]. While an empiricist epistemology may be attractive, it ignores the socio-technical forces involved in the construction of data sets (Brooker, et al., 2016b), the role of theory and human interpretation in extracting meanings in particular patterns and associations Edelmann, et al., 2020), and the importance of context and domain specific knowledge in assessing the significance of particular findings [6]. Scholars [7] advocate for an iterative abductive analytical reasoning in which existing theory, concepts and methodological expertise are used to direct the process of knowledge discovery. ...
... For example, Twitter/now X's conventions structure how people communicate, what they use the platform for, and tweeting practices can serve to re-organize communication, such as the 'folk origins' of the hashtag [15]. I recognise that windows into user behaviours and attitudes are inherently shaped by the sociotechnical affordances of the platform, and social media data we as researchers curate (Brooker, et al., 2016b). Further, as Beneito-Mantagut (2019) observed, 'social media expressions are performative social actions, so not free from observer or group effects ... For instance, publications in Twitter might be profoundly affected by reputation and promotion concerns'. ...
Article
Full-text available
Different aspects of interactions on social media — communication and action — imply distinctive ways of knowing the social world. I present a new methodological approach that utilizes ‘big’ social media data to understand politically salient issues such as the ‘messaging’ of migration on Twitter/now X. An iterative abductive interpretivist analytical strategy drawing on computational and qualitative social science techniques was applied to a corpus of 47,978 tweets created over five months around the time of lifting of temporary controls on free movement from Romania and Bulgaria to the U.K. in January 2014. Initial computational network analysis on the retweet action feature revealed a small number of highly influential users and a large proportion of isolated users (non-elites) who were never retweeted. Given paucity of understanding of how elite narratives on migration are absorbed, accepted or contested by non-elites, the next stage involved qualitative thematic analysis of a sub-sample of actual tweets (communication) from non-elites to understand meaning-making in views expressed. Qualitative analysis confirmed presence of highly polarised immigration attitudes amongst non-elites but also revealed their values and beliefs about national belonging. These findings prompted questions about what or who influences these values amongst non-elites and whether there are any structural differences in information flows amongst anti- and pro-immigration users. In the third stage, computational surface thematic mapping of different aspects of communication and action in the whole corpus revealed importance of the entire media environment but also differences in the presence or lack of echo-chambers amongst those expressing anti- or pro-immigrant sentiments. This article demonstrates the potential of cross-disciplinary analytical strategies when investigating politically salient issues on social media.
... SNA success can be attributed to the exponential growth and popularity OSN faced [5], with major OSN like Facebook and Instagram (IG) having billions of users [6,7]. Researchers developed a variety of tools for SNA [8]; however, elaborating the quintillion bytes of data generated every day [9] is far from trivial [10]. The computational limitations compel scientists to conduct studies on sub-samples of the population, often introducing bias and reducing the quality of the results [11]. ...
... No statistical difference is found for PLAN 2; indeed, only the sponsored content benefited (i.e., a few posts). 10 Last, we analyze the difference between the total amount of followers at the end of weeks 8th and 9th. PLAN 1 honeypots #Followers moved, on average, from 45.7 ± 19.1 of week 8th, to 60.7 ± 26.2 of week 9th, with no statistical difference. ...
Preprint
Full-text available
Social Honeypots are tools deployed in Online Social Networks (OSN) to attract malevolent activities performed by spammers and bots. To this end, their content is designed to be of maximum interest to malicious users. However, by choosing an appropriate content topic, this attractive mechanism could be extended to any OSN users, rather than only luring malicious actors. As a result, honeypots can be used to attract individuals interested in a wide range of topics, from sports and hobbies to more sensitive subjects like political views and conspiracies. With all these individuals gathered in one place, honeypot owners can conduct many analyses, from social to marketing studies. In this work, we introduce a novel concept of social honeypot for attracting OSN users interested in a generic target topic. We propose a framework based on fully-automated content generation strategies and engagement plans to mimic legit Instagram pages. To validate our framework, we created 21 self-managed social honeypots (i.e., pages) on Instagram, covering three topics, four content generation strategies, and three engaging plans. In nine weeks, our honeypots gathered a total of 753 followers, 5387 comments, and 15739 likes. These results demonstrate the validity of our approach, and through statistical analysis, we examine the characteristics of effective social honeypots.
... In western countries, Facebook, Instagram, and Twitter are the most popular, while in China, Weibo is more commonly used (Toivonen et al., 2019). Secondly, computational acquisition approaches could lead to uncertainty in retrieving data (Brooker et al., 2016). Official APIs provided by social media platforms can only access limited datasets and metadata for researchers (Joseph et al., 2014;Toivonen et al., 2019), and little is known about platform APIs' sampling algorithms (Joseph et al., 2014). ...
Article
Despite the growing body of studies on mining visitor perceptions and attitudes of national park visitors using social media data, few research investigated user demographics and its representative issues. This study assessed visitor demographics, including gender, age, racial groups, and origins of visitors in a U.S. national park through their Twitter user profiles, and compared the results to a traditional visitor survey. The results showed similar percentages of gender groups between Twitter user profiles and the traditional survey. However, significant differences existed across all age groups and all racial groups between the two data sources. Compared to the survey, the visitors identified from social media data were younger and from more diverse race groups. The lists of the top 10 states and countries of residency of visitors from the two data sources overlapped but had different orders. The findings indicated that social media data could only be a complementary data source due to its representative issues. The results allow researchers to explore social media users' demographics by advanced social data analytics. However, this study suggests that analyzing Twitter profile information, such as self-reported names and profile photos, requires special attention from researchers even if the data were publicly available. The authors recommend that future research should attend to the representative and private issues of social media data. Management implications: • Social media user profiles can be utilized for predicting users' demographics, such as gender, age, and racial groups. • Social media data can only be a conplementary data source to understand visitor demographics in future research. VISITOR DEMOGRAPHICS AND ORIGIN FROM TWITTER 3 • The ethical issues of social media data, including private domain and machine learning algorithms, need further discussion.
... Despite some notable strengths identified while using DRM as an alternative in social and humanities research, researchers should also be aware that digital data have some limitations, such as unstructured data and the tendency of general data derived (less specific). Thus, they require further efforts for data processing to meet the needs of the research's objectives (Blanke & Prescott, 2016;Brooker et al., 2016). Researchers may also need to spend another time to check the validity of digital data (Blanke & Prescott, 2016;Hutchinson, 2016). ...
Article
Full-text available
Unprecedented COVID-19 global pandemic entails uncertain conditions, which lead people to seek alternative solutions to make activities running accordingly. Limited movements due to travel restriction and health protection confine people’s activities, including the research process. Postponing research activities is arguably not the best solution for scholars, mainly while conducting data collection. In social and humanities research, researchers mostly undertake data collection through field studies and face-to-face communications prior to the pandemic. Social distancing procedures, however, encourage people to avoid close contacts and cancel visit plans. Thus, researchers are necessary to find an alternative method during uncertainty situation. Digital research method (DRM) seems to be a plausible way to keep research ongoing, although the implementation of that method in the developing countries’ research activities, compared to the developed countries, remains lagging behind. This research aims to position DRM amidst the pandemic situation and show possibilities in using this method as an alternative in the developing countries by taking Indonesia as a case study. Results show that to make DRM works in Indonesia, some requirements need to be fulfilled to meet academic standards, and whether relevant or not the research topic applies to be processed utilizing digital method are also essential to be considered.
... Social media has grown in impact and connects with not only peer-to-peer conversations but also with how companies (including platforms) communicate their businesses (Grover, Kar, & Janssen, 2019;Piore, 2001). Social media can be thought of as "a kind of living lab, which enables academics to collect large amounts of data generated in a real-world environment" (Stieglitz, Dang-Xuan, Bruns, & Neuberger, 2014, p. 90) and thereby links to the big data movement (Brooker, Barnett, Cribbin, & Sharma, 2016;Manyika et al., 2011). Along its development, research methodologies to capture and analyse social media posts have emerged, the latter described as social media analytics (SMA). ...
Article
Full-text available
The sharing economy has gained traction in several industry sectors by establishing ever-new platforms, with digital intermediation and peer-to-peer exchanges at the heart of the business model. Most research on the sharing economy concerns the phenomenon level or focuses on the operations of single platforms. This paper connects various sharing economy platforms by asking: How has the sharing economy spread to new platforms? The purpose of the paper is to explain the pattern of spread of the sharing economy business model. Findings point out a seamless, unobtrusive pattern echoing characteristics of the sharing economy business model across distant sectors to avoid competition while reproducing activities in ever-new resource settings. The paper continues the exploration of the sharing economy related to industrial marketing through moving from the individual platforms to the way they lead to new ones while acknowledging how the innovative model for new platforms is highly based on mandates created through acknowledging oneself as a role model successor. Such a spread mechanism redefines innovation newness, adaptation and diffusion, and raises new insights to understand how current business landscapes would be under the possible transition into a new logic of operations.
Book
Full-text available
Métodos digitales, analítica cultural, tecnopolítica, etnografía, digital e inteligencia artificial son solo algunas de las tendencias que conforman el panorama de renovación en la metodología de la investigación social y las humanidades. A partir de ellas, el autor plantea una mirada crítica y un periodo histórico concreto para seleccionar una serie de reflexiones, propuestas y aplicaciones de las formas de indagación que involucran entornos digitales, esquematizadas en cinco hábitats: dataísmo, visualidad, acción, densidades y experimentación. Con discusiones provenientes de la sociología, la antropología y la comunicación, entro otros conjuntos de saberes ya desbordados, se propone además desafiar las particularidades de cada hábitat para encontrar características transversales del nuevo escenario, entre las que destacan la centralidad de los ambientes computacionales. La flexibilidad y reflexividad de las estrategias de indagación, el derrumbe de fronteras – entre lo cuantitativo/cualitativo, lo grande/pequeño y lo extenso/breve – y la indeterminación disciplinar. La mirada general, finalmente, se orienta a imaginar tácticas para dilucidar las implicaciones culturales de fenómenos como la cuantificación, la mediatización algorítmica, la programación, la visualización y la multimodalidad, los datos masivos, los artefactos inteligentes, la mediatización y los procesos de subjetivación, los espacios inmersivos y aumentados, el procesamiento de lenguaje natural, la plataformización, el diseño de prototipos, el uso de software, las prácticas sociodigitales, los laboratorios de medios o las protestas digitales y las multitudes conectadas.
Chapter
Artificial Intelligence (AI) & Machine Learning (ML) is used everywhere 1 in daily life from speech recognition on our phones, targeted marketing strategies 2 & face recognition without the majority of society even realising it. The power of 3 technology and availability of data has exploded in recent years with new and exciting 4 technological advancements being developed every single day. This chapter aims to 5 highlight and critically evaluate the practical and ethical challenges brought about 6 by the advancement of AI and ML as well as identifying any research gaps and 7 suggesting areas where further exploration is required. 8
Article
The paper focuses on a specific form of activism – urban knitting – and analyses “Mettiamoci una pezza” (“Let’s patch it”), an initiative organized by a group of women activists from L’Aquila, Italy, for the 10th anniversary of the earthquake, not only to draw public attention to the state of the city but also to other social and political issues. To analyze the organizational infrastructure of this movement, a qualitative content analysis of the videos produced by the craftivists was conducted, to which was added an analysis of the initiative’s website. The findings show the main characteristics of this collective action, as well as the creation of a particular organizational infrastructure consisting of a network of networks, made possible thanks to digital media. This infrastructure enabled the achievements of outcomes that the organizations could not achieve independently, recomposing a widespread but fragmented activism, based not on a geographical proximity, but on common goals.
Chapter
In times of increasing societal significance of public discourse on social media, questions surrounding the possibilities of access to the data generated are becoming ever more pressing. Using the example of the short message service Twitter, the article describes not only how everyday individual observations can be recorded on social media, but also the changes to which the availability of socially relevant information is subject, if society has only selective access to it and modes of access are being transformed generally: digital data requires for its generation, storage and analysis – especially that of large amounts of data – the technical system of the computer, which selects data according to particular observational interests and allows it to appear in a new context.
Book
Full-text available
In Digital Methods, Richard Rogers proposes a methodological outlook for social and cultural scholarly research on the Web that seeks to move Internet research beyond the study of online culture. It is not a toolkit for Internet research, or operating instructions for a software package; it deals with broader questions. How can we study social media to learn something about society rather than about social media use? Rogers proposes repurposing Web-native techniques for research into cultural change and societal conditions. We can learn to reapply such “methods of the medium” as crawling and crowd sourcing, PageRank and similar algorithms, tag clouds and other visualizations; we can learn how they handle hits, likes, tags, date stamps, and other Web-native objects. By “thinking along” with devices and the objects they handle, digital research methods can follow the evolving methods of the medium. Rogers uses this new methodological outlook to examine such topics as the findings of inquiries into 9/11 search results, the recognition of climate change skeptics by climate-change-related Web sites, and the censorship of the Iranian Web. With Digital Methods, Rogers introduces a new vision and method for Internet research and at the same time applies them to the Web's objects of study, from tiny particles (hyperlinks) to large masses (social media).
Article
Full-text available
This article examines how the availability of Big Data, coupled with new data analytics, challenges established epistemologies across the sciences, social sciences and humanities, and assesses the extent to which they are engendering paradigm shifts across multiple disciplines. In particular, it critically explores new forms of empiricism that declare ‘the end of theory’, the creation of data-driven rather than knowledge-driven science, and the development of digital humanities and computational social sciences that propose radically different ways to make sense of culture, history, economy and society. It is argued that: (1) Big Data and new data analytics are disruptive innovations which are reconfiguring in many instances how research is conducted; and (2) there is an urgent need for wider critical reflection within the academy on the epistemological implications of the unfolding data revolution, a task that has barely begun to be tackled despite the rapid changes in research practices presently taking place. After critically reviewing emerging epistemological positions, it is contended that a potentially fruitful approach would be the development of a situated, reflexive and contextually nuanced epistemology.
Article
Full-text available
This essay foregrounds how technocultural assemblages - software platforms, algorithms, digital networks and affects - are constitutive of online racialized identities. Rather than being concerned with what online identities are in terms of ethno-racial representation and signification, we can explore how they are materialized via the technologies of online platforms. The essay focuses on the micro-blogging site of Twitter and the viral phenomenon of racialized hashtags - dubbed as 'Blacktags' - for example #onlyintheghetto or #ifsantawasblack. The circulation of these racialized hashtags is analyzed as the transmission of contagious meanings and affects, such as anti/racist humour, sentiment and social commentary. Blacktags as contagious digital objects play a role in constituting the 'Black Twitter' identities they articulate and interact with. Beyond conceiving Black Twitter as a group of preconstituted users tweeting racialized hashtags, Blacktags are instrumental in producing networked subjects which have the capacity to multiply the possibilities of being raced online. Thus, ethno-racial collective behaviours on the Twitter social media platform are grasped as emergent aggregations, materialized through the contagious social relations produced by the networked propagation of Blacktags.
Article
Full-text available
Massive datasets of communication are challenging traditional, human-driven approaches to content analysis. Computational methods present enticing solutions to these problems but in many cases are insufficient on their own. We argue that an approach blending computational and manual methods throughout the content analysis process may yield more fruitful results, and draw on a case study of news sourcing on Twitter to illustrate this hybrid approach in action. Careful combinations of computational and manual techniques can preserve the strengths of traditional content analysis, with its systematic rigor and contextual sensitivity, while also maximizing the large-scale capacity of Big Data and the algorithmic accuracy of computational methods.
Article
For social scientists, the widespread adoption of social media presents both an opportunity and a challenge. Data that can shed light on people’s habits, opinions and behaviour is available now on a scale never seen before, but this also means that it is impossible to analyse using conventional methodologies and tools. This article represents an experiment in applying a computationally assisted methodology to the analysis of a large corpus of tweets sent during the August 2011 riots in England.
Article
Large-scale databases of human activity in social media have captured scientific and policy attention, producing a flood of research and discussion. This paper considers methodological and conceptual challenges for this emergent field, with special attention to the validity and representativeness of social media big data analyses. Persistent issues include the over-emphasis of a single platform, Twitter, sampling biases arising from selection by hashtags, and vague and unrepresentative sampling frames. The socio-cultural complexity of user behavior aimed at algorithmic invisibility (such as subtweeting, mock-retweeting, use of "screen captures" for text, etc.) further complicates interpretation of big data social media. Other challenges include accounting for field effects, i.e. broadly consequential events that do not diffuse only through the network under study but affect the whole society. The application of network methods from other fields to the study of human social activity may not always be appropriate. The paper concludes with a call to action on practical steps to improve our analytic capacity in this promising, rapidly-growing field.
Article
The era of Big Data has begun. Computer scientists, physicists, economists, mathematicians, political scientists, bio-informaticists, sociologists, and other scholars are clamoring for access to the massive quantities of information produced by and about people, things, and their interactions. Diverse groups argue about the potential benefits and costs of analyzing genetic sequences, social media interactions, health records, phone logs, government records, and other digital traces left by people. Significant questions emerge. Will large-scale search data help us create better tools, services, and public goods? Or will it usher in a new wave of privacy incursions and invasive marketing? Will data analytics help us understand online communities and political movements? Or will it be used to track protesters and suppress speech? Will it transform how we study human communication and culture, or narrow the palette of research options and alter what ‘research’ means? Given the rise of Big Data as a socio-technical phenomenon, we argue that it is necessary to critically interrogate its assumptions and biases. In this article, we offer six provocations to spark conversations about the issues of Big Data: a cultural, technological, and scholarly phenomenon that rests on the interplay of technology, analysis, and mythology that provokes extensive utopian and dystopian rhetoric.
Article
The term Big Data is applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, currently ranging from a few dozen terabytes to many petabytes of data in a single data set. This chapter addresses some of the theoretical and practical issues raised by the possibility of using massive amounts of social and cultural data in the humanities and social sciences. These observations are based on the author’s own experience working since 2007 with large cultural data sets at the Software Studies Initiative at the University of California, San Diego. The issues discussed include the differences between ‘deep data’ about a few people and ‘surface data’ about many people; getting access to transactional data; and the new “data analysis divide” between data experts and researchers without training in computer science.
Article
This paper contributes to debates about the implications of digital technology for social research by proposing the concept of the redistribution of methods. In the context of digitization, I argue, social research becomes noticeably a distributed accomplishment: online platforms, users, devices and informational practices actively contribute to the performance of digital social research. This also applies more specifically to social research methods, and this paper explores the phenomenon in relation to two specific digital methods, online network and textual analysis, arguing that sociological research stands much to gain from engaging with their distribution, both normatively and analytically speaking. I distinguish four predominant views on the redistribution of digital social methods: methods-as-usual, big methods, virtual methods and digital methods. Taking up this last notion, I propose that a redistributive understanding of social research opens up a new approach to the re-mediation of social methods in digital environments. I develop this argument through a discussion of two particular online research platforms: the Issue Crawler, a web-based platform for hyperlink analysis, and the Co-Word Machine, an online tool of textual analysis currently under development. Both these tools re-mediate existing social methods, and both, I argue, involve the attempt to render specific methodology critiques effective in the online realm, namely critiques of the authority effects implicit in citation analysis. As such, these methods offer ways for social research to intervene critically in digital social research, and more specifically, to endorse and actively pursue the re-distribution of social methods online.
Chapter
This handbook is the first to provide comprehensive, up-to-the-minute coverage of contemporary and developing Internet and online social research methods, spanning both quantitative and qualitative research applications. The editors have brought together leading names in the field of online research to give a thoroughly up-to-date, practical coverage, richly illustrated with examples. The chapters cover both methodological and procedural themes, offering readers a sophisticated treatment of the practice and uses of Internet and online research that is grounded in the principles of research methodology. Beginning with an examination of the significance of the Internet as a research medium, the book goes on to cover research design, data capture, online surveys, virtual ethnography, and the internet as an archival resource, and concludes by looking at potential directions for the future of Internet and online research. The SAGE Handbook of Internet and Online Research Methods will be welcomed by anyone interested in the contemporary practice of computer-mediated research and scholarship. Postgraduates, researchers and methodologists from disciplines across the social sciences will find this an invaluable source of reference.