DRAFT VERSION
boyd, danah and Kate Crawford. (2012). “Critical Questions for Big Data: Provocations for a Cultural,
Technological, and Scholarly Phenomenon.” Information, Communication & Society 15(5), pp. 662-679.
Critical Questions for Big Data: Provocations for a Cultural,
Technological, and Scholarly Phenomenon
danah boyd
Microsoft Research and New York University
dmb@microsoft.com
Kate Crawford
University of New South Wales
k.crawford@unsw.edu.au
Technology is neither good nor bad; nor is it neutral ... technology’s interaction with
the social ecology is such that technical developments frequently have
environmental, social, and human consequences that go far beyond the immediate
purposes of the technical devices and practices themselves.
Melvin Kranzberg (1986, p. 545)
We need to open a discourse – where there is no effective discourse now – about
the varying temporalities, spatialities and materialities that we might represent in
our databases, with a view to designing for maximum flexibility and allowing as
possible for an emergent polyphony and polychrony. Raw data is both an oxymoron
and a bad idea; to the contrary, data should be cooked with care.
Geoffrey Bowker (2005, p. 183-184)
The era of Big Data is underway. Computer scientists, physicists, economists,
mathematicians, political scientists, bio-informaticists, sociologists, and other scholars
are clamoring for access to the massive quantities of information produced by and about
people, things, and their interactions. Diverse groups argue about the potential benefits
and costs of analyzing genetic sequences, social media interactions, health records, phone
logs, government records, and other digital traces left by people. Significant questions
emerge. Will large-scale search data help us create better tools, services, and public
goods? Or will it usher in a new wave of privacy incursions and invasive marketing?
Will data analytics help us understand online communities and political movements? Or
will analytics be used to track protesters and suppress speech? Will large quantities of
data transform how we study human communication and culture, or narrow the palette of
research options and alter what ‘research’ means?
Big Data is, in many ways, a poor term. As Lev Manovich (2011) observes, it has been
used in the sciences to refer to data sets large enough to require supercomputers, but what
once required such machines can now be analyzed on desktop computers with standard
software. There is little doubt that the quantities of data now available are often quite
large, but that is not the defining characteristic of this new data ecosystem. In fact, some
of the data encompassed by Big Data (e.g., all Twitter messages about a particular topic)
are not nearly as large as earlier data sets that were not considered Big Data (e.g., census
data). Big Data is less about data that is big than it is about a capacity to search,
aggregate, and cross-reference large data sets.
We define Big Data1 as a cultural, technological, and scholarly phenomenon that rests on
the interplay of:
1) Technology: maximizing computation power and algorithmic accuracy to gather,
analyze, link, and compare large data sets.
2) Analysis: drawing on large data sets to identify patterns in order to make
economic, social, technical, and legal claims.
3) Mythology: the widespread belief that large data sets offer a higher form of
intelligence and knowledge that can generate insights that were previously
impossible, with the aura of truth, objectivity, and accuracy.
Like other socio-technical phenomena, Big Data triggers both utopian and dystopian
rhetoric. On one hand, Big Data is seen as a powerful tool to address various societal ills,
offering the potential of new insights into areas as diverse as cancer research, terrorism,
and climate change. On the other, Big Data is seen as a troubling manifestation of Big
Brother, enabling invasions of privacy, decreased civil freedoms, and increased state and
corporate control. As with all socio-technical phenomena, the currents of hope and fear
often obscure the more nuanced and subtle shifts that are underway.
Computerized databases are not new. The U.S. Bureau of the Census deployed the
world’s first automated processing equipment in 1890 – the punch-card machine
(Anderson 1988). Relational databases emerged in the 1960s (Fry and Sibley 1974).
Personal computing and the internet have made it possible for a wider range of people –
including scholars, marketers, governmental agencies, educational institutions, and
motivated individuals – to produce, share, interact with, and organize data. This has
resulted in what Mike Savage and Roger Burrows (2007) describe as a crisis in empirical
sociology. Data sets that were once obscure and difficult to manage – and, thus, only of
interest to social scientists – are now being aggregated and made easily accessible to
anyone who is curious, regardless of their training.
1 We have chosen to capitalize the term “Big Data” throughout this article to make it clear that it is the
phenomenon we are discussing.
How we handle the emergence of an era of Big Data is critical. While the phenomenon is
taking place in an environment of uncertainty and rapid change, current decisions will
shape the future. With the increased automation of data collection and analysis – as well
as algorithms that can extract and illustrate large-scale patterns in human behavior – it is
necessary to ask which systems are driving these practices, and which are regulating
them. Lawrence Lessig (1999) argues that social systems are regulated by four forces:
market, law, social norms, and architecture – or, in the case of technology, code. When it
comes to Big Data, these four forces are frequently at odds. The market sees Big Data as
pure opportunity: marketers use it to target advertising, insurance providers use it to
optimize their offerings, and Wall Street bankers use it to read the market. Legislation has
already been proposed to curb the collection and retention of data, usually over concerns
about privacy (e.g., the U.S. Do Not Track Online Act of 2011). Features like
personalization allow rapid access to more relevant information, but they present difficult
ethical questions and fragment the public in troubling ways (Pariser 2011).
There are some significant and insightful studies currently being done that involve Big
Data, but it is still necessary to ask critical questions about what all this data means, who
gets access to what data, how data analysis is deployed, and to what ends. In this article,
we offer six provocations to spark conversations about the issues of Big Data. We are
social scientists and media studies scholars who are in regular conversation with
computer scientists and informatics experts. The questions that we ask are hard ones
without easy answers, although we also describe different pitfalls that may seem obvious
to social scientists but are often surprising to those from different disciplines. Due to our
interest in and experience with social media, our focus here is mainly on Big Data in the
social media context. That said, we believe that the questions we are asking are also
important to those in other fields. We also recognize that the questions we are asking are
just the beginning and we hope that this article will spark others to question the
assumptions embedded in Big Data. Researchers in all areas – including computer
science, business, and medicine – have a stake in the computational culture of Big Data
precisely because of its extended reach of influence and potential within multiple
disciplines. We believe that it is time to start critically interrogating this phenomenon, its
assumptions, and its biases.
1. Big Data Changes the Definition of Knowledge
In the early decades of the 20th century, Henry Ford devised a manufacturing system of
mass production, using specialized machinery and standardized products. It quickly
became the dominant vision of technological progress. ‘Fordism’ meant automation and
assembly lines; for decades onward, this became the orthodoxy of manufacturing: out
with skilled craftspeople and slow work, in with a new machine-made era (Baca 2004).
But it was more than just a new set of tools. The 20th century was marked by Fordism at
a cellular level: it produced a new understanding of labor, the human relationship to
work, and society at large.
Big Data not only refers to very large data sets and the tools and procedures used to
manipulate and analyze them, but also to a computational turn in thought and research
(Burkholder 1992). Just as Ford changed the way we made cars – and then transformed
work itself – Big Data has emerged as a system of knowledge that is already changing the
objects of knowledge, while also having the power to inform how we understand human
networks and community. ‘Change the instruments, and you will change the entire social
theory that goes with them,’ Latour reminds us (2009, p. 9).
Big Data creates a radical shift in how we think about research. Commenting on
computational social science, Lazer et al. argue that it offers ‘the capacity to collect and
analyze data with an unprecedented breadth and depth and scale’ (2009, p. 722). It is not
just a matter of scale, nor is it enough to consider it in terms of proximity, or what Moretti
(2007) refers to as distant or close analysis of texts. Rather, it is a profound change at the
levels of epistemology and ethics. Big Data reframes key questions about the constitution
of knowledge, the processes of research, how we should engage with information, and the
nature and the categorization of reality. Just as du Gay and Pryke note that ‘accounting
tools...do not simply aid the measurement of economic activity, they shape the reality
they measure’ (2002, pp. 12-13), so Big Data stakes out new terrains of objects, methods
of knowing, and definitions of social life.
Speaking in praise of what he terms ‘The Petabyte Age’, Chris Anderson, Editor-in-Chief
of Wired, writes:
This is a world where massive amounts of data and applied mathematics replace
every other tool that might be brought to bear. Out with every theory of human
behavior, from linguistics to sociology. Forget taxonomy, ontology, and
psychology. Who knows why people do what they do? The point is they do it, and
we can track and measure it with unprecedented fidelity. With enough data, the
numbers speak for themselves. (2008)
Do numbers speak for themselves? We believe the answer is ‘no’. Significantly,
Anderson’s sweeping dismissal of all other theories and disciplines is a tell: it reveals an
arrogant undercurrent in many Big Data debates where other forms of analysis are too
easily sidelined. Other methods for ascertaining why people do things, write things, or
make things are lost in the sheer volume of numbers. This is not a space that has been
welcoming to older forms of intellectual craft. As David Berry (2011, p. 8) writes, Big
Data provides ‘destabilising amounts of knowledge and information that lack the
regulating force of philosophy.’ Instead of philosophy – which Kant saw as the rational
basis for all institutions – ‘computationality might then be understood as an ontotheology,
creating a new ontological “epoch” as a new historical constellation of intelligibility’
(Berry 2011, p. 12).
We must ask difficult questions of Big Data’s models of intelligibility before they
crystallize into new orthodoxies. If we return to Ford, his innovation was using the
assembly line to break down interconnected, holistic tasks into simple, atomized,
mechanistic ones. He did this by designing specialized tools that strongly predetermined
and limited the action of the worker. Similarly, the specialized tools of Big Data also
have their own inbuilt limitations and restrictions. For example, Twitter and Facebook are
examples of Big Data sources that offer very poor archiving and search functions.
Consequently, researchers are much more likely to focus on something in the present or
immediate past – tracking reactions to an election, TV finale or natural disaster – because
of the sheer difficulty or impossibility of accessing older data.
If we are observing the automation of particular kinds of research functions, then we
must consider the inbuilt flaws of the machine tools. It is not enough simply to ask, as
Anderson has suggested, ‘what can science learn from Google?’; we must also ask how the
harvesters of Big Data might change the meaning of learning, and what new possibilities
and new limitations may come with these systems of knowing.
2. Claims to Objectivity and Accuracy are Misleading
‘Numbers, numbers, numbers,’ writes Latour (2010). ‘Sociology has been obsessed by
the goal of becoming a quantitative science.’ Sociology has never reached this goal, in
Latour’s view, because of where it draws the line between what is and is not quantifiable
knowledge in the social domain.
Big Data offers the humanistic disciplines a new way to claim the status of quantitative
science and objective method. It makes many more social spaces quantifiable. In reality,
working with Big Data is still subjective, and what it quantifies does not necessarily have
a closer claim on objective truth – particularly when considering messages from social
media sites. But there remains a mistaken belief that qualitative researchers are in the
business of interpreting stories and quantitative researchers are in the business
of producing facts. In this way, Big Data risks reinscribing established divisions in the
long running debates about scientific method and the legitimacy of social science and
humanistic inquiry.
The notion of objectivity has been a central question for the philosophy of science and
early debates about the scientific method (Durkheim 1895). Claims to objectivity suggest
an adherence to the sphere of objects, to things as they exist in and for themselves.
Subjectivity, on the other hand, is viewed with suspicion, colored as it is with various
forms of individual and social conditioning. The scientific method attempts to remove
itself from the subjective domain through the application of a dispassionate process
whereby hypotheses are proposed and tested, eventually resulting in improvements in
knowledge. Nonetheless, claims to objectivity are necessarily made by subjects and are
based on subjective observations and choices.
All researchers are interpreters of data. As Lisa Gitelman (2011) observes, data needs to
be imagined as data in the first instance, and this process of the imagination of data
entails an interpretative base: ‘every discipline and disciplinary institution has its own
norms and standards for the imagination of data.’ As computational scientists have
started engaging in acts of social science, there is a tendency to claim their work as the
business of facts and not interpretation. A model may be mathematically sound, an
experiment may seem valid, but as soon as a researcher seeks to understand what it
means, the process of interpretation has begun. This is not to say that all interpretations
are created equal, but rather that not all numbers are neutral.
The design decisions that determine what will be measured also stem from interpretation.
For example, in the case of social media data, there is a ‘data cleaning’ process: making
decisions about what attributes and variables will be counted, and which will be ignored.
This process is inherently subjective. As Bollier explains,
As a large mass of raw information, Big Data is not self-explanatory. And yet the
specific methodologies for interpreting the data are open to all sorts of
philosophical debate. Can the data represent an ‘objective truth’ or is any
interpretation necessarily biased by some subjective filter or the way that data is
‘cleaned?’ (2010, p. 13)
In addition to this question, there is the issue of data errors. Large data sets from Internet
sources are often unreliable, prone to outages and losses, and these errors and gaps are
magnified when multiple data sets are used together. Social scientists have a long history
of asking critical questions about the collection of data and trying to account for any
biases in their data (Cain & Finch 1981; Clifford & Marcus 1986). This requires
understanding the properties and limits of a dataset, regardless of its size. A dataset may
have many millions of pieces of data, but this does not mean it is random or
representative. To make statistical claims about a dataset, we need to know where data is
coming from; it is similarly important to know and account for the weaknesses in that
data. Furthermore, researchers must be able to account for the biases in their
interpretation of the data. To do so requires recognizing that one’s identity and
perspective informs one’s analysis (Behar & Gordon 1996).
Too often, Big Data enables the practice of apophenia: seeing patterns where none
actually exist, simply because enormous quantities of data can offer connections that
radiate in all directions. In one notable example, David Leinweber demonstrated that data
mining techniques could show a strong but spurious correlation between the changes in
the S&P 500 stock index and butter production in Bangladesh (2007).
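The mechanics of apophenia are easy to reproduce. In the sketch below – illustrative only; the series lengths, candidate count, and seed are our own assumptions, not Leinweber’s data or method – correlating one random walk against thousands of equally random candidates all but guarantees a ‘strong’ match:

```python
import numpy as np

rng = np.random.default_rng(0)

# One hypothetical 'index' series and 3,000 unrelated candidate series.
# All are random walks with no causal connection to one another.
index = np.cumsum(rng.normal(size=250))
candidates = np.cumsum(rng.normal(size=(3000, 250)), axis=1)

# Correlate the index with every candidate and keep the strongest match.
corrs = np.array([np.corrcoef(index, c)[0, 1] for c in candidates])
strongest = np.abs(corrs).max()

# Search enough unrelated data and a 'pattern' appears by chance alone.
print(f"strongest spurious correlation: {strongest:.2f}")
```

With this many candidates the best match is typically very high even though every pairing is meaningless, which is why such finds should be treated as hypotheses to test on fresh data rather than as results.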
Interpretation is at the center of data analysis. Regardless of the size of a dataset, it is subject
to limitation and bias. Without those biases and limitations being understood and
outlined, misinterpretation is the result. Data analysis is most effective when researchers
take account of the complex methodological processes that underlie the analysis of that
data.
3. Bigger Data are Not Always Better Data
Social scientists have long argued that what makes their work rigorous is rooted in their
systematic approach to data collection and analysis (McClosky 1985). Ethnographers
focus on reflexively accounting for bias in their interpretations. Experimentalists control
and standardize the design of their experiment. Survey researchers drill down on
sampling mechanisms and question bias. Quantitative researchers weigh up statistical
significance. These are but a few of the ways in which social scientists try to assess the
validity of each other’s work. Just because Big Data presents us with large quantities of
data does not mean that methodological issues are no longer relevant. Understanding
sample, for example, is more important now than ever.
Twitter provides an example in the context of a statistical analysis. Because it is easy to
obtain – or scrape – Twitter data, scholars have used Twitter to examine a wide variety of
patterns (e.g., mood rhythms [Golder & Macy 2011], media event engagement [Shamma,
Kennedy & Churchill 2010], political uprisings [Lotan et al. 2011], and conversational
interactions [Wu et al. 2011]). While many scholars are conscientious about discussing
the limitations of Twitter data in their publications, the public discourse around such
research tends to focus on the raw number of tweets available. Even news coverage of
scholarship tends to focus on how many millions of ‘people’ were studied (e.g., Wang
2011).
Twitter does not represent ‘all people’, and it is an error to assume ‘people’ and ‘Twitter
users’ are synonymous: they are a very particular sub-set. Neither is the population using
Twitter representative of the global population. Nor can we assume that accounts and
users are equivalent. Some users have multiple accounts, while some accounts are used
by multiple people. Some people never establish an account, and simply access Twitter
via the web. Some accounts are ‘bots’ that produce automated content without directly
involving a person. Furthermore, the notion of an ‘active’ account is problematic. While
some users post content frequently through Twitter, others participate as ‘listeners’
(Crawford 2009, p. 532). Twitter Inc. has revealed that 40 percent of active users sign in
just to listen (Twitter 2011). The very meanings of ‘user’ and ‘participation’ and ‘active’
need to be critically examined.
Big Data and whole data are also not the same. Without taking into account the sample of
a dataset, the size of the dataset is meaningless. For example, a researcher may seek to
understand the topical frequency of tweets, yet if Twitter removes all tweets that contain
problematic words or content – such as references to pornography or spam – from the
stream, the topical frequency would be inaccurate. Regardless of the number of tweets, it
is not a representative sample as the data is skewed from the beginning.
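A toy simulation makes the point concrete. The topic labels, proportions, and the filter itself below are entirely hypothetical, but they show how a platform-side removal skews observed frequencies before any researcher touches the data:

```python
from collections import Counter

# A hypothetical 'true' population of 10,000 messages by topic.
population = ["news"] * 5000 + ["sports"] * 3000 + ["adult"] * 2000

# The platform silently drops one topic before exposing the stream.
visible = [msg for msg in population if msg != "adult"]

true_share = Counter(population)["news"] / len(population)
sample_share = Counter(visible)["news"] / len(visible)

# 'news' is 50% of all messages, but 62.5% of the filtered stream:
# collecting more of the filtered stream cannot correct a skew
# that is baked in at the source.
print(true_share, sample_share)
```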
It is also hard to understand the sample when the source is uncertain. Twitter Inc. makes
a fraction of its material available to the public through its APIs2. The ‘firehose’
theoretically contains all public tweets ever posted and explicitly excludes any tweet that
a user chose to make private or ‘protected.’ Yet, some publicly accessible tweets are also
missing from the firehose. Although a handful of companies have access to the firehose,
very few researchers have this level of access. Most either have access to a ‘gardenhose’
(roughly 10% of public tweets), a ‘spritzer’ (roughly 1% of public tweets), or have used
‘white-listed’ accounts where they could use the APIs to get access to different subsets of
content from the public stream.3 It is not clear what tweets are included in these different
data streams or what sampling them represents. It could be that the API pulls a random
sample of tweets, or that it pulls the first few thousand tweets per hour, or that it only
pulls tweets from a particular segment of the network graph. Without knowing, it is
difficult for researchers to make claims about the quality of the data that they are
analyzing. Is the data representative of all tweets? No, because it excludes tweets from
protected accounts.4 But is the data representative of all public tweets? Perhaps, but not
necessarily.
2 API stands for application programming interface; this refers to a set of tools that developers can use to
access structured data.
Twitter has become a popular source for mining Big Data, but working with Twitter data
has serious methodological challenges that are rarely addressed by those who embrace it.
When researchers approach a dataset, they need to understand – and publicly account for
– not only the limits of the dataset, but also the limits of which questions they can ask of
a dataset and what interpretations are appropriate.
3 Details of what Twitter provides can be found at https://dev.twitter.com/docs/streaming-api/methods
White-listed accounts were commonly used by researchers, but they are no longer available.
4 The percentage of protected accounts is unknown, although attempts to identify protected accounts
suggest that under 10% of accounts are protected (Meeder et al. 2010).
This is especially true when researchers combine multiple large datasets. This does not
mean that combining data doesn’t offer valuable insights – studies like those by
Alessandro Acquisti and Ralph Gross (2009) are powerful, as they reveal how public
databases can be combined to produce serious privacy violations, such as revealing an
individual’s Social Security number. Yet, as Jesper Anderson, co-founder of open
financial data store FreeRisk, explains: combining data from multiple sources creates
unique challenges. ‘Every one of those sources is error-prone…I think we are just
magnifying that problem [when we combine multiple data sets]’ (Bollier 2010, p. 13).
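Anderson’s magnification point can be put arithmetically. Assuming (hypothetically) independent errors, a record assembled from two sources is wrong whenever either source is wrong, so per-source error rates compound rather than average:

```python
# Hypothetical per-source error rates for two 'mostly reliable' datasets.
p_err_a = 0.05
p_err_b = 0.08

# A joined record is correct only if BOTH sources are correct, so the
# combined error rate exceeds either source's rate on its own.
p_err_joined = 1 - (1 - p_err_a) * (1 - p_err_b)
print(round(p_err_joined, 3))
```

Each additional source multiplies in another (1 − p) factor, which is why joining many ‘pretty good’ datasets can yield a rather poor one.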
Finally, during this computational turn, it is increasingly important to recognize the value
of ‘small data’. Research insights can be found at any level, including at very modest
scales. In some cases, focusing just on a single individual can be extraordinarily valuable.
Take, for example, the work of Tiffany Veinot (2007), who followed one worker – a vault
inspector at a hydroelectric utility company – in order to understand the information
practices of blue-collar workers. In doing this unusual study, Veinot reframed the
definition of ‘information practices’ away from the usual focus on early-adopter, white-
collar workers, to spaces outside of offices and urban contexts. Her work tells a story
that could not be discovered by farming millions of Facebook or Twitter accounts, and
contributes to the research field in a significant way, despite the smallest possible
participant count. The size of data should fit the research question being asked; in some
cases, small is best.
4. Taken Out of Context, Big Data Loses its Meaning
Because large data sets can be modeled, data is often reduced to what can fit into a
mathematical model. Yet, taken out of context, data lose meaning and value. The rise of
social network sites prompted an industry-driven obsession with the ‘social graph’.
Thousands of researchers have flocked to Twitter and Facebook and other social media to
analyze connections between messages and accounts, making claims about social
networks. Yet, the relations displayed through social media are not necessarily equivalent
to the sociograms and kinship networks that sociologists and anthropologists have been
investigating since the 1930s (Radcliffe-Brown 1940; Freeman 2006). The ability to
represent relationships between people as a graph does not mean that they convey
equivalent information.
Historically, sociologists and anthropologists collected data about people’s relationships
through surveys, interviews, observations, and experiments. Using this data, they
focused on describing people’s ‘personal networks’ – the set of relationships that
individuals develop and maintain (Fischer 1982). These connections were evaluated
based on a series of measures developed over time to identify personal connections. Big
Data introduces two new popular types of social networks derived from data traces:
‘articulated networks’ and ‘behavioral networks.’
Articulated networks are those that result from people specifying their contacts through
technical mechanisms like email or cell phone address books, instant messaging buddy
lists, ‘Friends’ lists on social network sites, and ‘Follower’ lists on other social media
genres. The motivations that people have for adding someone to each of these lists vary
widely, but the result is that these lists can include friends, colleagues, acquaintances,
celebrities, friends-of-friends, public figures, and interesting strangers.
Behavioral networks are derived from communication patterns, cell coordinates, and
social media interactions (Meiss et al. 2008; Onnela et al. 2007). These might include
people who text message one another, those who are tagged in photos together on
Facebook, people who email one another, and people who are physically in the same
space, at least according to their cell phone.
Both behavioral and articulated networks have great value to researchers, but they are not
equivalent to personal networks. For example, although contested, the concept of ‘tie
strength’ is understood to indicate the importance of individual relationships (Granovetter
1973). When mobile phone data suggests that workers spend more time with colleagues
than their spouse, this does not necessarily imply that colleagues are more important than
spouses. Measuring tie strength through frequency or public articulation is a common
mistake: tie strength – and many of the theories built around it – is a subtle reckoning in
how people understand and value their relationships with other people. Not every
connection is equivalent to every other connection, and neither does frequency of contact
indicate strength of relationship. Further, the absence of a connection does not
necessarily indicate that a relationship should be made.
Data is not generic. There is value to analyzing data abstractions, yet retaining context
remains critical, particularly for certain lines of inquiry. Context is hard to interpret at
scale and even harder to maintain when data is reduced to fit into a model. Managing
context in light of Big Data will be an ongoing challenge.
5. Just Because it is Accessible Doesn’t Make it Ethical
In 2006, a Harvard-based research group started gathering the profiles of 1,700 college-
based Facebook users to study how their interests and friendships changed over time
(Lewis et al. 2008). This supposedly anonymous data was released to the world, allowing
other researchers to explore and analyze it. What other researchers quickly discovered
was that it was possible to de-anonymize parts of the dataset: compromising the privacy
of students, none of whom were aware their data was being collected (Zimmer 2008).
The case made headlines and raised difficult issues for scholars: what is the status of so-
called ‘public’ data on social media sites? Can it simply be used, without requesting
permission? What constitutes best ethical practice for researchers? Privacy campaigners
already see this as a key battleground where better privacy protections are needed. The
difficulty is that privacy breaches are hard to make specific – is there damage done at the
time? What about twenty years hence? ‘Any data on human subjects inevitably raise
privacy issues, and the real risks of abuse of such data are difficult to quantify’ (Nature,
cited in Berry 2011).
DRAFT VERSION
boyd, danah and Kate Crawford. (2012). “Critical Questions for Big Data: Provocations for a Cultural,
Technological, and Scholarly Phenomenon.” Information, Communication, & Society 15:5, p. 662-679.
Institutional Review Boards (IRBs) – and other research ethics committees – emerged in
the 1970s to oversee research on human subjects. While unquestionably problematic in
implementation (Schrag 2010), the goal of IRBs is to provide a framework for evaluating
the ethics of a particular line of research inquiry and to make certain that checks and
balances are put into place to protect subjects. Practices like ‘informed consent’ and
protecting the privacy of informants are intended to empower participants in light of
earlier abuses in the medical and social sciences (Blass 2004; Reverby 2009). Although
IRBs cannot always predict the harm of a particular study – and, all too often, prevent
researchers from doing research on grounds other than ethics – their value is in prompting
researchers to think critically about the ethics of their project.
Very little is understood about the ethical implications underpinning the Big Data
phenomenon. Should someone be included as a part of a large aggregate of data? What
if someone’s ‘public’ blog post is taken out of context and analyzed in a way that the
author never imagined? What does it mean for someone to be spotlighted or to be
analyzed without knowing it? Who is responsible for making certain that individuals and
communities are not hurt by the research process? What does informed consent look like?
It may be unreasonable to ask researchers to obtain consent from every person who posts
a tweet, but it is problematic for researchers to justify their actions as ethical simply
because the data is accessible. Just because content is publicly accessible doesn’t mean
that it was meant to be consumed by just anyone. There are serious issues involved in the
ethics of online data collection and analysis (Ess 2002). The process of evaluating the
research ethics cannot be ignored simply because the data is seemingly public.
Researchers must keep asking themselves – and their colleagues – about the ethics of
their data collection, analysis, and publication.
In order to act ethically, it is important that researchers reflect on the importance of
accountability: both to the field of research and to the research subjects. Accountability
here is used as a broader concept than privacy, as Troshynski et al. (2008) have outlined,
where the concept of accountability can apply even when conventional expectations of
privacy aren’t in question. Instead, accountability is a multi-directional relationship: there
may be accountability to superiors, to colleagues, to participants and to the public
(Dourish & Bell 2011). Academic scholars are held to specific professional standards
when working with human participants in order to protect informants’ rights and well-
being. However, many ethics boards do not understand the processes of mining and
anonymizing Big Data, let alone the errors that can cause data to become personally
identifiable. Accountability requires rigorous thinking about the ramifications of Big
Data, rather than assuming that ethics boards will necessarily do the work of ensuring
people are protected.
There are also significant questions of truth, control and power in Big Data studies:
researchers have the tools and the access, while social media users as a whole do not.
Their data was created in highly context-sensitive spaces, and it is entirely possible that
some users would not give permission for their data to be used elsewhere. Many are not
aware of the multiplicity of agents and algorithms currently gathering and storing their
data for future use. Researchers are rarely in a user’s imagined audience. Users are not
necessarily aware of all the multiple uses, profits and other gains that come from
information they have posted. Data may be public (or semi-public) but this does not
simplistically equate with full permission being given for all uses. Big Data researchers
rarely acknowledge that there is a considerable difference between being in public (i.e.,
sitting in a park) and being public (i.e., actively courting attention) (boyd & Marwick
2011).
6. Limited Access to Big Data Creates New Digital Divides
In an essay on Big Data, Scott Golder (2010) quotes sociologist George Homans
(1974): ‘The methods of social science are dear in time and money and getting dearer
every day.’ Historically speaking, collecting data has been hard, time consuming, and
resource intensive. Much of the enthusiasm surrounding Big Data stems from the
perception that it offers easy access to massive amounts of data.
But who gets access? For what purposes? In what contexts? And with what constraints?
While the explosion of research using data sets from social media sources would suggest
that access is straightforward, it is anything but. As Lev Manovich (2011) points out,
‘only social media companies have access to really large social data - especially
transactional data. An anthropologist working for Facebook or a sociologist working for
Google will have access to data that the rest of the scholarly community will not.’ Some
companies restrict access to their data entirely; others sell the privilege of access for a fee;
and others offer small data sets to university-based researchers. This produces
considerable unevenness in the system: those with money – or those inside the company
– can produce a different type of research than those outside. Those without access can
neither reproduce nor evaluate the methodological claims of those who have privileged
access.
It is also important to recognize that the class of the Big Data rich is reinforced through
the university system: top-tier, well-resourced universities will be able to buy access to
data, and students from the top universities are the ones most likely to be invited to work
within large social media companies. Those from the periphery are less likely to get those
invitations and develop their skills. The result is that the divisions between scholars will
widen significantly.
In addition to questions of access, there are questions of skills. Wrangling APIs, scraping
and analyzing big swathes of data is a skill set generally restricted to those with a
computational background. When computational skills are positioned as the most
valuable, questions emerge over who is advantaged and who is disadvantaged in such a
context. This, in its own way, sets up new hierarchies around ‘who can read the
numbers’, rather than recognizing that computer scientists and social scientists both have
valuable perspectives to offer. Significantly, this is also a gendered division. Most
researchers who have computational skills at the present moment are male and, as
feminist historians and philosophers of science have demonstrated, who is asking the
questions determines which questions are asked (Forsythe 2001; Harding 2010). There
are complex questions about what kinds of research skills will be valued in the future and
how those skills are taught. How can students be educated so that they are as
comfortable with algorithms and data analysis as with social analysis and theory?
Finally, the difficulty and expense of gaining access to Big Data produces a restricted
culture of research findings. Large data companies have no responsibility to make their
data available, and they have total control over who gets to see it. Big Data researchers
with access to proprietary data sets are less likely to choose questions that are contentious
to a social media company if they think it may result in their access being cut. The
chilling effects on the kinds of research questions that can be asked – in public or private –
are something we all need to consider when assessing the future of Big Data.
The current ecosystem around Big Data creates a new kind of digital divide: the Big Data
rich and the Big Data poor. Some company researchers have even gone so far as to
suggest that academics shouldn’t bother studying social media data sets – Jimmy Lin, a
professor on industrial sabbatical at Twitter, argued that academics should not engage in
research that industry 'can do better' (see Conover 2011). Such explicit efforts to
demarcate research ‘insiders’ and ‘outsiders’ – while by no means new – undermine the
research community. ‘Effective democratisation can always be measured by this
essential criterion,’ Derrida claimed, ‘the participation in and access to the archive, its
constitution, and its interpretation’ (1996, p. 4).
Whenever inequalities are explicitly written into the system, they produce class-based
structures. Manovich writes of three classes of people in the realm of Big Data: ‘those
who create data (both consciously and by leaving digital footprints), those who have the
means to collect it, and those who have expertise to analyze it’ (2011). We know that the
last group is the smallest, and the most privileged: they are also the ones who get to
determine the rules about how Big Data will be used, and who gets to participate. While
institutional inequalities may be a foregone conclusion in academia, they should
nevertheless be examined and questioned. They produce a bias in the data and the types
of research that emerge.
To argue that the Big Data phenomenon is implicated in some broad historical and
philosophical shifts is not to suggest that it is solely accountable; the academy is by no means
the sole driver behind the computational turn. There is a deep government and industrial
drive toward gathering and extracting maximal value from data, be it information that
will lead to more targeted advertising, product design, traffic planning, or criminal
policing. But we do think there are serious and wide-ranging implications for the
operationalization of Big Data, and what it will mean for future research agendas. As
Lucy Suchman (2011) observes, via Lévi-Strauss, ‘we are our tools.’ We should consider
how the tools participate in shaping the world with us as we use them. The era of Big
Data has only just begun, but it is already important that we start questioning the
assumptions, values, and biases of this new wave of research. As scholars who are
invested in the production of knowledge, such interrogations are an essential component
of what we do.
Acknowledgements
We wish to thank Heather Casteel for her help in preparing this article. We are also
deeply grateful to Eytan Adar, Tarleton Gillespie, Bernie Hogan, Mor Naaman, Jussi
Parikka, Christian Sandvig, and all the members of the Microsoft Research Social Media
Collective for inspiring conversations, suggestions, and feedback. We are indebted to all
who provided feedback at the Oxford Internet Institute’s 10th Anniversary. Finally, we
appreciate the anonymous reviewers’ helpful comments.
References
Acquisti, A. & Gross, R. (2009) ‘Predicting Social Security Numbers from Public Data’,
Proceedings of the National Academy of Sciences, vol. 106, no. 27, pp. 10975-10980.
Anderson, C. (2008) ‘The End of Theory: Will the Data Deluge Make the Scientific
Method Obsolete?’, Edge. [online] Available at:
http://www.edge.org/3rd_culture/anderson08/anderson08_index.html (25 July 2011).
Anderson, M. (1988) The American Census: A Social History. Yale University Press,
New Haven, Conn.
Baca, G. (2004) ‘Legends of Fordism: Between Myth, History, and Foregone
Conclusions’, Social Analysis, vol. 48, no. 3, pp. 169-178.
Barry, A. and Born, G. (2012) Interdisciplinarity: Reconfigurations of the
Social and Natural Sciences. Taylor and Francis, London.
Behar, R. and Gordon, D. A., eds. (1996) Women Writing Culture. University of
California Press, Berkeley, California.
Berry, D. (2011) ‘The Computational Turn: Thinking About the Digital Humanities’,
Culture Machine. vol 12. [online] Available at:
http://www.culturemachine.net/index.php/cm/article/view/440/470 (11 July 2011).
Blass, T. (2004) The Man Who Shocked the World: The Life and Legacy of Stanley
Milgram. Basic Books, New York, New York.
Bollier, D. (2010) ‘The promise and peril of big data’, [online] Available at:
http://www.aspeninstitute.org/sites/default/files/content/docs/pubs/The_Promise_and_Peril_of_Big_Data.pdf
(11 July 2011).
Bowker, G. C. (2005) Memory Practices in the Sciences. MIT Press, Cambridge,
Massachusetts.
boyd, d. and Marwick, A. (2011) ‘Social Privacy in Networked Publics: Teens’ Attitudes,
Practices, and Strategies,’ paper given at Oxford Internet Institute. [online] Available at:
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1925128 (28 September 2011).
Burkholder, L, ed. (1992) Philosophy and the Computer, Boulder, San Francisco, and
Oxford: Westview Press.
Cain, M. and Finch, J. (1981) ‘Towards a Rehabilitation of Data’, in Practice and
Progress: British Sociology 1950-1980, eds. P. Abrams, R. Deem, J. Finch, & P. Rock,
George Allen and Unwin, London.
Clifford, J. and Marcus, G. E., eds. (1986) Writing Culture: The Poetics and Politics of
Ethnography. University of California Press, Berkeley, California.
Conover, M. (2011) ‘Jimmy Lin’, Complexity and Social Networks Blog, [online]
Available at:
http://www.iq.harvard.edu/blog/netgov/2011/07/the_international_conference_o.html (09
December 2011).
Crawford, K. (2009) ‘Following you: Disciplines of listening in social media’,
Continuum: Journal of Media & Cultural Studies vol. 23, no. 4, pp. 532-533.
Derrida, J. (1996) Archive Fever: A Freudian Impression. trans. Eric Prenowitz,
University of Chicago Press, Chicago & London.
Dourish, P. & Bell, G. (2011) Divining a Digital Future: Mess and Mythology in Ubiqui-
tous Computing, MIT Press, Cambridge, MA.
Du Gay, P. and Pryke, M. (2002) Cultural Economy: Cultural Analysis and Commercial
Life, Sage, London.
Durkheim, E. (1895/1982) Rules of Sociological Method, The Free Press, New York,
NY.
Ess, C. (2002) ‘Ethical decision-making and Internet research: Recommendations from
the aoir ethics working committee’, Association of Internet Researchers, [online]
Available at: http://aoir.org/reports/ethics.pdf (12 September 2011).
Fischer, C. (1982) To Dwell Among Friends: Personal Networks in Town and City.
University of Chicago, Chicago.
Forsythe, D. (2001) Studying Those Who Study Us: An Anthropologist in the World of
Artificial Intelligence, Stanford University Press, Stanford.
Freeman, L. (2006) The Development of Social Network Analysis, Empirical Press,
Vancouver.
Fry, J.P., and E.H. Sibley. (1974) ‘Evolution of Database Management
Systems,’ Computing Surveys vol. 8, no. 1, pp. 7-42. Reprinted in (1996) Great Papers in
Computer Science, ed. L. Laplante, IEEE Press, New York.
Gitelman, L. (2011) Notes for the upcoming collection ‘Raw Data’ is an Oxymoron,
[online] Available at: https://files.nyu.edu/lg91/public/ (23 July 2011).
Golder, S. (2010) ‘Scaling Social Science with Hadoop’, Cloudera Blog, [online]
Available at: http://www.cloudera.com/blog/2010/04/scaling-social-science-with-hadoop/
(18 June 2011).
Golder, S. and Macy M. W. (2011) ‘Diurnal and Seasonal Mood Vary with Work, Sleep
and Daylength Across Diverse Cultures,’ Science vol. 333, pp. 1878-1881.
Granovetter, M. S. (1973) ‘The Strength of Weak Ties,’ American Journal of Sociology
vol. 78, no. 6, pp. 1360-80.
Harding, S. (2010) ‘Feminism, science and the anti-Enlightenment critiques’, in Women,
Knowledge and Reality: Explorations in Feminist Philosophy, eds A. Garry and M.
Pearsall, Unwin Hyman, Boston, pp. 298-320.
Homans, G.C. (1974) Social Behavior: Its Elementary Forms, Harvard University Press,
Cambridge, MA.
Kranzberg, M. (1986) ‘Technology and History: Kranzberg's Laws’, Technology and
Culture vol. 27, no. 3, pp. 544-560.
Latour, B. (2009). ‘Tarde’s idea of quantification’, in The Social After Gabriel Tarde:
Debates and Assessments, ed M. Candea, London: Routledge, pp. 145-162. [online]
Available at: http://www.bruno-latour.fr/articles/article/116-TARDE-CANDEA.pdf (19
June 2011).
Lazer, D., Pentland, A., Adamic, L., Aral, S., Barabási, A., Brewer, D., Christakis, N.,
Contractor, N., Fowler, J., Gutmann, M., Jebara, T., King, G., Macy, M., Roy, D., & Van
Alstyne, M. (2009). ‘Computational Social Science’. Science vol. 323, pp. 721-3.
Leinweber, D. (2007) ‘Stupid Data Miner Tricks: Overfitting the S&P 500’, The Journal
of Investing, vol. 16, no. 1, pp. 15-22.
Lewis, K., Kaufman, J., Gonzalez, M., Wimmer, A., & Christakis, N. (2008) ‘Tastes, ties,
and time: A new social network dataset using Facebook.com’, Social Networks vol. 30,
pp. 330-342.
Lotan, G., Graeff, E., Ananny, M., Gaffney, D., Pearce, I., & boyd, d. (2011). ‘The
Revolutions Were Tweeted: Information Flows during the 2011 Tunisian and Egyptian
Revolutions,’ International Journal of Communications vol. 5, Feature 1375–1405.
Manovich, L. (2011) ‘Trending: The Promises and the Challenges of Big Social Data’,
Debates in the Digital Humanities, ed. M. K. Gold. The University of Minnesota Press,
Minneapolis, MN [online] Available at:
http://www.manovich.net/DOCS/Manovich_trending_paper.pdf (15 July 2011).
McCloskey, D. N. (1985) ‘From Methodology to Rhetoric’, in The Rhetoric of
Economics, D. N. McCloskey, University of Wisconsin Press, Madison, pp. 20-35.
Meeder, B., Tam, J., Gage Kelley, P., & Faith Cranor, L. (2010) ‘RT @IWantPrivacy:
Widespread Violation of Privacy Settings in the Twitter Social Network’, Paper
presented at Web 2.0 Security and Privacy, W2SP 2011, Oakland, CA.
Meiss, M.R., Menczer, F., and A. Vespignani. (2008) ‘Structural analysis of behavioral
networks from the Internet’, Journal of Physics A: Mathematical and Theoretical, vol.
41, no. 22, pp. 220-224.
Moretti, F. (2007) Graphs, Maps, Trees: Abstract Models for a Literary History. Verso,
London.
Onnela, J. P., Saramäki, J., Hyvönen, J., Szabó, G., Lazer, D., Kaski, K., Kertész, J., &
Barabási, A.L. (2007) ‘Structure and tie strengths in mobile communication networks’,
Proceedings of the National Academy of Sciences, vol. 104, no. 18, pp. 7332-7336.
Pariser, E. (2011) The Filter Bubble: What the Internet is Hiding from You. Penguin
Press, New York, NY.
Radcliffe-Brown, A.R. (1940) ‘On Social Structure’, The Journal of the Royal
Anthropological Institute of Great Britain and Ireland vol.70, no.1, pp.1–12.
Reverby, S. M. (2009) Examining Tuskegee: The Infamous Syphilis Study and Its Legacy.
University of North Carolina Press.
Savage, M. and Burrows, R. (2007) ‘The Coming Crisis of Empirical Sociology’,
Sociology, vol. 41, no. 5, pp. 885-899.
Schrag, Z. M. (2010) Ethical Imperialism: Institutional Review Boards and the Social
Sciences, 1965-2009. Johns Hopkins University Press, Baltimore, Maryland.
Shamma, D.A., Kennedy, L., and Churchill, E.F. (2010). ‘Tweetgeist: Can the Twitter
Timeline Reveal the Structure of Broadcast Events?,’ CSCW 2010.
Suchman, L. (2011) ‘Consuming Anthropology’, in Interdisciplinarity: Reconfigurations
of the Social and Natural Sciences, eds A. Barry and G. Born, Routledge, London and
New York.
Twitter. (2011) ‘One hundred million voices’, Twitter Blog, [online] Available at:
http://blog.twitter.com/2011/09/one-hundred-million-voices.html (12 September 2011).
Veinot, T. (2007) ‘The Eyes of the Power Company: Workplace Information Practices of
a Vault Inspector’, The Library Quarterly, vol.77, no.2, pp.157-180.
Wang, X. (2011) ‘Twitter Posts Show Workers Worldwide are Stressed out on the Job,’
Bloomberg Businessweek. [online] Available at:
http://www.businessweek.com/news/2011-09-29/twitter-posts-show-workers-worldwide-are-stressed-out-on-the-job.html
(12 March 2012).
Wu, S., Hofman, J. M., Mason, W. A., & Watts, Duncan J. (2011). ‘Who Says What to
Whom on Twitter,’ Proceedings of WWW'11.
Zimmer, M. (2008) ‘More on the ‘Anonymity’ of the Facebook Dataset – It’s Harvard
College’, MichaelZimmer.org Blog, [online] Available at:
http://www.michaelzimmer.org/2008/01/03/more-on-the-anonymity-of-the-facebook-dataset-its-harvard-college/
(20 June 2011).
... Pages 189-194 However, one issue that is under-discussed but fundamental is how AI treats regional languages in the media ecosystem (Rahimi, 2018). AI tends to be trained more on major languages such as English or certain national languages, while minority regional languages often do not receive the same attention (Bender, 2020). This poses a real risk to the sustainability of local languages and cultures (Krauss, 1992). ...
... To function optimally, AI requires large amounts of data, including users' personal information (Zuboff, 2019). This opens the door to potential data misuse, whether in the form of unauthorized collection of information, manipulation of reader behavior, or threats to freedom of speech (boyd & Crawford, 2012). Privacy, which should be an individual right, is now increasingly blurred in an AI-dominated digital ecosystem (Lyon, 2001). ...
... In addition, the lack of representation of regional languages is one of the biggest challenges in AI journalism (Rahimi, 2018). AI models tend to favor widely used languages such as English, making less common regional languages marginalized (Bender, 2020). This linguistic inequality risks threatening the survival of local cultures, as the media becomes increasingly homogeneous under the control of AI systems (Krauss, 1992). ...
Article
The rapid advancement of artificial insights (AI) within the media industry has transformed how individuals interact with data. From news coverage to computerization, information examination, and content creation, AI is increasingly being utilized to enhance the efficiency and quality of news production. Be that as it may, nearby its benefits, developing dependence on AI moreover raises moral concerns, counting algorithmic inclination, information security, and phonetic representation. This thinking embraces a subjective approach with a clear investigation strategy to investigate AI's moral suggestions in media. The discoveries uncover that whereas AI makes strides in efficiency and journalistic exactness, it too presents critical challenges, such as the chance of deception, the need for straightforwardness, and dangers to social and phonetic differing qualities. In this manner, it is vital for the media industry to guarantee mindful AI execution by prioritizing straightforwardness, moral administration, and etymological inclusivity. With a more moral and comprehensive approach, AI can improve the media scene without compromising the central values of news coverage.
... Our position is that we can use LLMs to make human meaning legible at an unprecedented scale; that "thin" metrics alone will not be able to do this, because they cannot adequately represent cultural context; and that developing "thick" representational formats might eventually allow our society's most powerful and impactful sociotechnical systems to support the heterogeneous range of experiences that really matter to people. We argue that such systems have previously been unable to represent human meaning because they typically rely on thin descriptions: numerical representations that enforce standardization, and therefore strip human activity of the broader context that gives it meaning [9,59,71,89]). By contrast, scholars in the humanities and qualitative social sciences have developed robust frameworks for representing meaning based on thick descriptions: verbal representations that allow for heterogeneity, and therefore retain crucial contextual information necessary for analyzing cultural and experiential meaning [27,82,87,108]. ...
... Thus, one promising conceptual framework for making meaning legible at scale is to make thick descriptions legible at scale. However, the default format of representation within most sociotechnical systems is thin; generally speaking, only numerical signals-canonical thin descriptions such as GDP, engagement metrics, or stock price-are legible within the information processing capacities of such systems [9,59,71,89]. In this section, we argue that, across a wide range of sociotechnical systems, the use of thin descriptions strips away the kind of contextual data needed to encode a signal of human meaning. ...
Preprint
Full-text available
This position paper argues that large language models (LLMs) can make cultural context, and therefore human meaning, legible at an unprecedented scale in AI-based sociotechnical systems. We argue that such systems have previously been unable to represent human meaning because they rely on thin descriptions: numerical representations that enforce standardization and therefore strip human activity of the cultural context that gives it meaning. By contrast, scholars in the humanities and qualitative social sciences have developed frameworks for representing meaning through thick description: verbal representations that accommodate heterogeneity and retain contextual information needed to represent human meaning. While these methods can effectively codify meaning, they are difficult to deploy at scale. However, the verbal capabilities of LLMs now provide a means of (at least partially) automating the generation and processing of thick descriptions, potentially overcoming this bottleneck. We argue that the problem of rendering human meaning legible is not just about selecting better metrics, but about developing new representational formats (based on thick description). We frame this as a crucial direction for the application of generative AI and identify five key challenges: preserving context, maintaining interpretive pluralism, integrating perspectives based on lived experience and critical distance, distinguishing qualitative content from quantitative magnitude, and acknowledging meaning as dynamic rather than static. Furthermore, we suggest that thick description has the potential to serve as a unifying framework to address a number of emerging concerns about the difficulties of representing culture in (or using) LLMs.
... Other research evaluates interpretation qualities [6][7][8] , while some explore method similarities [9,10] . The field has also become central to discussions about ML bias and fairness [11][12][13] . ...
Article
This paper offers a comprehensive analysis of the interpretability of key Machine Learning models, including ElasticNet regression, Random Forest, and Neural Networks, when faced with various types of noise. Focusing on both synthetic and real-world datasets of diverse sizes (385 to 15,000 samples), the study probes the models' ability to detect hidden patterns, especially in the presence of varied noise conditions (Gaussian, Perlin, and Simplex). Through systematic evaluation using Permutation Feature Importance (PFI) and SHAP summary plots, our research reveals a strong correlation between dataset size and model robustness to noise perturbations. The results demonstrate that larger datasets consistently lead to more stable feature importance rankings and better preservation of model interpretability under noise conditions. While ElasticNet shows superior performance on larger datasets, Neural Networks prove most sensitive to noise, particularly with smaller datasets. The findings provide valuable insights for practical applications of machine learning, suggesting that emphasis should be placed on acquiring larger training datasets to ensure robust and trustworthy model interpretations in noisy environments. This work contributes to the broader understanding of ML model interpretability and provides guidance for model selection in real-world applications where data noise is inevitable.
... This chapter will discuss ethical tensions and related social media paradoxes in relation to social media based publicity appeals for missing persons. Ethical tensions are defined as conflicts or dilemmas that arise when there are competing moral claims or principles that are difficult to reconcile in cyberspace (Boyd & Crawford, 2012;Zimmer, 2010;Markham & Buchanan, 2012). The ethical tensions present within this thesis concern individual privacy and the public's and stakeholder's interest in disseminating information on missing persons widely. ...
Thesis
The overarching aim of this PhD thesis was to explore the impact and effectiveness of social media-based publicity appeals for missing persons through the exploration of social media bias, the role of platforms and technologies and user motivations to engage and share missing persons appeals. This thesis applies innovative methodologies such as manual and automatic web scraping techniques and provides a thorough methodological discussion regarding such approaches of collecting non-obtrusive data and digital methodologies. Ethical issues are discussed in consideration for data collection techniques. Study 1 aimed to build a demographic profile of missing individuals who receive publicity appeals on social media and to assess platform differences between Facebook and Twitter in relation engagement levels. The results highlight key platform differences and how the public engages with appeals on Facebook and Twitter. Study 2 aimed to further explore appeals posted on Twitter using a longitudinal approach to assess changes in network structures of Twitter accounts engaging with the Missing People Charity and explore the impact of missing persons characteristics on engagement levels. Results highlight key practical implications of managing appeals on Twitter to promote privacy and further safeguarding of the missing individuals. Study 3 explored user motivations through the understanding of social media-based affordances and propensity to
... A growing body of literature in the past two decades has focused on the myriad ways in which uncritical faith in datasets, algorithms, and AI technologies has performative effects that shape decision-making. This is commonly based on the assumption that quantifying techniques, given voluminous amounts of data, can lead observers closer to a view from nowhere (Baumgartner et al., 2023;Boyd & Crawford, 2012;Gustavsson & Ljungberg, 2021;Striphas, 2015). The emergence of data-and algorithm-mediated life, this literature helps reveal, has been spurred by imaginaries of quantified selves and datafied ecologies, typically with a nod toward managerialist improvements of spaces, places, and the bodies that inhabit them. ...
Article
Full-text available
Synthetic data - artificially produced data used for various data science tasks - have become the subject of intense scholarly interest, engendering both hope and hype in fields like machine learning (ML) and data privacy. In this commentary, we shed light on a little-studied facet of the emerging synthetic data landscape: their evaluation through the use of different quality measures, such as privacy, utility, and fidelity metrics. While these may seem highly technical, this commentary argues that evaluation metrics are inextricably linked to the expectations, ethics and politics of synthetic data. Situating synthetic data metrics within longer histories of data measurement in big data and ML discourses, we unfold a conceptualization of synthetic data metrics as metrological regimes which highlights the multifaceted ways in which they are implicitly and explicitly political. We put this concept to use by providing a three-fold preliminary analysis of metrics for the evaluation of synthetic tabular data: first, we outline the current constitution of synthetic data’s metrological regimes around utility, privacy, and fidelity metrics; second, we highlight the performativity of these metrological regimes; that is, how they overshadow other crucial measures and enact quantifications of essentially contested concepts; and third, we emphasize the fragility of synthetic data’s metrological regimes by pointing to the eruption of specific negotiations regarding which privacy metrics (not) to use for synthetic data evaluation. By foregrounding how metrics shape the expectations, ethics, and politics of synthetic data, this commentary underlines the need for their critical study.
Article
Full-text available
This paper presents a hybrid framework for dynamic incident prioritization in enterprise environments, combining fuzzy logic, natural language processing, and reinforcement learning. The proposed system models incident descriptions through semantic embeddings derived from advanced text analytics, which serve as state representations within a fuzzy Q-learning model. Severity and urgency are encoded as fuzzy variables, enabling the prioritization process to manage linguistic vagueness and operational uncertainty. A mathematical formulation of the fuzzy Q-learning algorithm is developed, including fuzzy state definition, reward function design, and convergence analysis. The system continuously updates its prioritization policy based on real-time feedback, adapting to evolving patterns in incident reports and resolution outcomes. Experimental evaluation on a dataset of 10,000 annotated incident descriptions demonstrates improved prioritization accuracy, particularly for ambiguous or borderline cases, and reveals a 19% performance gain over static fuzzy and deep learning-based baselines. The results validate the effectiveness of integrating fuzzy inference and reinforcement learning in incident management tasks requiring adaptability, transparency, and mathematical robustness.
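As a hedged illustration of the fuzzy Q-learning idea the abstract describes (not the paper's actual formulation), the sketch below distributes a temporal-difference update across fuzzy severity states in proportion to membership degree. The triangular membership functions, the action set, and the reward scheme are all assumptions for demonstration.

```python
# Minimal fuzzy Q-learning sketch for incident prioritization.
# Membership functions, actions, and rewards are illustrative assumptions.

def tri(x, a, b, c):
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Fuzzy sets over a normalized severity score in [0, 1] (assumed).
SEVERITY_SETS = {
    "low":    lambda x: tri(x, -0.5, 0.0, 0.5),
    "medium": lambda x: tri(x,  0.0, 0.5, 1.0),
    "high":   lambda x: tri(x,  0.5, 1.0, 1.5),
}
ACTIONS = ["defer", "normal", "escalate"]

# Q-table indexed by (fuzzy state, action).
Q = {(s, a): 0.0 for s in SEVERITY_SETS for a in ACTIONS}

def fuzzy_q_update(x, action, reward, alpha=0.1, gamma=0.9):
    """Spread the TD update across fuzzy states by membership degree."""
    for name, mu in SEVERITY_SETS.items():
        m = mu(x)
        if m == 0.0:
            continue
        best_next = max(Q[(name, a)] for a in ACTIONS)
        td = reward + gamma * best_next - Q[(name, action)]
        Q[(name, action)] += alpha * m * td

def priority(x):
    """Pick the action with the highest membership-weighted Q-value."""
    score = lambda a: sum(mu(x) * Q[(s, a)] for s, mu in SEVERITY_SETS.items())
    return max(ACTIONS, key=score)

# A high-severity incident repeatedly rewarded for escalation:
for _ in range(50):
    fuzzy_q_update(0.9, "escalate", reward=1.0)
```

Because updates are weighted by membership, a single feedback signal adjusts every fuzzy state the incident partially belongs to, which is how such systems handle the "linguistic vagueness" the abstract mentions.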
Article
Full-text available
This review article critically examines the ethical dimension of implementing emerging technologies in educational settings, with a focus on artificial intelligence, big data, and digital monitoring systems. Based on the analysis of 40 specialized sources, five thematic axes are explored: data privacy, algorithmic bias, institutional governance, digital surveillance, and participatory ethical design. The findings reveal a concerning trend towards the normalization of practices that undermine the autonomy, equity, and privacy of students and teachers. In addition, regulatory gaps and the lack of digital ethical literacy in technology adoption processes are identified. In response, ethics-by-design frameworks, progressive consent policies, and participatory evaluation are proposed as strategies to mitigate identified risks. The review argues that ethics should not be seen as a post-innovation add-on but rather as a guiding principle from the outset of educational technology development and implementation. The article offers conceptual and practical foundations for advancing towards a more just, inclusive, and human-centered digital education.
Article
This research project explores the intersection of social media analytics, influencer marketing, and the dissemination of misinformation in the e-commerce sector. As social platforms become increasingly central to product promotion and consumer engagement, understanding both the potential and pitfalls of this digital landscape is critical. This study aims to address two primary concerns: evaluating the effectiveness of influencer marketing strategies and examining how misinformation about e-commerce products spreads across social media. This research will employ a mixed-methods approach, primarily focusing on the analysis of existing social media datasets related to e-commerce brands and influencer marketing campaigns. The research utilizes social media datasets to analyze engagement metrics (likes, shares, comments), user network structures, and textual content. Network analysis will be employed to understand the diffusion pathways of fake news. Statistical modeling will be used to determine the correlation between engagement metrics and the virality of misinformation. Simultaneously, influencer content formats—such as reviews, unboxings, and testimonials—are assessed in terms of their impact on audience engagement and consumer conversion behavior. Key findings are anticipated to reveal strong correlations between specific engagement metrics and both influencer marketing success and misinformation propagation. The research is expected to identify which types of content and influencer strategies drive higher return on investment (ROI) for e-commerce brands. It also seeks to establish frameworks for the early detection and mitigation of fake news through predictive modeling. This study contributes actionable insights for multiple stakeholders. E-commerce brands will benefit from recommendations to enhance influencer collaborations and protect brand integrity. Social media platforms will gain guidance on refining content moderation systems. Consumers will be empowered to make more informed decisions, while influencers and marketers will receive best practices for building trust and maximizing campaign impact. Overall, this project aims to strengthen the digital marketing ecosystem through evidence-based analysis and strategic recommendations.
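The correlation step described above can be sketched minimally in plain Python. The per-post figures, and the choice of likes and reshares as the engagement and virality measures, are invented for illustration; they are not from the study.

```python
# Pearson correlation between an engagement metric and a virality measure,
# computed from scratch. The sample data below are hypothetical.
from math import sqrt

def pearson(xs, ys):
    """Pearson's r for two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-post data: likes vs. number of reshares ("virality").
likes    = [120, 45, 300, 80, 210, 15]
reshares = [ 40, 10, 150, 25,  90,  2]

r = pearson(likes, reshares)
```

A strongly positive r would be the kind of evidence the proposal anticipates, though correlation alone cannot distinguish engagement driving virality from the reverse.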
Article
Full-text available
In today's volatile business environment, competitive analysis has evolved from a periodic strategic exercise into a core driver of profitability. This study examines how firms across industries leverage competitive intelligence, from traditional frameworks to AI-powered analytics, to enhance financial performance. Our analysis of 200 enterprises reveals that organizations adopting AI-enhanced competitive intelligence achieve 18-22% higher profitability than peers, demonstrating the tangible value of advanced strategic insights. However, sector-specific nuances emerge: while technology and healthcare firms lead in AI adoption, traditional industries still benefit from conventional approaches like SWOT analysis, particularly in stable markets. The research also uncovers a critical paradox: 25% of firms experience diminished strategic judgment due to over-reliance on data analytics, a phenomenon termed "algorithmic myopia." Case studies illustrate successful implementations: Target's 145% e-commerce growth resulted from blending AI with traditional analysis, while Moderna's 73% gross margins reflect the power of patent analytics in R&D strategy. These findings highlight the need for balanced approaches that combine technological capabilities with human expertise. For practitioners, the study provides a roadmap for developing adaptive competitive intelligence frameworks tailored to industry dynamics. For policymakers, it underscores the urgency of ethical guidelines for AI in competitive strategy. The research advances theory by demonstrating how hybrid analytical models, merging classical rigor with digital innovation, create sustainable advantages. Ultimately, this work redefines competitive analysis as both science and strategic art, offering critical insights for navigating today's complex business landscape.
Chapter
The idea of discipline opens up a nexus of meaning. Disciplines discipline disciples. A commitment to a discipline is a way of ensuring that certain disciplinary methods and concepts are used rigorously and that undisciplined and undisciplinary objects, methods and concepts are ruled out. By contrast, ideas of interdisciplinarity imply a variety of boundary transgressions, in which the disciplinary and disciplining rules, trainings and subjectivities given by existing knowledge corpuses are put aside. In this introduction we interrogate the current preoccupation with interdisciplinarity and transdisciplinarity, in particular the ascendance in recent years of a particular discourse on interdisciplinarity where it is associated with a more generalised transformation in the relations between science, technology and society. We are therefore less concerned with interdisciplinarity in general than with the contemporary formation of interdisciplinarity: how it has come to be seen as a solution to a series of current problems, in particular the relations between science and society, the development of accountability, and the need to foster innovation in the knowledge economy. The present situation, we will suggest, can be understood as a problematisation: the question of whether a given knowledge practice is too disciplinary, or interdisciplinary, or not disciplinary enough has become an issue and an object of enquiry for governments, funding agencies and researchers.
Article
The term Big Data is applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, currently ranging from a few dozen terabytes to many petabytes of data in a single data set. This chapter addresses some of the theoretical and practical issues raised by the possibility of using massive amounts of social and cultural data in the humanities and social sciences. These observations are based on the author’s own experience working since 2007 with large cultural data sets at the Software Studies Initiative at the University of California, San Diego. The issues discussed include the differences between ‘deep data’ about a few people and ‘surface data’ about many people; getting access to transactional data; and the new ‘data analysis divide’ between data experts and researchers without training in computer science.
Article
University researchers in the United States seeking to observe, survey, or interview people are required first to complete ethical training courses and to submit their proposals to an institutional review board (IRB). Under current rules, IRBs have the power to deny funding, degrees, or promotion if their recommended modifications to scholars' proposals are not followed. This volume explains how this system of regulation arose and discusses its chilling effects on research in the social sciences and humanities. Zachary M. Schrag draws on original research and interviews with the key shapers of the institutional review board regime to raise important points about the effect of the IRB process on scholarship. He explores the origins and the application of these regulations and analyzes how the rules, initially crafted to protect the health and privacy of the human subjects of medical experiments, can limit even casual scholarly interactions such as a humanist interviewing a poet about his or her writing. In assessing the issue, Schrag argues that biomedical researchers and bioethicists repeatedly excluded social scientists from rule making and ignored the existing ethical traditions in nonmedical fields. Ultimately, he contends, IRBs not only threaten to polarize medical and social scientists, they also create an atmosphere wherein certain types of academics can impede and even silence others. The first work to document the troubled emergence of today's system of regulating scholarly research, Ethical Imperialism illuminates the problems caused by simple, universal rule making in academic and professional research. This short, smart analysis will engage scholars across academia. © 2010 by The Johns Hopkins University Press. All rights reserved.