The cultural environment: measuring culture with big data
Christopher A. Bail
University of North Carolina at Chapel Hill, 225 Hamilton Hall, Chapel Hill, NC 27599, USA
e-mail: christopherandrewbail@gmail.com
© Springer Science+Business Media Dordrecht 2014
Theory and Society. DOI 10.1007/s11186-014-9216-5
Abstract The rise of the Internet, social media, and digitized historical archives has produced a colossal amount of text-based data in recent years. While computer scientists have produced powerful new tools for automated analyses of such "big data," they lack the theoretical direction necessary to extract meaning from them. Meanwhile, cultural sociologists have produced sophisticated theories of the social origins of meaning, but lack the methodological capacity to explore them beyond micro-levels of analysis. I propose a synthesis of these two fields that adjoins conventional qualitative methods and new techniques for automated analysis of large amounts of text in iterative fashion. First, I explain how automated text extraction methods may be used to map the contours of cultural environments. Second, I discuss the potential of automated text-classification methods to classify different types of culture such as frames, schema, or symbolic boundaries. Finally, I explain how these new tools can be combined with conventional qualitative methods to trace the evolution of such cultural elements over time. While my assessment of the integration of big data and cultural sociology is optimistic, my conclusion highlights several challenges in implementing this agenda. These include a lack of information about the social context in which texts are produced, the construction of reliable coding schemes that can be automated algorithmically, and the relatively high entry costs for cultural sociologists who wish to develop the technical expertise currently necessary to work with big data.
Keywords Culture · Content analysis · Mixed-methods · Evolutionary theory
More data were accumulated in 2002 than in all previous years of human history combined.[1] By 2011, the amount of data collected prior to 2002 was being collected every two days.[2] This dramatic growth in data spans nearly every part of our lives, from gene sequencing to consumer behavior.[3] While most of these data are binary or quantitative, text-based data are also being accumulated on an unprecedented scale. In an era of social science research plagued by declining survey response rates and concerns about the generalizability of qualitative research, these data hold considerable potential (Golder and Macy 2011; King 2011; Lazer et al. 2009). Yet social scientists, and cultural sociologists in particular, have largely ignored the promise of so-called "big data." Instead, cultural sociologists have left this wellspring of information about the arguments, worldviews, or values of hundreds of millions of people from Internet sites and other digitized texts to computer scientists, who possess the technological expertise to extract and manage such data but lack the theoretical direction to interpret their meaning.

[1] International Data Corporation, "The 2011 Digital Universe Study: Extracting Value from Chaos," June 2011. See also Christopher R. Johnson, "How Big is Big Data?" lecture at the University of Michigan's Cyber-Infrastructure Conference, November 7, 2012.
[2] Ibid.
The most obvious explosion in text-based data coincided with the rise of the Internet. Between 1995 and 2008 the number of websites expanded by a factor of more than 66 million, recently surpassing 1 trillion.[4] Although sociologists were understandably concerned about digital divides in years past, these inequalities appear to be steadily decreasing (DiMaggio and Bonikowski 2008; DiMaggio et al. 2001). According to a 2012 survey, roughly half of all Americans visit a social media site such as Facebook or Twitter each day, producing billions of lines of text in so doing.[5] These trends are markedly higher among younger people, suggesting they may only continue to grow over time.[6] Most of the text from social media sites is readily accessible via simple computer programs.[7] Yet the outgrowth in text-based data on the Internet is not limited to social media sites. Screen-scraping technologies can be used to extract information from any number of Internet sites within time frames that are only limited by digital storage capacity.[8] And the potential to collect such data is not only tied to the future, but also the past. Since 1996, a non-profit organization known as the Internet Archive has been storing all text from nearly every website on the Internet.
[3] The US National Science Foundation invested more than $15 million in Big Data projects in 2012, and will easily surpass this amount in upcoming years due to the development of new infrastructure for funding big data projects in collaboration with Britain's Economic & Social Research Council, the Netherlands Organization for Scientific Research, and the Canada Foundation for Innovation, among many others.
[4] Jesse Alpert and Nissan Hajaj, "We knew the web was big…," Official Google Blog, July 25, 2008 (http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html, accessed January 2012).
[5] Pew Internet & American Life Project, February 1, 2012.
[6] "Social Networking Popular Across Globe," Pew Research Global Attitudes Project, December 12, 2012.
[7] Moreover, the US Library of Congress recently announced plans to release a database of every single Twitter message ever made. Current estimates place the total number of tweets that might be archived at more than 170 billion.
[8] Web-scraping technologies have facilitated the collection of remarkably large datasets. Golder and Macy (2011), for example, recently conducted a study of more than 500 million Twitter messages produced in more than 84 countries over a 2-year period.
[9] Though access to the entire Google book archive is limited by pay walls designed to protect copyright privileges, Google has released the entire dataset in "ngram" format, which allows scholars to analyze them via the automated text analysis tools discussed in further detail below.

The outgrowth of text-based data is also not confined to the Internet. Thanks to new digital technologies from fields as diverse as library science and communications, an unprecedented amount of qualitative data is being archived. Google alone has already created digital copies of nearly every single book ever written in collaboration with more than 19 million libraries worldwide.[9] Academic data warehouses such as LEXIS-NEXIS or ProQuest now contain digital copies of most of the world's journals, newspapers, and magazines. The Vanderbilt Television News Archive contains copies of most major newscasts produced since 1998. An unprecedented amount of text-based data that describe
legislative debates, government reports, and other state discourse is also now available on websites such as the National Archives of the United States and Great Britain. Qualitative academic research is also being compiled within "meta-data" archives on an unprecedented scale, from in-depth interview data to field notes.[10] Continuing improvement in digital speech recognition technologies has also facilitated even more text-based data, from historical audio sources to local town hall meetings that are recorded and uploaded to websites for posterity. Indeed, the remarkable growth in text-based data warrants a brief thought experiment: what types of text or speech-based data are not currently being archived?
If the answer is "little" or "very little" text is not being archived, then cultural sociology must have a reckoning with big data alongside those in other fields.[11]
Political scientists are currently exploring the potential of social media to explain political mobilization (Hopkins and King 2010; Livne et al. 2011). Public health scholars use Twitter to identify trends in disease (Paul and Dredze 2011), and communications scholars claim it can be used to predict shifts within the stock market (Bollen et al. 2011). Even humanities scholars have invented the vibrant new field of digital humanities (e.g., Gold 2012; Moretti 2013; Tangherlini and Leonard 2013). By comparison, cultural sociologists have made very few ventures into the universe of big data even though texts are a central object of study in the field, in the form of primary documents, interview transcriptions, or field notes.[12]
In this article, I argue inattention to big data among cultural sociologists is doubly surprising since it is "naturally occurring," unlike survey research or cross-sectional qualitative interviews, and therefore critical to understanding the evolution of meaning structures in situ. That is, many archived texts are the product of conversations between individuals, groups, or organizations instead of responses to questions created by researchers who usually have only post-hoc intuition about the relevant factors in meaning-making, much less how culture evolves in real time.[13]
For all the promise of big data for cultural sociology, formidable obstacles remain. First of all, the sheer volume of data can be overwhelming. Large corpora cannot be coded by hand, and automated data mining techniques are of little utility if they are not guided by theory. Second, big data is untidy. Although computer-assisted data classification and data reduction techniques have improved in the past decade, much big data analysis remains computationally intensive and therefore out of reach for many cultural sociologists, particularly those without any background in statistics or computer programming. Third, and perhaps most importantly, there is much that is of interest to cultural sociologists that is not easily reducible to text. The greatest challenge for cultural sociologists interested in big data is to develop new techniques to measure the unspoken or implicit meanings that occur in-between words. The preconscious cultural scripts or frames that shape how people understand the world (e.g., DiMaggio 1997), for example, are not always manifest in speech or text. Similarly, most big data eschews the production of meaning through bodily interaction (e.g., Eliasoph and Lichterman 2003), though the future of big data may include new techniques to analyze the ever-increasing volume of video on the Internet (Collins 2013; Lan and Raptis 2013).

[10] See, for example, the Dataverse Network, the Inter-university Consortium for Political and Social Research, and the United Kingdom's Qualidata archive.
[11] The neologism "big data" has come to refer to many different types of data. Here, I use the term to refer to the increasingly large volume of text-based data that is often, though not always, produced through digital sources. As the remainder of this manuscript describes, these data are also unique because they are "naturally occurring," unlike survey data which result from the intrusion of researchers into everyday life.
[12] Exceptions described in additional detail below include Franzosi (2004), Lewis et al. (2008), Bail (2012), Bail (forthcoming), and several other works in progress.
[13] "Real time" refers to the collection, presentation, or analysis of data at or very near the time it is being produced by social actors.
This article does not offer solutions to each of these limitations of big data. Instead, it
provides a critical survey of recent developments within the big data movement and
links them to outstanding theoretical debates and measurement challenges within
cultural sociology. These include the measurement of cultural environments or meaning
systems such as discursive fields; the classification of cultural elements such as frames
or schema within such systems; and tracing cultural processes over long segments of
time. In describing the promise of big data for cultural sociology, I also detail how the
latter field may address some of the most vexing challenges of the former given its
foundational interest in the systematic study of meaning. I provide only limited
discussion of the technical and logistical issues that arise in working with big data
because these issues are currently being addressed within separate literatures referenced
below.[14] I also do not review the promising field of quantitative narrative analysis because it has been addressed elsewhere.[15] This article is thus an invitation to cultural
sociologists curious about the potential of big data and a call to shatter the disciplinary
silos that inhibit collaboration between this field and those who lead the big data
movement.
Mapping cultural environments
By and large, the central objects of study in cultural sociology have been confined to
micro-levels of analysis. For example, cultural elements such as symbolic boundaries
(e.g., Lamont 1992), cultural toolkits (e.g., Swidler 1986), cognitive schemas (e.g.,
DiMaggio 1997), and cultural frames (e.g., Benford and Snow 2003) have been defined
as judgments, classifications, or pre-conscious decisions that can only be measured
through close readings of texts such as interview transcripts, content analysis of key
texts, or ethnographic field notes. Yet as Swidler (1995) argues, the greatest unanswered question in the sociology of culture is whether and how some cultural elements control, anchor, or organize others.[16]
For example, how are cultural frames ordered
within vast discursive fields? Is there a space between such fields? How do cultural
frames shape the evolution of fields more broadly? Addressing such questions requires
meso- and macro-level analysis of the relationship between multiple cultural elements
or systems of meaning. One of the most promising dimensions of the big data
movement for cultural sociology is to enable new analyses at these larger levels of
analysis. As I describe below, one can now obtain every website, blog, social media
message, newspaper article, or television transcript on a given topic fairly easily.
[14] For a technical overview of techniques designed for analysis of Big Data, see Manning and Schuetze (1999).
[15] For an overview, see Franzosi (2009).
[16] See also Ghaziani and Baldassarri (2011).

The capacity to capture all, or nearly all, relevant text on a given topic opens exciting new lines of meso- and macro-level inquiry into what I call cultural environments (Bail forthcoming). Ecological or functionalist interpretations of culture
have been unpopular with cultural sociologists for some time, most likely because the subfield defined itself as an alternative to the general theory proposed by Talcott Parsons (Alexander 2006). Yet many cultural sociologists also draw inspiration from Mary Douglas (e.g., Alexander 2006; Lamont 1992; Zelizer 1985), who, like Swidler, insists upon the need for our subfield to engage broader levels of analysis. For sociology "to accept that no functionalist arguments work," writes Douglas (1986, p. 43), "is like cutting off one's nose to spite one's face." To be fair, cultural sociologists have recently made several programmatic statements about the need to engage functional or ecological theories of culture. Abbott (1995), for example, explains the formation of boundaries between professional fields as the result of an evolutionary process. Similarly, Lieberson (2000) presents an ecological model of fashion trends in child-naming practices. In a review essay, Kaufman (2004) describes such ecological approaches to cultural sociology as one of the three most promising directions for the future of the subfield.[17]
The concept of discursive fields is perhaps the most promising theoretical construct
to advance an ecological approach to cultural sociology (Bourdieu 1975; Foucault
1970; Martin 2003; Wuthnow 1993). Yet field theory is often castigated for being
tautological, or assuming the existence of invisible or intangible social forces that
reproduce structures of inequality or patterns of cultural differentiation without ever
directly observing them. The boundaries of fields are usually unobserved in empirical
studies because of the considerable methodological obstacles involved. Apart from
Eyal (2009), cultural sociologists have scarcely theorized the outer limits of cultural
fields, the spaces between them, or the relationships among multiple fields.[18] This is a
significant limitation since most field theory makes several assumptions that are
inherently ecological. For example, many studies assume that relationships between
actors or groups of actors within a field produce a polarity that sustains or reproduces
uneven power relationships or access to institutions (Bourdieu 1985; Fligstein and
McAdam 2011; Wuthnow 1993). Others borrow more directly from ecological or
evolutionary theory to explain the competition for attention or resources within fields
(Abbott 2001; Kaufman 2004; Lieberson 2000), or the ability of cultural entrepreneurs
to exploit niches within such environments (e.g., Mische 2008).
Despite the implicit ecological reasoning of field theory, most applications of this
framework rely upon micro- or meso-level measurement strategies. For example, many
studies identify key actors or institutions within fields and trace their influence over
other parts of the fields. Other studies focus upon conflict or classification struggles
within fields in order to identify such influential actors (Bourdieu 1990). As a result,
these types of studies only observe the consequences of field-level processes rather than
meso- or macro-level relationships between social actors and cultural elements that
most scholars believe create such social spaces.[19] These micro-level measurement strategies are typically necessary because most discursive fields are so broad that an entire team of researchers working for several years could only map a fraction of all the texts, transcripts, or archives that define them. The size of most cultural fields has become even more daunting with the rise of the Internet. Indeed, a researcher could easily follow links between websites for hours only to forget where, when, or why they shifted focus from one site to another.

[17] See also Mark (2003).
[18] One exception is Evans and Kay's (2008) study of field overlap.
[19] Exceptions include Mohr and Guerra-Pearson (2010) and Bail (2012).
The big data movement has made extracting all text from a discursive field easier than ever before.[20]
Massive databases already exist that classify texts into meaningful
social categories. For example, services such as LEXIS-NEXIS and Pro-Quest have
sophisticated searchable indexes that cover industries, geographical location, time, or
different types of text (e.g., newspapers, newswires, or television transcripts). Simple
Boolean operators such as "AND" and "OR" can be used to further specify meaningful cultural environments within each of these sub-samples.[21] Yet perhaps the most
powerful innovation of the big data movement for the mapping of cultural environ-
ments has been screen scraping, or automated extraction of text from websites. Screen
scraping is typically used to mine text or other data from web pages, though it can also
be used to extract text from scanned images using Optical Character Recognition
(OCR) technologies. A variety of data archives have developed searchable indexes
based on such screen-scraping technologies. Google, for example, allows Boolean
searches of its archives of books, blogs, government documents, and major US
newspapers and magazines.
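By way of illustration only, the sketch below shows how a simple Boolean rule might be combined with basic screen scraping in Python. It is not drawn from the article's own materials: the URLs and keywords are placeholders, and the requests and BeautifulSoup packages are one common, but by no means the only, way to implement this.

```python
# Minimal sketch: screen scraping plus a Boolean filter (hypothetical URLs/keywords).
# Requires the third-party packages "requests" and "beautifulsoup4".
import requests
from bs4 import BeautifulSoup

urls = ["http://example.org/page1", "http://example.org/page2"]  # placeholder sites
corpus = []

for url in urls:
    html = requests.get(url, timeout=10).text
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ")
    lowered = text.lower()
    # Boolean rule: keep pages mentioning "immigration" AND ("policy" OR "border")
    if "immigration" in lowered and ("policy" in lowered or "border" in lowered):
        corpus.append({"url": url, "text": text})

print(len(corpus), "pages matched the Boolean rule")
```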
But new technologies produced by the big data movement have also advanced
automated extraction of text far beyond simple indexes, Boolean searches, and screen
scraping. In particular, new techniques have been developed to exploit the relational
nature of many sources of big dataparticularly those from the Internet. For example,
Gong (2011) recently introduced new software that fuses snowball sampling methods
with screen-scraping technologies. The user simply inputs a starting website and a
classifying rule such as a Boolean search term or one of the other classification
algorithms described in further detail below. The software then visits each site that is
linked to the starting website and uses the classifying rule to decide whether it should
be included in the sample. If so, the program extracts all text from the site and repeats the process of "spidering" links across multiple waves that are only constrained by computer memory and processing power. Given a number of different starting sites and a sufficient number of waves, the SnowCrawl software produces a total sample of all websites pertaining to a given topic. Although this tool is currently limited to the
Internet, a number of other qualitative data archives store relational data that could
potentially be analyzed using similar automated snowball methods. What is more, the
majority of newspapers, television stations, journals, or other texts of interest to cultural
sociologists are now available on the web.
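The following sketch is not Gong's SnowCrawl software; it is only a minimal illustration of the snowball logic just described, with a hypothetical seed site and a simple keyword-based classifying rule standing in for the more sophisticated classifiers discussed below.

```python
# Minimal sketch of a snowball ("spidering") crawl guided by a classifying rule.
# Illustration of the logic only; not the SnowCrawl software itself.
from collections import deque
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

def relevant(text):
    """Hypothetical classifying rule: a simple keyword test."""
    return "civil society" in text.lower()

def snowball(seed_urls, max_waves=2):
    sample, seen = [], set(seed_urls)
    frontier = deque((url, 0) for url in seed_urls)
    while frontier:
        url, wave = frontier.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        text = soup.get_text(separator=" ")
        if not relevant(text):
            continue                      # the rule decides whether a site joins the sample
        sample.append({"url": url, "wave": wave, "text": text})
        if wave < max_waves:              # follow outgoing links into the next wave
            for a in soup.find_all("a", href=True):
                link = urljoin(url, a["href"])
                if link not in seen:
                    seen.add(link)
                    frontier.append((link, wave + 1))
    return sample

sites = snowball(["http://example.org/start"], max_waves=2)
```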
A second promising tool for extracting large amounts of data from the web or qualitative data archives is the Application Programming Interface (API). These web-based tools provide an interactive interface with large data archives that are designed to enable targeted data extraction. They were developed primarily for consumer purposes, such as the creation of third-party applications for social media sites such as Facebook, Twitter, or Google, but a number of academics have begun to use them as data collection tools as well (Bail 2013a; Gaby and Caren 2012; Livne et al. 2011). Even conventional media outlets such as the New York Times now offer APIs that enable users to search and download articles or user comments from their website. APIs are superior to other forms of data extraction not only because they enable more sophisticated targeting of different types of text, such as Twitter messages about the Arab Spring, but also because such sites typically record a vast array of information about the users of their sites as well as their behavior online. For example, Twitter's API enables rapid extraction of information about the online social networks of individual users. Facebook's and Google's APIs enable direct interface with their massive archives of web content, and also include information about the size, geographic location, and demographic characteristics of each site's audiences.[22]

[20] While automated data extraction methods are particularly useful for mapping the contours of discursive fields, it is important to note that such techniques do not capture the deeper preconscious cultural elements that undergird social fields as Bourdieu and others have theorized them (e.g., Bourdieu 1990; Fligstein and McAdam 2011; Martin 2003). I return to the question of whether big data techniques can be leveraged to classify such cultural elements in the following section as well as my discussion and conclusion.
[21] For example, one might define a discursive field by identifying all texts with a certain set of keywords or within a certain search index offered by text archives.
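As a rough illustration of API-based extraction, the sketch below uses the tweepy package against Twitter's classic REST interface. It assumes valid developer credentials (the placeholder strings are not real keys), and the method names may differ across tweepy and API versions.

```python
# Minimal sketch of targeted extraction through Twitter's API via tweepy.
# Assumes valid credentials; method names follow tweepy's classic REST interface
# and may differ in newer versions of the library or the API.
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

# Pull recent messages produced by a single account...
tweets = api.user_timeline(screen_name="nytimes", count=100)
for tweet in tweets:
    print(tweet.created_at, tweet.text[:80])

# ...and that account's position in a broader social network (follower ties).
follower_ids = api.followers_ids(screen_name="nytimes")
print(len(follower_ids), "followers retrieved")
```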
Classifying culture
Obtaining total or near total samples of text on a given topic is a remarkable feat given
that it was nearly unthinkable only a decade ago. Yet such giant samples are of little
utility if they cannot be classified in a meaningful manner. Cultural sociology has been
fascinated with classification since its inception because it was largely inspired by the
Durkheimian idea of classification struggles (e.g., Barth 1969; Bourdieu 1975; Douglas 1966; Latour 1988). For example, Gieryn (1999) highlights the critical role of social classification in the evolution of scientific fields. Lamont (1992, 2000) explains how
class and racial boundaries shape the process of group formation. Finally, Espeland and
Stevens (1998) make a broader argument about the key role of commensuration in
producing social power.[23] Yet for all the theoretical interest in the process of classification, cultural sociologists seldom discuss the appropriate way to measure social
categories (Lamont and White 2009). Most studies either rely upon in-depth interviews
or case studies that highlight the social construction of ranking within institutions. The
lack of consensus about how to classify data has even prompted some critics to accuse
cultural sociologists of the reification of social classifications according to their theo-
retical persuasion (e.g., Biernacki 2012).
To date, cultural sociologists have scarcely explored the promise of automated text analysis to classify texts.[24] Where these techniques have been used, they have been relatively
primitive approaches to automation that simply identify keywords or phrases. This approach
is severely limited because it requires the researcher to have an a priori sense of which terms
are well suited to address the theoretical question of interest. Moreover, it eschews the
broader context of words within sentences. One solution to this problem is to evaluate the
co-prevalence of words within sentences using Global Regular Expression Print (GREP)
commands available in qualitative software analysis programs such as Atlas.TI or WordStat. Yet these approaches nevertheless fail to recognize important nuances in the use of language. For example, a GREP search for sentences with the terms "President" and "hate" would reveal both "I hate the President," and "I'd hate to be President."

[22] Facebook's API requires user-authentication to access these data. Therefore, one must either access only publicly available data or obtain an authentication token from a Facebook page's owner. Elsewhere, I argue that app-based technologies are the most promising data collection tools to overcome such challenges. See Bail (2013b).
[23] For a recent review of this literature, see Lamont (2012).
[24] Notable exceptions discussed in further detail below include Mohr (1998), Franzosi (2004), Bearman et al. (1999), Bearman and Stovel (2000), Smith (2007), and Bail (2012).
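To make the problem concrete, the toy example below reproduces this ambiguity with ordinary regular expressions; it is purely illustrative and not a tool used in the studies cited here.

```python
# Illustration of why keyword co-occurrence misses nuance in language use.
import re

sentences = ["I hate the President.", "I'd hate to be President."]
pattern_hate = re.compile(r"\bhate\b", re.IGNORECASE)
pattern_pres = re.compile(r"\bpresident\b", re.IGNORECASE)

for s in sentences:
    # A GREP-style co-occurrence rule flags both sentences as equivalent hits,
    # even though they express very different meanings.
    if pattern_hate.search(s) and pattern_pres.search(s):
        print("MATCH:", s)
```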
Recent technological advances within the fields of computer science, pattern identification, and linguistics have produced a variety of superior alternatives. I begin by reviewing "unsupervised" text classification techniques that rely exclusively on computer algorithms to create meaningful groupings of texts. For example, recent studies have invoked a number of different forms of multi-dimensional scaling or cluster analysis to classify texts (e.g., Grimmer and King 2011; Livne et al. 2011).[25] These techniques replace each unique word in a document with a number and then use various metrics to calculate dissimilarities among all texts in the sample. These measures may be plotted within multidimensional space in order to identify meaningful groupings of documents. A substantial problem with cluster analysis is that the results are highly sensitive to the researcher's assumptions about the
number of possible clusters (k), as well as the mathematical distances employed within each
algorithm. These idiosyncrasies can be controlled, however, if multiple forms of cluster
analysis are used in tandem. Grimmer and King (2011), for example, have developed
software that applies all existing variants of cluster analysis to large text corpora. They
apply this powerful tool to thousands of political texts by or about US presidents in order to
classify their ideological position on a range of substantive issues.
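The sketch below is not Grimmer and King's software; it is a minimal illustration of one common variant of unsupervised clustering (TF-IDF features with k-means) on a few invented documents, with the sensitivity to the choice of k noted in the comments.

```python
# Minimal sketch of unsupervised text clustering: TF-IDF features plus k-means.
# One variant among many; results are sensitive to the choice of k and to the
# distance metric, which is why multiple specifications should be compared.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "the president proposed new arts funding",
    "senators debated the arts budget",
    "the team won the championship game",
    "fans celebrated the game downtown",
]

X = TfidfVectorizer(stop_words="english").fit_transform(documents)
for k in (2, 3):                      # compare several cluster solutions
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}:", list(labels))
```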
Another promising development within the big data movement for cultural sociologists is the burgeoning field of machine learning, and specifically the field of topic modeling. This new field resulted from collaboration between linguists and computer scientists designed to identify hidden or latent themes within large corpora.[26] Topic models identify such themes using probabilistic models that evaluate the co-occurrence of words. The most popular form of topic modeling is Latent Dirichlet Allocation (LDA), which assumes a random allocation of words across a latent theme or topic and then uses a generative process of classification to analyze the probability of a document containing information about a topic given the distribution of words therein.[27] Dozens of studies have used LDA or related Bayesian approaches to infer latent topics in scientific journals, news articles, or blog posts (e.g., Blei and Lafferty 2007; Hopkins and King 2010; Quinn et al. 2010). Despite these advances, topic models have several considerable limitations. For example, the method assumes that the order of words in a document does not matter, as well as the order of documents within the broader sample. Most topic models also require that each document be assigned to mutually exclusive categories, and do not recognize relationships between topics themselves. Basic topic models also do not recognize that topics may shift or combine over time. Finally, topic models, not unlike cluster analysis, must be validated in order to verify the appropriate number of topics within a corpus.[28] This is particularly difficult given that many cultural sociologists are interested in analyzing broad, unstructured samples of text such as those described in the previous section of this article.

[25] Mohr (1998) made early calls for cultural sociologists to adopt these methods to classify meaning structures, yet they were mostly ignored even as they became widely used by cognitive anthropologists (e.g., D'Andrade 1995).
[26] For an overview of this field, see Blei (2012).
[27] For a technical overview of LDA, see Blei et al. (2003).
[28] A number of scholars have proposed validity measures for LDA, most recently Blei (2012). Most of these emphasize comparisons of topic models via log-likelihoods or harmonic means, yet most proponents of topic modeling agree that they must also be validated via qualitative inspection of individual topics within subsets of large samples.
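For readers unfamiliar with these tools, the sketch below fits an LDA model with scikit-learn on a handful of invented documents. The number of topics is an assumption supplied by the analyst and must be validated, for example by qualitative inspection of the top words per topic (see note 28 above).

```python
# Minimal sketch of LDA topic modeling with scikit-learn. The number of topics
# is an analyst-supplied assumption that requires validation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "the museum received new government arts funding",
    "congress cut the budget for arts programs",
    "researchers sequenced the genome of the virus",
    "the new vaccine trial reported strong results",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(documents)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-5:][::-1]]
    print(f"topic {i}:", top)
```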
Proponents of topic modeling have already begun to develop a number of solutions to these limitations, though they are too technical to discuss here.[29] Among the more promising recent developments in the field is the advent of "supervised" topic modeling (Blei and McAuliffe 2010). In this technique, a human coder identifies topics within a subset of documents, and topic models use these assignments to assess probability instead of assuming that the distribution of topics across documents is random. Supervised text classification was first introduced within social science by Hopkins and King (2010), who used this approach to assess public opinion of presidential candidates expressed upon thousands of political blogs during the 2008 election.[30] Given a sufficient number of training documents (produced through in-depth coding), these authors argue that their technique classifies sentiment about presidential candidates more reliably than human coders themselves.[31] While such claims have not yet been widely validated, supervised learning techniques hold considerable promise for the purpose of identifying cultural elements within texts and further improving the snowball sampling methods described above.[32]
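The sketch below is neither supervised LDA (Blei and McAuliffe 2010) nor the Hopkins-King method; it substitutes a generic supervised classifier (logistic regression over TF-IDF features) simply to illustrate how a hand-coded subset of documents can be used to label the remainder of a corpus. The documents and labels are invented.

```python
# Minimal sketch of supervised text classification: hand-coded documents train a
# model that labels the rest of the corpus. A stand-in for, not an implementation
# of, supervised LDA or ReadMe.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_docs = [
    "this candidate is doing a wonderful job",
    "a hopeful and inspiring campaign speech",
    "a dishonest and disappointing candidate",
    "the worst policy proposal of the campaign",
]
train_labels = ["positive", "positive", "negative", "negative"]  # human-coded

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_docs, train_labels)

unlabeled = ["an inspiring and honest debate performance"]
print(model.predict(unlabeled))
```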
Perhaps the most important question for cultural sociologists interested in employing
topic models is whether they can be used to classify cultural elements such as frames,
symbolic boundaries, or cultural toolkits. A number of current studies suggest topic
models may be used to capture such nuanced cultural elements. For example, Dimaggio
et al. (forthcoming) argue topic models can be used to identify frames about arts
funding. Polletta is currently using topic modeling to identify hidden frames in Internet
discussions about cap-and-trade.
33
Hopkins (2013) employs topic models to measure
frames about the Affordable Care Act. Yet a key issue remains whether cultural
framesas Goffman (1974) first defined themcan be represented by groups of
words. While the face-work that Goffman emphasized is clearly not measurable
through text, Goffman himself used texts extensively throughout his work, including
biographies, newspaper clippings, and transcripts of interactions.
34
Although Goffman
emphasized the absence of certain words as much as the presence of othersthese
omissions could be modeled effectively because they would shape the probability
distributions around groups of words that LDA analyzes to create classifications of
texts. Nevertheless, the quality of supervised topic modeling is only as good as the
codes developed by human coders themselves. Therefore, cultural elements that are
highly nuanced or situation-based are not easily captured via this technique because of
low inter-coder reliability.
35
[29] For example, see Blei and Lafferty (2006), Wallach (2006), Chang et al. (2009), and Hopkins and King (2010).
[30] See also Grimmer (2010) and Quinn et al. (2010).
[31] In particular, Hopkins and King (2010) argue that coding more than 500 documents produces diminishing returns in the reliability of automated text analysis.
[32] For example, a supervised topic model can be used to determine whether websites should be included in a directed web-crawl such as SnowCrawl to capture sites that discuss a theme or topic without using a single keyword.
[33] See Baumer et al. (2013).
[34] Consider, for example, the diaries analyzed in Goffman (1963) or the newspaper clippings in Goffman (1974). Also, textual descriptions of face-work or other unspoken forms of bodily interaction in the form of field notes could potentially be analyzed using topic models.
[35] For a discussion of the challenges of achieving high levels of inter-coder reliability in cultural analysis, see Krippendorff (2003).
On the other hand, the meticulous coding definitions required by topic models may also provide an opportunity for cultural sociologists to contribute new methodologies to the big data movement. Indeed, the use of generative and multi-stage coding schemes has been a key concern of cultural sociology in the form of "thick description" (e.g., Geertz 1973), "middle-range theory" (Merton 1949), "structural hermeneutics" (Alexander and Smith 2001), and "paradigmatic clusters" (Weber 2005). Each of these approaches emphasizes that researchers should move back and forth between different levels of analysis to tune their coding schemes and to assess the scope conditions of a particular finding. To this end, the expertise of cultural sociologists may be applied to repeated stages of supervised topic models, elaborating classification systems as if they were Russian Dolls, to borrow Bourdieu's metaphor. Mohr et al. (2014), for example, have advanced this technique in their study of US National Security Statements over a 22-year period. By developing increasingly precise codes from iterative qualitative analysis of small sub-sets of this large corpus of text, these scholars have developed increasingly promising topic models that can later be applied to the entire sample. We need further empirical validation of such techniques. At the very least, however, such methods provide a systematic way of focusing qualitative microscopes within the increasingly overwhelming world of big data.
Tracing the evolution of cultural environments
One of the most promising elements of the big data movement is that so much of the
qualitative data that has been collected is longitudinal. For example, the Library of
Congress's archive of all Twitter messages will enable unprecedented analysis of how
different issues rise and fall over time. The Internet Archive and screen-scraping
technologies could be used to map shifts in the discourses of different types of websites
over time. Likewise, the massive newspaper and television transcript archives now
available could be used to analyze similar issues over the past century. These longitu-
dinal data are particularly promising because so many of the most pressing questions in
cultural sociology concern change over time. While Swidler's (1986) toolkit analogy has received extensive attention in recent decades, for example, her call for future studies to examine the transition from unsettled to settled historical periods has been mostly ignored.[36] While Sewell's (1996) theory of events has inspired considerable interest, few studies place such events in broader historical context.[37] Finally, Lamont's
(1992) work reveals considerable cross-national differences in the salience of symbolic
boundaries. Yet we urgently need broad historical analyses to identify how such
divergent meaning systems evolved over time. Each of these outstanding questions
requires methods capable of capturing broad-scale cultural change.
In addition to identifying cultural elements such as frames or symbolic boundaries,
automated text analysis can be used to differentiate social actors or key events within large qualitative datasets.[38]

[36] But see Cerulo (1998), Wagner-Pacifici (2010), and Bail (2012).
[37] Still, historical analyses with big data are limited by the availability of texts produced during this period that were amenable to digitization. This presents a number of important limitations, including pervasive illiteracy during early historical periods as well as the tendency for only elite accounts of historical events to survive the passage of time. Still, comparative-historical sociologists face these problems regardless of whether they are working with big data. Furthermore, primary documents obtained through archival analysis can be easily digitized through photographs, scanning, and text-recognition technologies.
Cultural sociologists can make huge strides towards
advancing theories of social change simply by mapping the relations among cultural
elements, actors, and events over time. The literature on quantitative narrative analysis
has already established how analysis of relationships between actors and events can be
used to map broad historical sequences (e.g., Bearman et al. 1999; Bearman and Stovel
2000; Franzosi 2004; Smith 2007). Incorporating cultural elements identified via topic
modeling into such methods would open exciting new lines of inquiry about the
interpenetration of culture and structure. If topic modeling can be used to identify
actors and organizations as well as the cultural elements they produce, for example,
social network relationships might be mapped onto cultural patterns, or vice versa. At a minimum, mapping the relationships among cultural elements, actors, and events would help focus in-depth qualitative analysis of key historical shifts or "turning points" (Abbott 1997) where meaning structures change.[39]
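By way of illustration, candidate actors and organizations can be surfaced from raw text with a named-entity recognizer; the minimal sketch below uses the Natural Language Toolkit, one of the tools mentioned in the notes below. The sentence is invented, and the required NLTK resources must be downloaded once before use.

```python
# Minimal sketch of named-entity recognition with NLTK to surface candidate
# actors and organizations in raw text. The sentence is invented.
import nltk

for resource in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(resource, quiet=True)

sentence = "Senator Smith criticized the National Science Foundation in Washington."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))

for subtree in tree.subtrees():
    if subtree.label() in ("PERSON", "ORGANIZATION", "GPE"):
        print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))
```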
One problem, of course, is that cultural elements themselves often change throughout such broad-scale historical transformations. Sewell, for example, argues the very concept of revolution was developing at the same time that murderous mobs stormed the Bastille, setting off the French Revolution, before they knew precisely what they were doing. Topic models are ill-equipped to capture such nuances unless human coders calibrate them repeatedly across multiple time periods. Even then, slight shifts in cultural elements may be difficult to code automatically because human coders may struggle to achieve high inter-coder reliability. Here again, new tools for automated text analysis may prove useful. For example, several new methods have been developed to identify dissimilarities between pairs of documents. Primitive forms of these techniques simply count the number of words shared between the two documents. Yet recent advances in plagiarism detection software employ "word maps" that utilize data from thesauruses in order to identify "near" matches between two documents as well (e.g., Bail 2012).
Once again, these document comparison tools will not identify cultural elements by
themselves. Yet they may be particularly powerful when combined with topic models
and micro-level qualitative analysis of key texts or transitional moments within history.
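The thesaurus-based "word map" approach described above is more sophisticated, but the basic logic of pairwise document comparison can be sketched with cosine similarity over TF-IDF vectors, as in the illustrative example below (the two documents are invented).

```python
# Minimal sketch of pairwise document comparison via cosine similarity over
# TF-IDF vectors. Illustrates the basic logic only, not the word-map technique.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc_2001 = "the organization framed immigration as an economic opportunity"
doc_2008 = "the organization framed immigration as a threat to national security"

vectors = TfidfVectorizer().fit_transform([doc_2001, doc_2008])
similarity = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"similarity between the two documents: {similarity:.2f}")
```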
Another major advantage of big data is that much of it includes detailed information
about relationships between social actors. This is particularly true of social media sites
such as Twitter or Facebook, but advances in library science are also creating hyperlinks between texts within archival collections as well. Using Twitter's Application Programming Interface, one can easily extract not only all the messages produced by a single actor, but also the precise location of this actor within a broader social network, including measures of both "in" and "out" degree. Livne et al. (2011), for example, extracted 460,000 tweets from all candidates for US House, Senate, and gubernatorial elections between 2006 and 2010. Their data not only reveal the partisan networks of such social actors, but also patterns in the similarity of the language they post on Twitter via cluster analysis. Through this analysis, Livne et al. document the meteoric rise of the Tea Party in recent elections, and the realignment of mainstream conservative networks that ensued. These and other datasets could be used to address a number of key questions at the intersection of cultural sociology and network theory. Consider, for instance, Pachucki and Breiger's (2010) argument about "cultural holes" within networks, or Vaisey and Lizardo's (2010) theory that cultural worldviews influence network composition.[40]

[38] If key actors or events are already known, simple keyword searches or Global Regular Expression Print (GREP) commands may also be used to identify them. If actors or events are not known, they can be identified through keyword counts that remove common words such as "the" or "and." Once actors or events are defined, topic models may be used to identify them as well. A number of computer scripts have also been recently developed to identify names within big data without such intermediary steps, such as the Natural Language Toolkit and the Stanford Parser.
[39] See also Sewell (1996) and Wagner-Pacifici (2010).
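As a small illustration of the "in" and "out" degree measures mentioned above, the sketch below builds a directed follower network with the networkx library. The edge list is invented; in practice it would come from an API such as Twitter's.

```python
# Minimal sketch of computing "in" and "out" degree from follower ties with
# networkx. The edge list is invented.
import networkx as nx

# A directed edge (a, b) means account "a" follows account "b".
follows = [("alice", "senator_x"), ("bob", "senator_x"), ("senator_x", "party_hq")]
G = nx.DiGraph(follows)

for node in G.nodes:
    print(node, "in-degree:", G.in_degree(node), "out-degree:", G.out_degree(node))
```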
The potential to assemble large datasets that describe cultural elements, actors,
events, and social networks over time may also encourage critical advances in field
theory. Many of the most pressing questions in this literature are about the evolution of fields over time (Fligstein and McAdam 2011; Padgett and Powell 2012). For instance, a number of recent studies have begun to analyze the emergence of fields (e.g., Armstrong 2002; Bartley 2007). By and large, these case studies are unable to investigate a variety of broad cultural processes that may occur between discursive fields. For example, do most fields emerge out of the dissolution of others? Or, do fields develop when the space between any two pre-existing fields is sufficiently broad (Eyal 2009; Medvetz 2012)? Big data may also enable analysis of a number of intriguing
questions within individual fields as well. For example, do discursive fields have
carrying capacities for new forms of culture? Do certain actors gain power within
discursive fields by exploiting niches between rival factions? Or, what is the relation-
ship between the core and periphery of discursive fields (e.g., Bail 2012)?
Another exciting feature of big data is that it often includes geo-coded data. For
example, Twitter and Facebook record the geographic location of their users. This
information is also often recorded in the comments sections of websites. Finally, analytics or "insights" data often include the latitude and longitude of visitors to
different websites via Internet Protocol (IP) addresses or other geographic identifiers
such as city names. Political scientists have even mined visual data on ethnic conflict
from Google Earth (Agnew et al. 2008). The potential to look at the relationship
between Cartesian coordinates and cultural elements could create a new subfield within
cultural sociology that analyzes the geography of meaning. Such a field might examine
questions such as: 1) Do cultural frames or symbolic boundaries cluster at the national
level or supranational levels? 2) Does physical proximity breed more convergence of
worldviews than online interaction? Finally, does the answer to these two questions
change over time as the forces of globalization push people ever closer together?
Conclusion
Cultural sociology has long suffered from an imbalance of theory and data (Ghaziani
2009). Yet the big data movement may radically alter this equilibrium. The big data
movement began with the Internet and social media, but the future of the field will also
entail increasingly ambitious forays into the past. As digitized historical archives
continue to expand and social scientists coordinate new ways of organizing qualitative
meta data with rich detail about the evolution of meaning, cultural sociologists can no
longer afford to ignore the big data movement. Above, I argued that integration of in-depth qualitative coding techniques pioneered by cultural sociologists and anthropologists can be leveraged to improve already powerful automated text analysis techniques produced by computer scientists, linguists, and political scientists. This synthesis will enable cultural sociologists to achieve theoretical progress on questions that were once thought un-measurable. Proponents of big data may also gain key insight from cultural sociologists about how to further hone their tools to map the contours of cultural fields, classify cultural elements, and trace the evolution of culture over time.

[40] On the concept of cultural holes, see also Lizardo (in this issue).
Yet for all of my optimism about the marriage of cultural sociology and big data,
formidable obstacles remain. Perhaps the most vexing problem is that big data often
does not include information about the social context in which texts are produced
(Griswold and Wright 2004). Although we are able to collect millions of blog posts
about virtually any issue, these data typically include little or no information about the
authors of such posts, or those who comment upon them. Twitter data are publicly
available, but provide very little information about the social context in which Tweets
are produced. Other sites such as Facebook collect massive amounts of data about
social context but are often unable to share them with researchers because of concerns
about user privacy. Sources of big data outside social media also often lack important
information about the social context in which texts are produced. Collecting every
newspaper article on a political topic is of marginal utility absent in-depth analysis of
the political and institutional processes that lead media to gravitate towards one issue
over another.[41]
Yet these obstacles are not without solutions that might build upon the
progress of cultural sociologists in developing mixed-method research designs. For
example, qualitative or quantitative surveys of Twitter users could be conducted to
place their online behavior within broader context. Or, large-scale analyses of media
data or historical surveys might be used to identify compelling puzzles for comparative
historical analysis. In theory, big data could also be used to guide ethnographic
interventions as well, or at least help place the findings of ethnography within broader cultural fields. In brief, big data methods should be viewed as a complement, not a replacement, for the tried and tested techniques of cultural sociology.
A second major challenge is that computer-assisted coding can never be more
reliable than the codes themselves. Cultural sociologists seldom discuss coding criteria
or inter-coder reliability, in part because the definition of many of our core concepts is
highly contested (Biernacki 2012). One need only read the literature on framing, for example, to witness significant disagreement about whether and how frames should be measured or operationalized.[42]
While these debates will not be easily resolved, the
integration of big data and cultural sociology will depend critically upon our capacity to
converge upon several broadly accepted definitions of these core concepts. Yet big data
may actually facilitate such conversations, since conceptual vagueness among cultural
sociologists results in part from our paucity of shared datasets. Cultural sociologists are
also looking across disciplinary lines for guidance in making core concepts more
concrete.

[41] It is also worth noting that texts that cannot be collected because they are not in the public domain may ultimately have less impact upon the evolution of broader cultural domains precisely because they are hidden from public view. This underlies a broader pragmatist argument about the need to focus attention upon consequences of social action (e.g., Johnson-Hanks et al. 2011; Tavory and Timmermans 2013). An interesting analogue is the debate about the social construction of ethnicity via the enumeration of different groups by the US Census (cf. Loveman and Muniz 2007). I thank Andy Perrin for bringing this issue to my attention.
[42] For a detailed analysis of conceptual and methodological ambiguities in the measurement of frames, see Scheufele (1999).

For example, Mohr et al. (2013) have fused the literatures on narrative from linguistics with studies of social networks and topic modeling from sociology and
computer science. Polletta is currently synthesizing linguistics and cultural sociology
using new visualization techniques that enable them to explore how making people
aware of their cultural schemas shapes their behavior during democratic deliberation.[43]
Finally, Ignatow and Mihalcea (2013) propose a model for big data analysis that
synthesizes neuroscience and Bourdieusian practice theory.
A final concern for cultural sociologists is the relatively high entry cost for those who wish to develop the technical expertise currently necessary to work with big data.[44] Although these costs are rapidly decreasing thanks to simple web-based tools for big data analyses, formalizing these techniques for cultural sociology will require a new generation of scholars with both technical expertise and theoretical ambition. For now, the big data movement urgently requires the guidance of theoretically and qualitatively oriented cultural sociologists. Little can be learned from big data without big thinking. While data mining may reveal interesting patterns in large text corpora or compelling visualizations, many pieces of hay have come to resemble needles.[45] Therefore, the future of the big data movement hinges upon collaboration among cultural sociologists, computer scientists, and others to teach computers to differentiate different types of meaning and their shifting relationships over time.

[43] See Baumer et al. (2013).
[44] Efforts are currently underway to make the collection and analysis of big data possible for those without a computer programming background. Gary King and colleagues are producing a web-based tool named "Consilience" that will enable cluster analysis of unstructured text. Primitive forms of topic modeling and sentiment analysis are available via a variety of web-based software programs as well, such as www.discovertext.com. Finally, there is a variety of high quality tutorials available online for those who wish to develop basic programming skills for working with big data. For example, see http://nealcaren.web.unc.edu/big-data/ and http://www.chrisbail.net/p/software.html. A complete list of tutorials is available at http://www.chrisbail.net/p/big-data.html.
[45] See Steve Lohr, "The Age of Big Data," The New York Times, February 11, 2012.
Acknowledgments I thank Elizabeth Armstrong, Alex Hanna, Gabe Ignatow, Charles Kurzman, Brayden
King, Jennifer Lena, John Mohr, Terry McDonnell, Andy Perrin, and Steve Vaisey for helpful comments on
previous drafts. The Robert Wood Johnson Foundation and the Odum Institute at the University of North
Carolina provided financial support for this research.
References
Abbott, A. (1995). Things of boundaries. Social Research, 62(4), 857–882.
Abbott, A. (1997). On the concept of turning point. Comparative Social Research, 16, 85–106.
Abbott, A. (2001). Chaos of disciplines. Chicago: University of Chicago Press.
Agnew, J., Gillespie, T., Gonzalez, J., & Min, B. (2008). Baghdad nights: Evaluating the US military surge
using nighttime light signatures.
Alexander, J. (2006). The civil sphere. Oxford: Oxford University Press.
Alexander, J., & Smith, P. (2001). The strong program in cultural theory: Elements of a structural hermeneutics. In J. H. Turner (Ed.), Handbook of sociological theory (pp. 135–150). New York: Springer.
Armstrong, E. A. (2002). Forging gay identities: Organizing sexuality in San Francisco, 1950–1994. Chicago:
University of Chicago Press.
Bail, C. (2012). The fringe effect: civil society organizations and the evolution of media discourse about Islam, 2001–2008. American Sociological Review, 77(7), 855–879.
Bail, C. (2013a). Winning minds through hearts: Organ donation advocacy, emotional feedback, and
social media. Working Paper, Department of Sociology, University of North Carolina at Chapel
Hill.
Bail, C. (2013b). Taming big data: Apps and the future of survey research. Working Paper, Department of
Sociology, University of North Carolina, Chapel Hill.
Bail, C. A. (forthcoming). Terrified: How anti-Muslim organizations became mainstream. Princeton: Princeton University Press.
Barth, F. (1969). Ethnic groups and boundaries: The social organization of cultural difference. Boston: Little,
Brown.
Bartley, T. (2007). How foundations shape social movements: the construction of an organizational field and the rise of forest certification. Social Problems, 54(3), 229–255.
Baumer, E. P. S., Polletta, F., Pierski, N., Celaya, C., Rosenblatt, K., & Gay, G. K. (2013, February).
Developing computational supports for frame reflection. Retrieved from http://hdl.handle.net/2142/
38374.
Bearman, P., & Stovel, K. (2000). Becoming a Nazi: a model for narrative networks. Poetics, 27(2), 69–90.
Bearman, P., Faris, R., & Moody, J. (1999). Blocking the future: new solutions for old problems in historical social science. Social Science History, 23(4), 501–533.
Benford, R., & Snow, D. (2003). Framing processes and social movements: An overview and assessment.
Biernacki, R. (2012). Reinventing evidence in social inquiry: Decoding facts and variables. New York:
Palgrave Macmillan.
Blei, D. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77–84.
Blei, D., & Lafferty, J. (2006). Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning (pp. 113–120). New York: ACM.
Blei, D., & Lafferty, J. (2007). A correlated topic model of science. The Annals of Applied Statistics, 1(1), 17–35.
Blei, D., & McAuliffe, J. (2010). Supervised topic models. arXiv:1003.0783. Retrieved from http://arxiv.org/abs/1003.0783.
Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Bollen, J., Mao, H., & Zeng, X. (2011). Twitter mood predicts the stock market. Journal of Computational Science, 2(1), 1–8. doi:10.1016/j.jocs.2010.12.007.
Bourdieu, P. (1975). The specificity of the scientific field and the social conditions of the progress of reason. Social Science Information, 14(6), 1–19.
Bourdieu, P. (1985). The social space and the genesis of groups. Theory and Society, 14(6), 723–744. doi:10.1007/BF00174048.
Bourdieu, P. (1990). Homo Academicus (1st ed.). Stanford: Stanford University Press.
Cerulo, K. A. (1998). Deciphering violence: The cognitive structure of right and wrong (1st ed.). New York:
Routledge.
Chang, J., Boyd-Graber, J., Gerrish, S., Wang, C., & Blei, D. M. (2009). Reading tea leaves: How humans interpret topic models. Advances in Neural Information Processing Systems, 22.
Collins, R. (2013). Solving the Mona Lisa smile, and other developments in visual micro-sociology. Working
Paper, Department of Sociology, University of Pennsylvania.
D'Andrade, R. G. (1995). The development of cognitive anthropology. Cambridge: Cambridge University Press.
DiMaggio, P. (1997). Culture and cognition. Annual Review of Sociology, 23, 263–287.
DiMaggio, P., & Bonikowski, B. (2008). Make money surfing the web? The impact of internet use on the earnings of U.S. workers. American Sociological Review, 73(2), 227–250. doi:10.1177/000312240807300203.
DiMaggio, P., Hargittai, E., Neuman, W. R., & Robinson, J. (2001). Social implications of the internet. Annual Review of Sociology, 27, 307–336.
DiMaggio, P., Nag, M., & Blei, D. (forthcoming). Exploiting affinities between topic modeling and the sociological perspective on culture: Application to newspaper coverage of government arts funding in the U.S. Poetics, page numbers unknown.
Douglas, M. (1966). Purity and danger: An analysis of concepts of pollution and taboo. New York: Praeger.
Douglas, M. (1986). How institutions think. Syracuse: Syracuse University Press.
Eliasoph, N., & Lichterman, P. (2003). Culture in interaction. American Journal of Sociology, 108(4), 735–794.
Espeland, W. N., & Stevens, M. L. (1998). Commensuration as a social process. Annual Review of Sociology, 24, 313–343. doi:10.2307/223484.
Evans, R., & Kay, T. (2008). How environmentalists "greened" trade policy: strategic action and the architecture of field overlap. American Sociological Review, 73(6), 970–991. doi:10.1177/000312240807300605.
Eyal, G. (2009). The space between fields. Working Paper, Center for Comparative Research, Yale University.
Fligstein, N., & McAdam, D. (2011). Toward a general theory of strategic action fields. Sociological Theory, 29(1), 1–26.
Foucault, M. (1970). The order of things: An archaeology of the human sciences (1st ed.). New York: Vintage.
Franzosi, R. (2004). From words to numbers: Narrative, data, and social science. Cambridge: Cambridge
University Press.
Franzosi, R. (2009). Quantitative narrative analysis (1st ed.). Thousand Oaks: SAGE Publications, Inc.
Gaby, S., & Caren, N. (2012). Occupy online: how cute old men and Malcolm X recruited 400,000 U.S. users to OWS on Facebook. Social Movement Studies, 11, 367–374.
Geertz, C. (1973). The interpretation of cultures: Selected essays. New York: Basic Books.
Ghaziani, A. (2009). An amorphous mist? The problem of measurement in the study of culture. Theory and Society, 38(6), 581–612. doi:10.1007/s11186-009-9096-2.
Ghaziani, A., & Baldassarri, D. (2011). Cultural anchors and the organization of differences. American Sociological Review, 76(2), 179–206. doi:10.1177/0003122411401252.
Gieryn, T. F. (1999). Cultural boundaries of science: Credibility on the line (1st ed.). Chicago: University of Chicago Press.
Goffman, E. (1963). Stigma: Notes on the management of spoiled identity. New York: Touchstone.
Goffman, E. (1974). Frame analysis. Cambridge: Harvard University Press.
Gold, M. K. (2012). Debates in the digital humanities. Minneapolis: U of Minnesota Press.
Golder, S. A., & Macy, M. W. (2011). Diurnal and seasonal mood vary with work, sleep, and day length across diverse cultures. Science, 333(6051), 1878–1881. doi:10.1126/science.1202775.
Gong, A. (2011). An automated snowball census of the political web. SSRN eLibrary. Retrieved from http://
papers.ssrn.com/sol3/papers.cfm?abstract_id=1832024.
Grimmer, J. (2010). A Bayesian hierarchical topic model for political texts: measuring expressed agendas in Senate press releases. Political Analysis, 18(1), 1–35.
Grimmer, J., & King, G. (2011). General purpose computer-assisted clustering and conceptualization. Proceedings of the National Academy of Sciences, 108(7), 2643–2650. doi:10.1073/pnas.1018067108.
Griswold, W., & Wright, N. (2004). Wired and well read. In Society online: The internet in context. New York: Sage.
Hopkins, D. (2013). The exaggerated life of death panels: The limits of framing effects in the 2009–2012 health care debate. Working Paper, SSRN.
Hopkins, D. J., & King, G. (2010). A method of automated nonparametric content analysis for social science. American Journal of Political Science, 54(1), 229–247. doi:10.1111/j.1540-5907.2009.00428.x.
Ignatow, G., & Mihalcea, R. (2013). Text mining for comparative cultural analysis. Working Paper,
Department of Sociology, University of North Texas.
Johnson-Hanks, J., Bachrach, C., Morgan, P., & Kohler, H.-P. (2011). Understanding family change and variation: toward a theory of conjunctural action. Understanding Population Trends and Processes, 5, 1–179.
Kaufman, J. (2004). Endogenous explanation in the sociology of culture. Annual Review of Sociology, 30, 335–357.
King, G. (2011). Ensuring the data rich future of the social sciences. Science, 331(11 February), 719–721.
Krippendorff, K. H. (2003). Content analysis: An introduction to its methodology (2nd ed.). Thousand Oaks:
Sage Publications, Inc.
Lamont, M. (1992). Money, morals, and manners: The culture of the French and American upper-middle
class. Chicago: University of Chicago Press.
Lamont, M. (2000). The dignity of working men: Morality and the boundaries of race, class, and immigration. New York: Russell Sage.
Lamont, M. (2012). Toward a comparative sociology of valuation and evaluation. Annual Review of Sociology, 38, 201–221.
Lamont, M., & White, P. (2009). The evaluation of systematic qualitative research in the social sciences.
Report of the U.S. National Science Foundation.
Lan, T., & Raptis, M. (2013). From subcategories to visual composites: A multi-level framework for object
recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Latour, B. (1988). How to follow scientists and engineers through society. Cambridge: Harvard University Press.
Lazer, D., Pentland, A., Adamic, L., Aral, S., Barabasi, A.-L., Brewer, D., et al. (2009). Computational social science. Science, 323(5915), 721–723. doi:10.1126/science.1167742.
Lewis, K., Kaufman, J., Gonzalez, M., Wimmer, A., & Christakis, N. (2008). Tastes, ties, and time: a new social network dataset using Facebook.com. Social Networks, 30(4), 330–342. doi:10.1016/j.socnet.2008.07.002.
Lieberson, S. (2000). A matter of taste: How names, fashions, and culture change. New Haven: Yale
University Press.
Livne, A., Simmons, M. P., Adar, E., & Adamic, L. (2011). The party is over here: Structure and content in the 2010 election. Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media, 201–209.
Loveman, M., & Muniz, J. (2007). How Puerto Rico became white: boundary dynamics and inter-census racial classification. American Sociological Review, 72, 915–939.
Manning, C. D., & Schuetze, H. (1999). Foundations of statistical natural language processing (1st ed.).
Cambridge: The MIT Press.
Mark, N. P. (2003). Culture and competition: homophily and distancing explanations for cultural niches. American Sociological Review, 68(3), 319–345. doi:10.2307/1519727.
Martin, J. L. (2003). What is field theory? American Journal of Sociology, 109(1), 1–49.
Medvetz, T. (2012). The rise of think tanks in America: Merchants of policy and power. Chicago: University of Chicago Press.
Merton, R. (1949). Social theory and social structure. New York: The Free Press.
Mische, A. (2008). Partisan publics: Communication and contention across Brazilian youth activist networks. Princeton: Princeton University Press.
Mohr, J. (1998). Measuring meaning structures. Annual Review of Sociology, 24, 345–370.
Mohr, J., & Guerra-Pearson, F. (2010). The duality of niche and form: The differentiation of institutional space in New York City, 1888–1917. In Categories in markets: Origins and evolution (pp. 321–368). New York: Emerald Group Publishing.
Mohr, J., Singh, A., & Wagner-Pacifici, R. (2013). CulMINR: Cultural meanings from the interpretation of
narrative and rhetoric: A dynamic network approach to hermeneutic mining of large text corpora.
Working Paper, Department of Sociology, University of California, Santa Barbara.
Mohr, J., Wagner-Pacifici, R., Breiger, R., & Bogdanov, P. (2014). Graphing the grammar of motives in National Security Strategies: cultural interpretation, automated text analysis, and the drama of global politics. Poetics, 41(6), 670–700.
Moretti, F. (2013). Distant reading. London: Verso.
Pachucki, M. A., & Breiger, R. L. (2010). Cultural holes: beyond relationality in social networks and culture. Annual Review of Sociology, 36(1), 205–224. doi:10.1146/annurev.soc.012809.102615.
Padgett, J. F., & Powell, W. W. (2012). The emergence of organizations and markets. Princeton: Princeton
University Press.
Paul, M. J., & Dredze, M. (2011). You are what you tweet: Analyzing Twitter for public health. Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media.
Quinn, K. M., Monroe, B. L., Colaresi, M., Crespin, M. H., & Radev, D. R. (2010). How to analyze political attention with minimal assumptions and costs. American Journal of Political Science, 54(1), 209–228. doi:10.1111/j.1540-5907.2009.00427.x.
Scheufele, D. A. (1999). Framing as a theory of media effects. The Journal of Communication, 49(1), 103–122. doi:10.1111/j.1460-2466.1999.tb02784.x.
Sewell, W. (1996). Historical events as transformations of structures: inventing revolution at the Bastille. Theory and Society, 25(6), 841–881. doi:10.1007/BF00159818.
Smith, T. (2007). Narrative boundaries and the dynamics of ethnic conflict and conciliation. Poetics, 35, 22–46.
Swidler, A. (1986). Culture in action: symbols and strategies. American Sociological Review, 51(2), 273–286.
Swidler, A. (1995). Cultural power and social movements. In Social movements and culture. London:
Routledge.
Tangherlini, T. R., & Leonard, P. (2013). Trawling in the Sea of the Great Unread: sub-corpus topic modeling
and Humanities research. Poetics. doi:10.1016/j.poetic.2013.08.002.
Tavory, I., & Timmermans, S. (2013). Consequences in Action: A pragmatist approach to causality in
ethnography. Working Paper, New School for Social Research.
Vaisey, S., & Lizardo, O. (2010). Can cultural worldviews influence network composition? Social Forces, 88(4), 1595–1618. doi:10.1353/sof.2010.0009.
Wagner-Pacifici, R. (2010). Theorizing the restlessness of events. American Journal of Sociology, 115(5), 1351–1386.
Wallach, H. (2006). Topic modeling: Beyond bag of words. Proceedings of the 23rd International Conference on Machine Learning.
Weber, K. (2005). A toolkit for analyzing corporate cultural toolkits. Poetics, 33(3–4), 227–252. doi:10.1016/j.poetic.2005.09.011.
Wuthnow, R. (1993). Communities of discourse: Ideology and social structure in the reformation, the
enlightenment, and European socialism. Cambridge: Harvard University Press.
Zelizer, V. A. R. (1985). Pricing the priceless child: The changing social value of children. Princeton:
Princeton University Press.
Christopher A. Bail is Assistant Professor of Sociology at the University of North Carolina, Chapel Hill. His research interests include cultural sociology, political sociology, organizations, and mixed-method research designs. He is currently completing a manuscript entitled Terrified: How Anti-Muslim Organizations Became Mainstream. His other work has appeared in the American Sociological Review and the Revue Européenne des Migrations Internationales.