
Finding Cultural Holes: How Structure and Culture Diverge in Networks of Scholarly Communication


Abstract

Divergent interests, expertise, and language form cultural barriers to communication. No formalism has been available to characterize these “cultural holes.” Here we use information theory to measure cultural holes and demonstrate our formalism in the context of scientific communication using papers from JSTOR. We extract scientific fields from the structure of citation flows and infer field-specific cultures by cataloging phrase frequencies in full text and measuring the relative efficiency of between-field communication. We then combine citation and cultural information in a novel topographic map of science, mapping citations to geographic distance and cultural holes to topography. By analyzing the full citation network, we find that communicative efficiency decays with citation distance in a field-specific way. These decay rates reveal hidden patterns of cohesion and fragmentation. For example, the ecological sciences are balkanized by jargon, whereas the social sciences are relatively integrated. Our results highlight the importance of enriching structural analyses with cultural data.
Daril A. Vilhena (a), Jacob G. Foster (b), Martin Rosvall (c), Jevin D. West (a), James Evans (d), Carl T. Bergstrom (a)
a) University of Washington; b) University of California—Los Angeles; c) Umeå University; d) University of Chicago
Keywords: cultural holes; jargon; scholarly communication; content analysis; complex networks; information theory
Editor(s): Jesper Sørensen, Delia Baldassarri; Received: December 20, 2013; Accepted: February 8, 2014; Published: June 9, 2014
Vilhena, Daril A., Jacob G. Foster, Martin Rosvall, Jevin D. West, James Evans, and Carl T. Bergstrom. 2014. “Finding Cultural Holes: How
Structure and Culture Diverge in Networks of Scholarly Communication.” Sociological Science 1: 221-238. DOI: 10.15195/v1.a15
Copyright: © 2014 Vilhena, Foster, Rosvall, West, Evans, and Bergstrom. This open-access article has been published and distributed under a Creative
Commons Attribution License, which allows unrestricted use, distribution and reproduction, in any form, as long as the original author and source have
been credited.
Structural holes have provided a fruitful
framework for analyzing the benefits that
people and institutions reap from their location
in a social network. Those whose networks span
gaps in the social fabric obtain information, re-
sources, and control through brokering social ac-
tors on either side. They also experience greater
freedom of action (Burt, 1992). Those with no
bordering holes experience greater competition
for resources and increased constraint. Riffing on
this generative term, Pachucki and Breiger coined
the companion concept of “cultural holes” to label
an emerging theme in cultural analysis (Pachucki
and Breiger, 2010). This theme emphasizes that
common culture—shared meanings, tastes, and
interests—enables ties between individuals and
institutions. When common culture is absent,
the resulting “cultural hole” makes existing ties
impoverished and new ties improbable or impos-
sible. In other words, gaps in the cultural fabric
may make it problematic or profitless to bridge
coinciding gaps in the social fabric (Xiao and
Tsui, 2007). Social actors who are structurally or
physically proximate may be kept apart by deep
divergences in matters of concern.
Despite widespread interest in the notion of
cultural holes, no unifying measurement frame-
work like Burt’s (1992) calculus of structural con-
straint has emerged to locate or quantify them.
This is largely to be expected, given their re-
cent introduction. But formalization is made
doubly difficult by the breadth of what sociolo-
gists label “culture,” from tastes (Erickson, 1996;
Lizardo, 2006; Vaisey and Lizardo, 2010) and val-
ues (Homans, 1961) to formal differences in artis-
tic genres (Han, 2003; Sonnett, 2004) or rifts be-
tween institutional logics (Friedland, 2009). Nev-
ertheless, underlying the apparent diversity of
cultural objects is a common capacity to circu-
late. This suggests an analytical focus on human
communication, construed here as the processes
through which values, tastes, styles, and logics
are transmitted. Indeed, cultural circulation is
typically analyzed through its symbolic or linguis-
sociological science | 221 June 2014 | Volume 1
Vilhena, Foster, Rosvall, West, Evans, and Bergstrom Finding Cultural Holes
tic traces, and scholars of culture have explored
the semiotics of everything from slang, gesture,
and clothing to electronics and condoms (Tavory
and Swidler, 2009). Even idiosyncratic cultural
variations like the “important matters” probed by
the General Social Survey are revealed through
communication and comprise unusual, chatty top-
ics like “eating less red meat” or “cloning head-
less frogs” (Bearman and Parigi, 2004). In other
words, communication reflects a vast array of
cultural objects and differences, ranging from
matters of concern to forms of personal expression.
Insofar as culture is characterized by its shared,
communicated quality—and differences in com-
municated content reflect underlying cultural
differences in taste and interest—we define cul-
tural holes in terms of the communicative burden
placed on parties to an interaction. To put an eth-
nomethodological gloss on this definition, more
work is required to sustain an interaction when
underlying cultural differences lead one or both
parties to transmit a stream of unfamiliar and un-
expected symbols or behaviors (Garfinkel, 1991).
More concretely, consider “epistemic cultures” in
science (Knorr-Cetina, 1999) and their use of
culture-specific language, or jargon. Jargon re-
flects a culture’s matters of special concern, focus,
and expertise, often through the coinage of com-
pressed terms for frequently–used referents (ar-
guably, this is exactly what Pachucki and Breiger
have done with “cultural hole”). “Every science
requires a special language,” wrote the French
philosopher Etienne Bonnot de Condillac, “be-
cause every science has its own ideas” (de Condil-
lac, 1782). This maxim makes two points that are
often overlooked in the quantitative analysis of
scientific communication, although these points
generalize to all forms of human communication.
First, each (scientific) culture has an extensive
mental catalog of concepts that reflect distinct
concerns (de Condillac’s “ideas”). Second, (sci-
entific) cultures often develop specialized terms
(jargon) to refer efficiently to the most common
concepts. Discipline-specific jargon allows scien-
tists to communicate with other members of their
discipline more concisely and precisely than would
be possible using everyday language. When an
evolutionary biologist uses the term fitness land-
scape, this is a highly condensed shorthand for
comparing expected relative reproductive success
across multiple genotypes. An intertidal ecologist
would require little further explanation to under-
stand fitness landscape, but a financial economist
might need to spend a long time on Wikipedia
before she got a handle on it. Compared to the
ecologist, the economist would need to put more
interpretive work into sustaining the interaction.
In other words, not only does every science have
its own ideas, but these ideas, and their associ-
ated jargon, may or may not overlap with those
dear to other disciplines.
Note that the same
linguistic compressions occur in a regional dialect,
a subcultural lingo, or a criminal argot, although
the balance of intentions behind the compression—
efficient communication with insiders or cultural
distinction from and exclusion of outsiders—may
vary from case to case.
These simple observations have profound im-
plications for the study of culture and its commu-
nication. In science, information does not flow
seamlessly from author to publication to reader.
It is expressed in a particular language, reflecting
particular concerns, with important consequences
for its transmission. Jargon allows scientists to
communicate new results quickly and effectively
within the context of discipline-specific paradigms
and interests, but it inhibits efficient knowledge
transfer in many other situations (see Basil Bern-
stein’s work on restricted and elaborated codes
(Bernstein, 1964) for a similar concept). To list
just a few examples, specialist language impedes
communication when sharing medical informa-
tion with a patient (Reach, 2009), publishing ma-
terial intended for public outreach (Richardson,
2010), and presenting technical information to a
multidisciplinary audience (Bischof and Eppler,
2010). Indeed, jargonistic compression can be
used intentionally to obscure information (jargon
as encryption) or reify cultural boundaries (jargon
as shibboleth; Sokal and Bricmont, 1998).
Technical language in patents provides an ex-
treme case of deliberate obscurantism (Feldman,
2008). In other words, jargon accelerates the flow
of information within disciplines by compressing
language, but impedes communication between
disciplines and makes knowledge transfer more
difficult. As interests diverge and jargon becomes more central to scientific discourse in an area, it creates a chasm, a cultural hole, that impedes easy access from outside.
Consider “backward induction” versus “satisficing” as explanations for action; the underlying ideas and beliefs are not only distinct but mutually exclusive. We thank Gabriel Rossman for suggesting this example.
This trade–off between efficient communica-
tion within local cultures and inefficient commu-
nication between them merits closer attention.
No general framework has been available for ex-
ploring how jargon and the underlying patterns
of interest affect the structure of communication,
and vice versa. In this article, we introduce an
information-theoretic model of communication
and develop a simple measure of cultural holes
with a clear qualitative interpretation. To illus-
trate our method, we deploy it on a large col-
lection of scientific papers in the JSTOR corpus.
These articles have both full–text and citation in-
formation. This allows us to relate cultural infor-
mation (captured by the full text) with structural
traces of interaction (captured by the citation net-
work). In citation networks, nodes represent pa-
pers and directed links represent citations among
papers. We use the citation network to extract
the macroscopic structure of scientific fields in
JSTOR, building on extensive work that maps
science in this way, e.g., (Rosvall and Bergstrom,
2008; Leydesdorff and Rafols, 2008; Small, 1999;
Boyack et al., 2005); see “Data and Methods” for
an extended technical description.
Note that our
use of the structural information in citation net-
works to identify groups is entirely consistent with
the parallel between structural and cultural holes
set up by Pachucki and Breiger. As suggested by
their account, we expect dense citations within a
field to signal deep linkages of mutual agreement
and awareness, or common culture. Conversely,
we anticipate sparse citations or “structural holes”
between fields to broadly coincide with cultural
holes of varying depth. In the analysis that fol-
lows, we test these two contentions.
Although we apply our cultural holes measure to sci-
entific fields as identified using the map equation, this
approach can be applied to any meaningful grouping of
documents, authors, or corpora and can scale down to
individual journals, authors, and articles. Indeed, a simi-
lar methodology is currently in use to screen submissions
to the online preprint server ArXiv (P. Ginsparg, private
communication, February 28, 2013). Articles with ex-
cessive jargon distance from existing fields are likely to
be produced by authors outside the mainstream research
community, on topics from perpetual motion machines to
novel theories of everything.
Using Information Theory to Model Scholarly Culture and Communication
To quantify the communicative burden imposed
by cultural holes, we construct a simple model of
symbolic communication. To make the discussion
concrete, we describe the model in the context
of science, but it is very general in scope. We
first consider a model for optimal communication
within a given scientific culture/field; then we
quantify the penalty for using these languages
between fields. In the “Data and Methods” sec-
tion, we show how to operationalize all building
blocks of our model using a corpus that combines
structural (citation) and cultural (text) traces.
Imagine a writer communicating with a reader through a channel, for example, a scientific article (Shannon, 1948). Let $\mathcal{X}$ denote the space of all phrases that the writer and the reader might use to communicate; these phrases broadly correspond to concepts. The writer is characterized by a codebook $P_i$ that maps from phrases to codewords; the subscript $i$ denotes the writer’s field of science or scholarship. The codebook $P_i$ has a corresponding probability distribution over a random variable $X_i$, with values $x \in \mathcal{X}$. This probability distribution tracks the importance of each phrase in field $i$; important phrases are used frequently, for example, fitness landscape in evolutionary biology. The writer generates a message by drawing phrases at random with probability $p_i(x)$. In other words, she chooses phrases in proportion to their importance in her particular scientific culture. Now imagine that she transcribes the phrases into codewords from whatever “language” is appropriate for her reader’s scholarly field, using that field’s codebook $P_j$. We assume that the language of each scientific field is optimized based on how frequently a given phrase is used. This assumption is commonly used to explain the power-law distribution of English words (Zipf, 1935, 1949). The optimum codeword for a phrase $x$ used in field $i$ with probability $p_i(x)$ has length $-\log_2 p_i(x)$ in bits (Cover and Thomas, 2006). We use codeword length as a proxy for interpretive effort. Short codewords take less time and effort to “unpack” than long codewords. In essence, we turn the Zipfian principle of least effort around; assuming field-specific language is optimized for internal consumption, we can use phrase frequency and hence codeword length to infer interpretive effort.
Of course, unless she is writing explicitly for an interdisciplinary audience, she will write in the language of her own field. The transcription process described here is a useful fiction, allowing us to estimate the effort required when readers from various fields decode the resulting text.
First, consider a scientist from field $i$ writing a message to a reader from the same field. Writer and reader have the same probability distribution and codebook, denoted by the blue boxes.
[Diagram: Writer and Reader, each shown in a blue box.]
The writer selects phrase $x$ with probability $p_i(x)$. Because she and her reader have the same codebook, she encodes $x$ with a codeword of length $-\log_2 p_i(x)$. The expected message length per phrase is simply the Shannon entropy of $X_i$ given probability distribution $p_i(x)$:
$$H(X_i) = -\sum_{x \in \mathcal{X}} p_i(x) \log_2 p_i(x). \tag{1}$$
This result follows directly from the Shannon
source coding theorem (Shannon, 1948). Indeed,
this is the most efficient encoding of messages
generated by sampling phrases $x \in \mathcal{X}$ with the
writer’s probability distribution $p_i(x)$. In cultures
where many phrases are used with equal
probability (i.e., probability mass is spread evenly
across phrases), the entropy will be large, as will
the average message length per phrase. If some
phrases are used very frequently, the entropy will
be smaller. These two situations model the effi-
ciency of jargon for communication within fields.
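As a minimal sketch of the two situations just described, the entropy in Equation (1) can be computed directly from a phrase distribution. The function name `entropy_bits` and the two toy distributions are illustrative assumptions, not values from the JSTOR corpus.

```python
import math

def entropy_bits(p):
    """Shannon entropy H(X_i) = -sum_x p_i(x) * log2 p_i(x), in bits per phrase."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

# Hypothetical four-phrase cultures (invented for illustration).
uniform = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}  # mass spread evenly
peaked = {"a": 0.85, "b": 0.05, "c": 0.05, "d": 0.05}   # one dominant phrase

print(entropy_bits(uniform))  # 2.0 bits per phrase
print(entropy_bits(peaked))   # about 0.85 bits per phrase
```

The culture that concentrates its usage on one phrase needs fewer bits per phrase on average, which is the sense in which frequently used jargon is efficient within a field.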
Now consider the more interesting case, in
which the writer and the reader come from dif-
ferent fields and have different codebooks. This
situation is represented below.
[Diagram: Writer and Reader, now with different codebooks.]
Technically, codewords must have integer lengths; $-\log_2 p_i(x)$ is the noninteger codeword length that gives the correct lower bound on the optimal average codeword length, $H(X_i) = -\sum_{x \in \mathcal{X}} p_i(x) \log_2 p_i(x)$.
One could equivalently interpret $-\log_2 p_i(x)$ as the surprisal associated with the use of phrase $x$ and $H(X_i)$ as the expected surprisal of a reader from field $i$ reading a text generated according to the interests of field $i$, that is, her own interests.
What is the average message length per phrase, given that the writer now encodes phrases optimally for a reader in a different field? Denote the writer’s probability distribution over phrases $p_i(x)$ and the reader’s probability distribution $p_j(x)$. The writer selects phrases $x$ with probability $p_i(x)$ and encodes them in codewords with length $-\log_2 p_j(x)$. The expected length of the writer’s message per phrase is then the cross entropy of the two distributions (Cover and Thomas, 2006), that is, the entropy of $p_i$ plus the Kullback–Leibler divergence between $p_i$ and $p_j$ (Kullback and Leibler, 1951):
$$Q(p_i \| p_j) = -\sum_{x \in \mathcal{X}} p_i(x) \log_2 p_j(x). \tag{2}$$
This quantity is never smaller than the Shannon entropy and is strictly larger whenever the two distributions differ; a message sent to a reader in a different culture will, on average, be longer than the same message sent to a reader in the same one. Intuitively, a longer message will require more effort
to read and understand than a shorter message,
such that communication is more costly. Most
of the extra cost incurred comes from phrases
that are common in the author’s field but rare
in his or her reader’s field. Nontechnical phrases
(“here we show”) will have comparable frequencies
in both fields. If a particular phrase $x$ is used very frequently in field $i$ and very rarely in field $j$, the respective codewords will vary enormously in length, $-\log_2 p_i(x) \ll -\log_2 p_j(x)$. Insofar as the length of codewords corresponds to the time or effort required to decipher a message, an article written by someone in field $i$ will require a reader from field $j$ to expend much more energy to understand it than a reader from field $i$. The different message lengths thus reflect the inefficient communication between fields with distinct scientific cultures. In the worst case scenario, the two cultures concentrate their probability mass (i.e., interests) on completely disjoint subsets of $\mathcal{X}$. They have entirely distinct jargon, forcing a reader from the one to look up nearly every phrase used by an author in the other.
Again, keep in mind that this encoding is really a proxy for how much effort will be required from a reader in a different field $j$.
As was famously described by C. P. Snow in a lecture rather appropriately titled “The Two Cultures” (Snow and Collini, 2012).
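The statement that the cross entropy equals the entropy plus the Kullback–Leibler divergence can be written out in one short derivation. The symbol $D_{\mathrm{KL}}$ is notation assumed here for the divergence; it does not appear in the text above.

```latex
\begin{align*}
Q(p_i \| p_j)
  &= -\sum_{x \in \mathcal{X}} p_i(x) \log_2 p_j(x) \\
  &= -\sum_{x \in \mathcal{X}} p_i(x) \log_2 p_i(x)
     + \sum_{x \in \mathcal{X}} p_i(x) \log_2 \frac{p_i(x)}{p_j(x)} \\
  &= H(X_i) + D_{\mathrm{KL}}(p_i \| p_j) \;\ge\; H(X_i).
\end{align*}
```

Because the divergence is nonnegative and vanishes only when the two distributions coincide, the extra message length is zero only for a reader whose phrase distribution matches the writer's.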
With these results in hand, we can quantify the efficiency of communication from field $i$ to field $j$ as the ratio of the average message length within field $i$ to the average message length between fields,
$$E_{ij} = \frac{H(X_i)}{Q(p_i \| p_j)} = \frac{-\sum_{x \in \mathcal{X}} p_i(x) \log_2 p_i(x)}{-\sum_{x \in \mathcal{X}} p_i(x) \log_2 p_j(x)}, \tag{3}$$
and similarly define the cultural hole experienced by a reader from field $j$ reading an article in field $i$ as $C_{ij} = 1 - E_{ij}$. We denote the average cultural hole around field $i$ as $C_i = \sum_j C_{ij}/N$, where we sum over the $N = 60$ potential “reading” fields.
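The efficiency and cultural hole definitions can be sketched in a few lines. The three toy codebooks below are invented for illustration, the function names are hypothetical, and the average is taken over all fields including $j = i$ (which contributes zero); that last choice is one reading of the definition, not something the text confirms.

```python
import math

def cross_entropy_bits(p, q):
    """Q(p||q) = -sum_x p(x) * log2 q(x); needs q(x) > 0 wherever p(x) > 0."""
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)

def efficiency(p_i, p_j):
    """E_ij = H(X_i) / Q(p_i||p_j); note H(X_i) equals Q(p_i||p_i)."""
    return cross_entropy_bits(p_i, p_i) / cross_entropy_bits(p_i, p_j)

def cultural_hole(p_i, p_j):
    """C_ij = 1 - E_ij: hole met by a reader from field j reading field i."""
    return 1.0 - efficiency(p_i, p_j)

def average_hole(i, codebooks):
    """C_i = sum_j C_ij / N over all N reading fields (C_ii = 0 included)."""
    holes = [cultural_hole(codebooks[i], p_j) for p_j in codebooks.values()]
    return sum(holes) / len(holes)

# Invented phrase probabilities, for illustration only.
codebooks = {
    "evolution": {"fitness landscape": 0.60, "here we show": 0.30,
                  "market equilibrium": 0.10},
    "ecology":   {"fitness landscape": 0.40, "here we show": 0.40,
                  "market equilibrium": 0.20},
    "economics": {"fitness landscape": 0.05, "here we show": 0.35,
                  "market equilibrium": 0.60},
}

evo = codebooks["evolution"]
print(cultural_hole(evo, codebooks["ecology"]))    # shallow hole
print(cultural_hole(evo, codebooks["economics"]))  # deeper hole
print(average_hole("evolution", codebooks))
```

In this toy setting the ecologist faces a much shallower hole than the economist when reading the evolutionary biologist, mirroring the fitness landscape example earlier in the article.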
This operationalization of cultural holes, though
simple, has several advantages. First, unlike
most attempts to quantify semantic information
or measure cultural distance, ours is based on
an explicit model of communication and has
firm information-theoretic foundations (Cover
and Thomas, 2006).
Second, because we are
agnostic about what scholarly “phrases” are, we
can incorporate many kinds of language, from
mathematical formulae to canonical cases in le-
gal scholarship. Moreover, we can incorporate
other signals beyond those from language, like
articles of clothing worn or gestures performed—
anything for which we can define a probability
distribution over types and hence a proxy for in-
terpretive effort.
Finally, our framework is easily
extensible to more complex models of phrase and
signal generation, to hierarchical structure in the
codebook, and so on, so long as the model assigns
a probability to the appropriate semantic units.
To our knowledge, the closest alternative is (Livne et al., 2011), which uses the symmetrized Kullback–Leibler divergence to measure the semantic similarity of Twitter-using politicians. Although this shares our information-theoretic orientation, in our context, the cross entropy is easier to interpret.
To make this more plausible, consider the difficulty of correctly interpreting conventional gestures outside of their cultural context: Google “peace sign in Australia.”
Data and Methods
To illustrate our approach, we measure the cultural holes between scholarly cultures or fields in JSTOR. Recall that our method requires both the identification of “fields” and the assembly of “codebooks” that capture the culture of a field through its probability distribution over phrases. Fields are identified based on patterns of citation between articles using the map equation (Rosvall and Bergstrom, 2008). To assess culture, we assembled catalogs of phrases and their field-specific frequency distributions using full-text trigrams (distinct three-word combinations) drawn from a representative subsample of articles in each field written between 1990 and 2010. These frequency distributions serve as the codebooks in our model of scholarly communication. In this section, we provide technical details on field identification and codebook creation. We also describe our procedures for measuring distance between fields and visualizing cultural holes as chasms on a citation map.
Field Identification
We studied 60 large scientific fields in the JSTOR
citation network. This network includes more
than 1.5 million interconnected articles. We iden-
tified fields using the map equation (Rosvall and
Bergstrom, 2008), an algorithm for extracting
the community structure of complex networks
(Fortunato, 2010; Lancichinetti and Fortunato,
The map equation has been used extensively
to extract scientific fields in citation networks,
e.g., (Rosvall and Bergstrom, 2008). In our case,
the map equation tracks scholarly citation flow in
the JSTOR corpus. It partitions the articles into
fields to minimize the description length of the path of an idealized
researcher who navigates from article to article
by following citations at random. The fields are
the regularities that best compress the citation
flow; in this sense, they are optimal. In practice, a
field corresponds to a set of articles among which
the idealized researcher would spend a long time
before transitioning to another field.
Because citation networks are time directed, however,
this idealized random walk approach can cause older ar-
ticles to accumulate flow disproportionately. To resolve
this, we used the undirected version of the citation net-
work to infer the stationary distribution of the random
walk (removing the time–directionality and the consequent
problem of citation sinks). We then evaluated the quality
of proposed partitions using the directed network.
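One step of this procedure, inferring the stationary distribution from the undirected network, can be illustrated with a small sketch. It assumes a simple unweighted random walk, for which the stationary probability of a node on a connected undirected graph is proportional to its degree; whether the original analysis weighted links is not stated here, and the function name and toy citations are hypothetical. The partitioning step of the map equation itself is not shown.

```python
def stationary_distribution(citations):
    """Stationary distribution of an unweighted random walk on the
    undirected version of a citation network (proportional to degree)."""
    neighbors = {}
    for citing, cited in citations:
        if citing == cited:
            continue  # ignore self-citations in this sketch
        # Drop the time direction: record the link both ways.
        neighbors.setdefault(citing, set()).add(cited)
        neighbors.setdefault(cited, set()).add(citing)
    total_degree = sum(len(nbrs) for nbrs in neighbors.values())
    return {node: len(nbrs) / total_degree
            for node, nbrs in neighbors.items()}

# Toy network: three newer papers all cite the oldest paper p1.
citations = [("p3", "p2"), ("p3", "p1"), ("p2", "p1"), ("p4", "p1")]
pi = stationary_distribution(citations)
print(pi["p1"], pi["p4"])  # 0.375 0.125
```

On the directed version of this toy network a walker would get stuck at p1, which cites nothing; removing direction avoids that sink while still giving the heavily cited paper the largest share of flow.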
Selecting the Sample and Naming Fields
To assure that conclusions about the culture of a
given scientific field were well–founded, we first
eliminated from the sample any field with fewer
than 1,000 papers. This assures that the field
in question is well–covered by JSTOR. We fur-
ther restricted our sample to papers published
between 1990 and 2010, to focus our analysis
on contemporary science; any fields with fewer
than 500 papers within this 20-year window were
discarded. These sampling steps resulted in 78
fields. The logic behind our sampling procedure
is straightforward: we wanted to make sure that
we had enough papers from a field to construct
reasonable proxies of its contemporary scientific
culture. Note that scholarly fields vary in their
representation in JSTOR. This heterogeneous
coverage is responsible for the absence of many
prominent scientific disciplines, such as physics,
from our analysis.
Field names were then chosen manually, after identifying the phrases that best distinguish each cluster by measuring the mutual information between phrase and cluster (Manning et al., 2008). A list of the “most distinguishing phrases” for each cluster is available as supplementary material.
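A minimal sketch of such a mutual information score is given below. It assumes the document-level formulation commonly used for feature selection, in which each article either contains the phrase or not and either belongs to the cluster or not; the text does not say exactly which variant was used, and the function name and counts are hypothetical.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """Mutual information (bits) between phrase presence and cluster membership.

    n11: articles in the cluster that contain the phrase
    n10: articles outside the cluster that contain the phrase
    n01: articles in the cluster that lack the phrase
    n00: articles outside the cluster that lack the phrase
    """
    n = n11 + n10 + n01 + n00
    cells = (
        # (joint count, phrase marginal, cluster marginal)
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    )
    total = 0.0
    for joint, phrase_marginal, cluster_marginal in cells:
        if joint > 0:  # empty cells contribute nothing
            total += (joint / n) * math.log2(
                n * joint / (phrase_marginal * cluster_marginal))
    return total

print(mutual_information(10, 10, 10, 10))  # phrase unrelated to cluster: 0.0
print(mutual_information(10, 0, 0, 10))    # phrase identifies cluster: 1.0
```

Ranking phrases by this score within each cluster would surface candidates for the "most distinguishing phrases", from which a human could then pick a field name.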
Because of the computational costs of calcu-
lating cultural holes, we further reduced the size
of the sample from 78 to 60. These 60 fields were
chosen to maintain balance across the major do-
mains of scholarship in JSTOR. In particular, we
retained every field in statistics and molecular bi-
ology. In ecology and evolution, we selected fields
across subdomains to maximize the diversity of
our sample.
As discussed in footnote 16, it is unlikely that corpus representation of fields and their neighbors substantially affects our analysis. Missing fields would have to deviate substantially and systematically from what we observe to produce qualitative changes in our results. We are thus confident that our results are robust for scholarly domains and fields that are well-represented in JSTOR.
The phrase frequency distribution $p_i(x)$ for each scholarly field (its “codebook”) was assembled using the empirical frequency of each triplet of consecutive words (trigram) in a random sample of 500 articles in that field, published between 1990 and 2010. We chose 500 because it was the largest number of articles that we could sample from every field in the study period; some fields had just over 500 articles from 1990 to 2010, making larger samples impossible. Samples of approximately 100 articles yielded a consistent field-level entropy, so we are confident that the larger 500-article sample is sufficient.
In computational linguistics, a statistical language model assigns a probability $P(w_1, \ldots, w_m)$ to a sequence of $m$ words by means of a probability distribution. Language models built on trigrams are especially reliable; they capture substantial complexity (syntactic structure and conditional probabilities between words) while being easy to implement. In many cases, more complicated language models reveal little additional information (Jelinek, 1991). We removed some function and stop words from the trigrams to decrease the overall number of terms in the database. This procedure yields an augmented trigram language model, including bigrams and single words when flanked with stop words.
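The codebook construction described above might be sketched as follows. This is an illustrative reading, not the authors' pipeline: the tokenizer, the function names, and the rule of deleting stop words from inside each trigram (so that a trigram can shrink to a bigram or a single word) are assumptions, and the stop-word set contains only the examples named in the article rather than the full supplementary list.

```python
import re
from collections import Counter

# Only the example deleted words given in the article; the full list is in
# its supplementary material and is not reproduced here.
STOP_WORDS = {"a", "another", "behind", "no", "none", "something",
              "such", "than", "that", "wherever", "will"}

def tokenize(text):
    """Lowercase and keep alphabetic runs (an illustrative choice)."""
    return re.findall(r"[a-z]+", text.lower())

def phrase_counts(documents, stop_words=STOP_WORDS):
    """Count trigrams after deleting stop words from inside each trigram."""
    counts = Counter()
    for doc in documents:
        words = tokenize(doc)
        for k in range(len(words) - 2):
            kept = tuple(w for w in words[k:k + 3] if w not in stop_words)
            if kept:  # skip trigrams made only of stop words
                counts[kept] += 1
    return counts

def codebook(documents):
    """Empirical phrase distribution p_i(x) for one field's sample."""
    counts = phrase_counts(documents)
    total = sum(counts.values())
    return {phrase: n / total for phrase, n in counts.items()}

p_demo = codebook(["Such a green fluorescent protein marks the cell"])
for phrase, prob in p_demo.items():
    print(" ".join(phrase), round(prob, 3))
```

In this toy sentence the window "such a green" collapses to the single word "green" and "a green fluorescent" collapses to a bigram, which is how single words and bigrams enter the augmented model alongside full trigrams.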
Deleted words include a, another, behind, no, none, something, such, than, that, wherever, will. A complete list of these removed words, and more detailed methodology, are available in the supplementary material.
Using an augmented trigram model will, in principle, improve our ability to detect language differences across fields when some fields use trigram phrases (e.g., green fluorescent protein) more frequently.
To apply our information-theoretic model of scholarly communication (see “Using information theory to model scholarly culture and communication”) to real data, we need the codebooks for fields $i$ and $j$ to contain the same phrases, so that a concept from field $i$ can always be expressed in the codebook of field $j$. To ensure this, we introduce a teleportation parameter $\alpha$, which merges a discipline codebook $P_i$ with the “corpus codebook” $S$ generated from the articles of every field:
$$\tilde{p}_i(x) = (1 - \alpha)\, p_i(x) + \alpha\, s(x)$$
$$\tilde{p}_j(x) = (1 - \alpha)\, p_j(x) + \alpha\, s(x),$$
where the lowercase $p_i$, $p_j$, and $s$ indicate the probability distribution function over phrases for the appropriate codebooks. Intuitively, this means that with some small probability $\alpha$, the writer (or reader) consults not the discipline codebook $P_i$ but the corpus codebook $S$. The value of this parameter does not change our results, so we chose the intuitive value of 1 percent.
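The mixing step can be sketched as below. The sketch assumes the corpus codebook is the normalized pool of all field-level phrase counts, which is one plausible reading of "generated from the articles of every field"; the function names and toy counts are hypothetical.

```python
from collections import Counter

ALPHA = 0.01  # teleportation parameter; the article uses 1 percent

def normalize(counts):
    """Turn raw phrase counts into a probability distribution."""
    total = sum(counts.values())
    return {x: n / total for x, n in counts.items()}

def smooth_codebooks(field_counts, alpha=ALPHA):
    """Mix each field's distribution with the pooled corpus distribution s(x)."""
    pooled = Counter()
    for counts in field_counts.values():
        pooled.update(counts)  # adds counts phrase by phrase
    s = normalize(pooled)
    smoothed = {}
    for field, counts in field_counts.items():
        p = normalize(counts)
        # Every phrase in the corpus gets nonzero probability in every field.
        smoothed[field] = {x: (1 - alpha) * p.get(x, 0.0) + alpha * s[x]
                           for x in s}
    return smoothed

# Invented counts, for illustration only.
field_counts = {
    "biology": Counter({"green fluorescent protein": 8, "here we show": 2}),
    "economics": Counter({"market equilibrium": 7, "here we show": 3}),
}
smoothed = smooth_codebooks(field_counts)
print(smoothed["economics"]["green fluorescent protein"])  # small but nonzero
```

Without this step, a phrase absent from the reader's field would have probability zero and therefore an infinitely long codeword, so the cross entropy between the two fields would be undefined.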
Measuring Distance in the Citation Network
We measure distance in the citation network between fields $i$ and $j$ as follows. For randomly selected pairs of articles, one in field $i$ and one in field $j$, we calculate the number of links in the
shortest path between them. This topological dis-
tance is computed using the undirected version
of the citation network. The average value of
this quantity is an estimate of the average length
of the shortest path between these two fields.
This estimated average path length provides our
distance measure.
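A minimal sketch of this estimate, using breadth-first search on the undirected network, is shown below. The number of sampled pairs, the random seed, and the decision to skip unreachable pairs are assumptions made for illustration; the text does not specify them.

```python
import random
from collections import deque

def undirected(citations):
    """Adjacency sets for the citation network with direction removed."""
    adj = {}
    for citing, cited in citations:
        adj.setdefault(citing, set()).add(cited)
        adj.setdefault(cited, set()).add(citing)
    return adj

def shortest_path_length(adj, source, target):
    """Breadth-first search; number of links, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nbr in adj.get(node, ()):
            if nbr == target:
                return dist + 1
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None

def field_distance(adj, field_i, field_j, n_pairs=200, seed=0):
    """Average shortest-path length over randomly sampled article pairs."""
    rng = random.Random(seed)
    lengths = []
    for _ in range(n_pairs):
        d = shortest_path_length(adj, rng.choice(field_i), rng.choice(field_j))
        if d is not None:  # assumption: unreachable pairs are skipped
            lengths.append(d)
    return sum(lengths) / len(lengths) if lengths else float("inf")

# Toy chain of citations: a1 -> a2 -> b1 -> b2.
adj = undirected([("a1", "a2"), ("a2", "b1"), ("b1", "b2")])
print(field_distance(adj, ["a1", "a2"], ["b1", "b2"]))
```

Sampling pairs rather than enumerating them keeps the cost manageable on a network with more than a million articles, at the price of some sampling noise in the estimate.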
Visualizing Cultural Holes
To construct the topographical map in Figure 3,
we first embed the 60 fields in two dimensions
using principal coordinates analysis (PCoA), that
is, multidimensional scaling, as implemented in
R (Borg and Groenen, 2005). We define the
elements of the dissimilarity matrix using the
average shortest path distance between two fields
in the citation network. We refer to the $x, y$ coordinates of field $i$ as $F^i_x, F^i_y$. Cultural canyons were calculated as the distance-weighted sum of pairwise, symmetrized cultural holes $\bar{C}_{ij} = (C_{ij} + C_{ji})/2$.
The underlying logic is as follows. The height of each pixel in the map should be more affected by nearby fields. Thus we define a weight vector $w_{P_{xy}}$ for pixel $P_{xy}$, at position $x$ and position $y$. The components of this vector determine how much a given field $i$ contributes to the pixel height at position $x, y$:
$$w^i_{P_{xy}} = \left(\sqrt{(x - F^i_x)^2 + (y - F^i_y)^2}\right)^{-\beta},$$
where $\beta$ is a parameter controlling how rapidly the influence of a given field falls off with distance. For Figure 3, we used $\beta = 1$, so the weight is just the inverse distance. Finally, we exclude the nearest field ($\arg\max_i w^i_{P_{xy}}$) from calculations of the pixel height; inclusion of the nearest field leads to “defects” in the topographical map centered on each field (the sum is dominated by the self-hole $C_{ii} = 0$). The height $h_{P_{xy}}$ of the pixel $P_{xy}$ is then given by
$$h_{P_{xy}} = \sum_{i \neq \arg\max(w_{P_{xy}})} \; \sum_{j \neq \arg\max(w_{P_{xy}}),\, j \neq i} w^i_{P_{xy}}\, w^j_{P_{xy}}\, \bar{C}_{ij}.$$
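The pixel height calculation can be sketched as follows, under the reading that the height is a double sum over ordered pairs of fields, weighted by inverse distance and excluding the nearest field. The function names, field coordinates, and hole values are invented for illustration.

```python
import math

def symmetrize(holes):
    """Symmetrized hole (C_ij + C_ji) / 2 for every ordered pair (i, j)."""
    return {(i, j): (c + holes[(j, i)]) / 2 for (i, j), c in holes.items()}

def pixel_height(x, y, coords, sym_holes, beta=1.0):
    """Distance-weighted sum of symmetrized holes at map position (x, y)."""
    weights = {}
    for field, (fx, fy) in coords.items():
        dist = math.hypot(x - fx, y - fy)
        # A pixel sitting exactly on a field gets infinite weight for it.
        weights[field] = float("inf") if dist == 0 else dist ** (-beta)
    nearest = max(weights, key=weights.get)  # field with the largest weight
    others = [f for f in coords if f != nearest]
    return sum(weights[i] * weights[j] * sym_holes[(i, j)]
               for i in others for j in others if i != j)

# Invented layout and directed holes C_ij, for illustration only.
coords = {"A": (0.0, 0.0), "B": (2.0, 0.0), "C": (0.0, 2.0)}
holes = {("A", "B"): 0.2, ("B", "A"): 0.1,
         ("A", "C"): 0.6, ("C", "A"): 0.4,
         ("B", "C"): 0.5, ("C", "B"): 0.3}
height = pixel_height(0.0, 0.0, coords, symmetrize(holes))
print(height)  # approximately 0.2
```

At the pixel on top of field A, that field is excluded as the nearest one, so the height comes only from the B and C pair: two ordered terms of 0.5 × 0.5 × 0.4.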
Results: Analyzing Scholarly
Fields in JSTOR
We now turn to the analysis of our JSTOR corpus.
We find systematic patterns in the distribution of
cultural holes across the major domains of science
cataloged in JSTOR. Biological sciences (includ-
ing ecology, evolutionary, and molecular biology)
are surrounded by deeper cultural holes on average (i.e., higher values of the average cultural hole $C_i$) than behavioral and social sciences, such as psychology, economics, sociology, political science, business, religious studies, and education (significant by t-test; see Figure 1A).
Furthermore, the social sciences are more accessible to the biological sciences than vice versa. If $i$ is a social science field and $j$ is a biological science field, then $C_{ij} < C_{ji}$ on average, that is, the cultural hole encountered by a biological scientist reading a social science paper tends to be shallower than for a social scientist reading a biological science paper ($C_{ij} < C_{ji}$ for 893 of 899 pairs; significant, see supplemental material for details of statistical test).
Note that this pattern does not follow trivially
from the number of distinct phrases or technical
terms (Fig. 1B). The number of distinct phrases
is field- rather than domain-specific, with some
fields from each domain having few terms and
some having many. We find, however, that the
ratio of distinct three-word phrases to distinct
words does vary systematically by domain. So-
cial science fields tend to use fewer words in more
combinations, generating many distinct phrases
from their stock of words. The ratio is small for
biological science fields, suggesting that the con-
straints on word combination are stronger there
and that fields with many distinct phrases have
many distinct words (Fig. 1C). This domain-
specific asymmetry may provide a partial expla-
nation for the domain-level asymmetry in cultural
holes. Distinct phrases from the biological sciences are likely to contain many words poorly
represented in the social science codebooks.
sociological science | 227 June 2014 | Volume 1
Vilhena, Foster, Rosvall, West, Evans, and Bergstrom Finding Cultural Holes
Figure 1: (A) Average cultural hole $\sum_j C_{ij}/N$ across fields. Note that biological sciences are
surrounded by systematically larger cultural holes than social sciences. (B) Total number of phrases
(distinct three-word combinations, or trigrams) in each codebook. Note that there is no pattern at the
domain level; that is, biological sciences (blue) and behavioral/social sciences (orange) do not vary
systematically in the number of phrases. (C) Number of phrases per distinct word in each field.
Note the systematic difference between biological sciences (blue) and behavioral sciences (orange).
This pattern suggests that social science fields use fewer words in more combinations, whereas word
combination in the biological sciences is more constrained.
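The ratio of distinct three-word phrases to distinct words discussed above is simple to compute for any tokenized text. The sketch below is illustrative only: the function name `phrases_per_word` is hypothetical, and it assumes the text has already been split into word tokens, which is a simplification of however the field codebooks were actually built.

```python
def phrases_per_word(tokens):
    """Return (distinct trigrams, distinct words, trigrams per word).

    tokens: list of word tokens for one field (assumed pre-tokenized).
    """
    words = set(tokens)
    # Every run of three consecutive tokens counts as one phrase.
    trigrams = set(zip(tokens, tokens[1:], tokens[2:]))
    return len(trigrams), len(words), len(trigrams) / len(words)
```

A field that recombines a small vocabulary in many ways yields a high ratio, whereas a field whose phrases each bring their own vocabulary yields a low one.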
Second, we find that the structure of scientific
communication as traced by citations is related to
but distinct from the structure traced by cultural
or semantic difference. In other words, struc-
tural and cultural holes do not precisely align.
Hierarchical clustering analysis (UPGMA) of the
average shortest citation path distances between
fields yields a dendrogram that splits the biologi-
cal from the social sciences at the domain level
(Fig. 2A) (Sokal, 1958); see “Data and Meth-
ods” for the estimation of citation path distance.
When these fields are clustered by cultural hole,
however, the dendrogram does not reflect these
major domains (Fig. 2B). We perform this clustering
using a symmetrized version of the cultural
hole measure, $(C_{ij} + C_{ji})/2$. Now, the
subfields broadly corresponding to ecology and
evolution are separated from the molecular fields
(Figs. 2A and 2B, molecular fields shown in red).
This separation reveals a deep cultural hole within
biology that cuts across the flow of the citation
network. In economics, a smaller cultural hole is
revealed: growth economics and consumer theory
are clustered with portfolio theory, option pric-
ing, and time series analysis by citation (Fig. 2A,
shown in blue). When clustered by the cultural
hole measure, however, they group with subfields
having to do with labor (Fig. 2B, also blue).
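The clustering step can be illustrated with a small average-linkage (UPGMA) routine applied to the symmetrized cultural hole. This is a sketch under stated assumptions, not the authors' implementation: the function names `symmetrize` and `upgma` are hypothetical, and the routine returns a plain list of merges rather than a dendrogram object.

```python
from itertools import combinations

def symmetrize(holes):
    """Symmetrized cultural hole (C_ij + C_ji) / 2 for every pair of fields."""
    n = len(holes)
    return [[(holes[i][j] + holes[j][i]) / 2 for j in range(n)] for i in range(n)]

def upgma(dist, labels):
    """Average-linkage clustering; returns merges as (members_a, members_b, height)."""
    def avg(a, b):
        # UPGMA distance: mean dissimilarity over all cross-cluster pairs.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    clusters = [[i] for i in range(len(labels))]
    merges = []
    while len(clusters) > 1:
        # Merge the two clusters with the smallest average dissimilarity.
        a, b = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        merges.append((sorted(labels[i] for i in a),
                       sorted(labels[i] for i in b),
                       avg(a, b)))
        clusters = [c for c in clusters if c is not a and c is not b] + [a + b]
    return merges
```

Running the same routine on the matrix of average shortest citation paths instead of `symmetrize(holes)` would give the citation-based counterpart, so the two merge lists can be compared directly.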
These exploratory findings suggest the need
to combine structural information (from cita-
tions) and cultural information (from full-text)
systematically. To visualize these cultural holes
as they cut across the traditional citation-based
map of science, we embed scholarly fields in a
two-dimensional topography defined by citation
and jargon (Fig. 3). The relative location of
fields is determined by the citation flows between
them. Fields with substantial citation flow are
placed close together, and those with little flow
are placed far apart. The topographical features
of this landscape, in turn, are determined by jargon.
More specifically, the coordinates
of each field are assigned via principal coordinate
analysis of the citation distances. We measure
dissimilarity between fields using the average
shortest citation path between them (goodness
of fit = 0.25) (Borg and Groenen, 2005). The
depth of each pixel in the topographic overlay is
a weighted average of the symmetrized cultural
holes between nearby fields; see “Data and Meth-
ods” for details of map construction, including
the dissimilarity matrix for PCoA and the weight-
ing function. Deep chasms represent significant
cultural holes. Researchers in fields separated by
such holes must invest substantial resources to
translate their neighbors’ articles and incorporate
them into work within their field. Wading into
another literature with substantial jargon will
literally take the reader under water.
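For readers unfamiliar with principal coordinate analysis, the placement of fields can be sketched as classical scaling of a distance matrix. The code below is a minimal, dependency-free illustration and not the authors' procedure: the function name `pcoa_2d` is hypothetical, and it finds the two leading eigenvectors by power iteration, which assumes those eigenvalues are positive and well separated.

```python
import math

def pcoa_2d(dist, iters=500):
    """Embed a symmetric distance matrix in two dimensions (classical PCoA)."""
    n = len(dist)
    # Double-center the squared distances: B = -1/2 * J D^2 J.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    axes = []
    for _axis in range(2):
        # Power iteration for the leading eigenpair of b.
        v = [float(i + 1) for i in range(n)]
        for _step in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(val * val for val in w))
            if norm == 0:
                break
            v = [val / norm for val in w]
        lam = sum(v[i] * b[i][j] * v[j] for i in range(n) for j in range(n))
        # Coordinates on this axis: eigenvector scaled by sqrt(eigenvalue).
        axes.append([math.sqrt(max(lam, 0.0)) * val for val in v])
        # Deflate so the next pass finds the second eigenpair.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return list(zip(axes[0], axes[1]))
```

When the input distances are exactly Euclidean in the plane, the embedding reproduces them. Citation path distances generally are not, which is why a goodness-of-fit figure is reported alongside the map.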
This visualization exposes several key features
of scientific communication. First, maps of sci-
ence based purely on the structure of citations are
missing a large part of the story—just like maps
of society based purely on social ties. The nu-
merous cultural holes crosscutting this landscape
make it clear that the efficient flow of informa-
tion assumed by classical citation analysis is often
impeded by jargon. Second, the large-scale struc-
ture of the map makes sense. The social sciences
cluster together on the left, the biological sciences
on the right. At least in JSTOR, statistics sits
between the social and the biological sciences,
reflecting its role as a common resource. Interest-
ingly, the cultural hole between statistics and the
social sciences is shallower than between statistics
and the biological sciences. Within the social sci-
ences, there is a relatively clear path, with small
cultural holes, from education to psychology to
sociology to economics and business. This re-
flects the relative coherence of the social sciences
in terms of jargon and, by extension, matters
of common concern. In the biological sciences,
by contrast, many modest cultural holes sepa-
rate clusters of fields, with a more substantial
chasm between the ecological sciences (bottom
right) and genetics, phylogenetics, and systemat-
ics (middle right). Molecular biology fields (upper
right) are far from the rest of biology in both ci-
tation distance and jargon, with massive cultural
holes cutting molecular biology off from nearly
every discipline in JSTOR.
Visual inspection of our topographical map
suggests that the landscape is more rugged in the
biological sciences than in the social sciences. The
biological sciences are more balkanized by jargon
and hence have more differentiated local cultures,
which are reflected by the terms of interest from
their articles. To make this intuition precise, we
exploit the fact that the cultural hole between
Figure 2: (A) Clustering the fields by shortest citation path. Note that the biological sciences cluster together on the left and the social and
behavioral sciences cluster together on the right. (B) The biological sciences do not cluster together by symmetrized cultural hole $(C_{ij} + C_{ji})/2$. The
molecular and cell biology subfields (red) are distant from the other biological science fields and the social science fields by jargon. Similarly,
economic fields clustered by citation (blue) are grouped with other fields by jargon. Growth economics and consumer theory cluster with
portfolio theory, option pricing, and time-series analysis by citation but with labor-related fields (gender and labor; unemployment) by jargon.
All clustering was performed using UPGMA, but the result holds for different clustering methods.
Figure 3: Topographical map of science combining textual and citation data. Fields that are close
communicate frequently: positions in space are calculated by applying principal coordinate analysis to
the matrix of shortest average citation paths and retaining the first and second principal coordinates.
In this contour map, “oceans” (green shading to blue) represent the (negative of the) distance-weighted
sum of symmetrized cultural holes between fields; see “Data and Methods” for weighting function.
Despite substantial citation flow between fields separated by these holes (e.g., survival analysis
and medical outcomes), communication is inefficient. Note that social sciences cluster together on
the left–hand side and biological sciences on the right–hand side, with statistics located between.
Note also that molecular biology fields (upper right) are separated from other fields, including other
biological sciences, by huge cultural holes, and that cultural holes sensibly separate the remaining
fields (smaller panels). In the small panels, labels have been shifted slightly to reduce overlap and
increase legibility.
field $i$ and field $j$
grows with the average shortest
citation path between them. Equivalently,
the efficiency $E_{ij} = 1 - C_{ij}$ decays with citation
distance. To control for possible differences
in citation practices between fields or domains,
we normalize our measure of citation distance.
Specifically, we divide the average shortest path
from an article in field $i$ to one in field $j$ by the
average shortest path between two articles in field
$i$. We then model the decay in communication
efficiency with citation distance as
$$E_{ij} = 1 - \beta\left(1 - e^{-\gamma d_{ij}}\right), \tag{4}$$
where $d_{ij}$ is the normalized citation distance, $1 - \beta$
gives the asymptotic efficiency for fields at infinite
distance, and $\gamma$ controls the decay of $E_{ij}$
with distance. Fits were obtained via nonlinear
least squares in R. Figure 4A shows the field
with the slowest decay (option pricing) and the
field with the fastest decay (environmental toxi-
cology). These examples suggest the conditions
under which slow and fast decay occur. Fields
with slow decay have substantial citation flow to
several neighboring fields and relatively efficient
communication with those fields, that is, small
cultural holes. Fields with fast decay, by con-
trast, may have several fields at similar proximity
but much less efficient communication with these
fields, that is, deep semantic chasms. The decay
rate $\gamma$ varies across fields (Fig. 4B).
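The decay model can be fitted with any nonlinear least-squares routine. The authors report using R; the Python sketch below is a stand-in that relies on a different, simpler technique. For a fixed $\gamma$ the model is linear in $\beta$, so $\beta$ has a closed form and only $\gamma$ needs a one-dimensional search. The function names `normalized_distance` and `fit_decay`, the search range for $\gamma$ (0.01 to 100), and the grid resolution are all assumptions made for illustration.

```python
import math

def normalized_distance(between, within):
    """Normalized citation distance: path(i, j) / path(i, i) - 1."""
    return between / within - 1.0

def fit_decay(distances, efficiencies):
    """Least-squares fit of E = 1 - beta * (1 - exp(-gamma * d)).

    Returns (beta, gamma). Assumes at least one distance is positive.
    """
    def profile(gamma):
        # For fixed gamma, the best beta solves a linear least-squares problem.
        f = [1.0 - math.exp(-gamma * d) for d in distances]
        beta = (sum(x * (1.0 - e) for x, e in zip(f, efficiencies))
                / sum(x * x for x in f))
        sse = sum((e - (1.0 - beta * x)) ** 2 for x, e in zip(f, efficiencies))
        return sse, beta

    # Coarse log-spaced grid over gamma, then golden-section refinement.
    grid = [10 ** (-2 + 4 * k / 200) for k in range(201)]
    best = min(range(len(grid)), key=lambda k: profile(grid[k])[0])
    lo = grid[max(best - 1, 0)]
    hi = grid[min(best + 1, len(grid) - 1)]
    phi = (math.sqrt(5) - 1) / 2
    for _ in range(100):
        a = hi - phi * (hi - lo)
        b = lo + phi * (hi - lo)
        if profile(a)[0] < profile(b)[0]:
            hi = b
        else:
            lo = a
    gamma = (lo + hi) / 2
    return profile(gamma)[1], gamma
```

A large fitted $\gamma$ means efficiency drops to its asymptote $1 - \beta$ within a short citation distance, which is the signature of an insular field.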
This analysis suggests that the topographical
features observed in Figure 3 are not an arti-
fact of embedding the high-dimensional citation
network in two dimensions. We find that the de-
cay rate is higher in the biological sciences than
the social sciences (Fisher’s exact test: top half
vs. bottom half, $P < 0.0001$); in other words,
the landscape of cultural holes in the biological
sciences is indeed more rugged. There are a num-
ber of exceptions to this pattern, however; most
surprisingly, several fields related to molecular
biology have relatively small values of $\gamma$, so that
efficiency of communication falls off slowly with
citation distance. These counterexamples further
illustrate that decay rate is not systematically re-
lated to the overall amount of jargon in a specific
Figure 4: (A) Efficiency of communication from focal field $i$ to target field $j$ ($E_{ij} = 1 - C_{ij}$) decays with
the distance to field $j$ differently for different fields. Here we show the slowest decay (option pricing,
dashed line) and the fastest decay (environmental toxicology, solid line) out of the 60 fields (see
supporting information for others). (B) Decay rate $\gamma$ plotted for behavioral science fields (orange)
and biological science fields (blue). Focal fields with fast decay tend to have few fields nearby with
which communication is efficient. Those with slow decay have several neighbors with relatively small
cultural holes, that is, efficient communication. Distance is computed using the normalized shortest
path: the average shortest path from a paper in field $i$ to a paper in field $j$ divided by the average
shortest path from a paper in field $i$ to another paper in field $i$. We subtract 1 from this value so
that the normalized shortest path from a field to itself is 0. This normalization allows us to account
for differences in citation norms that cause focal fields to be tightly or loosely connected, that is, to
have a short or long average path distance within field.
field and hence the depth of the average cultural
hole around it. Although the molecular biology
fields have the most jargon in JSTOR (see Fig.
1A), decay rates for several of these fields are
comparatively low, while two are very high (HIV
and environmental toxicology).14
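The domain comparison of decay rates reported above uses Fisher's exact test on a top-half versus bottom-half split. One plausible reading, assumed here, is a 2×2 table crossing domain (biological or social) with whether a field's $\gamma$ falls in the top or bottom half of the ranking. The sketch below implements the standard two-sided test for such a table; the function name `fisher_exact_two_sided` is hypothetical and no counts from the paper are used.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    # (the small tolerance guards against floating-point ties).
    return sum(p for p in map(prob, range(lo, hi + 1))
               if p <= observed * (1 + 1e-9))
```

Here `a` and `b` would be the numbers of biological fields in the top and bottom halves, and `c` and `d` the corresponding counts for social science fields.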
14 Similarly, the decay rate is not systematically related
to the number of distinct terms (Fig. 1B); for example, HIV
and plant pathogens have a similar number of distinct
terms (and similar average cultural holes) but wildly different
decay rates.
Our results suggest that combining the structural
analysis of citation flows with explicit models of
communication processes (e.g., jargon-induced
cultural holes) exposes important features of scholarly
communication. We further suggest that the
interaction of the structural and cultural dimensions
of scholarly communication reveals previously
neglected social processes. For example,
we argue that the decay of communicative effi-
ciency with citation distance reflects the relative
insularity of scholarly cultures or fields. Fields
with faster decay rates are less accessible to others
close by. Scholars working in these fields make ex-
tensive use of jargon not shared with neighboring
fields, creating cultural holes that make inter-
field communication less efficient. Readers from
neighboring fields are sufficiently aware of this
nearby knowledge to reference it but can only un-
derstand it through potentially prohibitive study
and decoding. By contrast, scholars in fields
with slow decay rates likely use jargon that is
shared with neighboring fields—their probability
distribution over phrases is similar—making their
work much more accessible through phrases of
common concern. The absence of cultural holes
makes communication between these scientific
cultures much easier.
This varying relationship between shortest ci-
tation paths and communicative efficiency helps
us interpret between-field differences in the average
depth of cultural holes, $\sum_j C_{ij}/N$. Molecular and
cell biology fields have the largest average cultural holes
in JSTOR (see Fig. 1A). The decay
rate of some of these fields15 is quite slow, however,
in comparison with other fields in the social
and biological sciences. This surprising combi-
nation suggests that molecular and cell biology
does not have substantial jargon—and, conse-
quently, deep cultural holes—because it is insular
or exclusionary, per se. Slow decay rates exclude
this interpretation at the field level. Indeed, the
three cell biology fields cluster close together in
citation distance and have relatively shallow cul-
tural holes between them, reflecting a substantial
overlap in matters of interest. Instead, molecular
biology fields have high jargon simply because
they are remote from most other fields in JSTOR,
both in citation distance and semantic distance,
that is, matters of concern. By contrast, vole
research is close to many fields by citation (in
the ecology cluster), but its communicative ef-
ficiency decays quickly as a function of citation
distance. Voles are small, mouselike rodents, and
while vole research has only a moderate amount
of jargon overall, it overlaps little in matters of
interest with its neighbors (e.g., bat and bear
research). Vole research is thus surrounded by
a moat of jargon and is much more insular than
many molecular biology fields. These cases il-
lustrate that decay rate can be used to assess
the relative insularity of different fields, taking
into account the distance between fields in the
citation network as measured by shortest path.
15 Research on plant pathogens, the extracellular matrix,
the cytoskeleton, and membrane cell biology.
16 It is unlikely that these patterns in $\gamma$ are an artifact of
corpus representation of the focal field and its neighbors.
Social science fields and ecological science fields both have
many close neighbors in the corpus yet significantly different
typical values of $\gamma$. It is possible that the relatively
slow decay rates for molecular and cell biology fields could
be affected by the undersampling of potential “middle-distance”
neighbors, but the efficiency of communication
to these missing neighbors would have to differ radically
from sampled fields to shift the estimate of $\gamma$.
These examples suggest an interesting interpretation
of the continuities and discontinuities
between fields of science and scholarship in JSTOR.
First, JSTOR fields fall into four great
camps: the social sciences; the ecological and
evolutionary branches of biology; the molecular
and cellular branches of biology; and statistics.
Concentrating on the substantive fields (social,
ecological, and molecular), we note that each of
these clusters, tied closely together by citation
flow, is also united by concern with a particu-
lar scale (macroscopic and human; macroscopic
and nonhuman; microscopic). The social sciences
tend to have more shallow cultural holes, on aver-
age, and can communicate quite efficiently with
neighboring social sciences. This suggests that
these fields are not absorbed by particularities
but instead share many matters of concern, re-
maining relatively integrated with one another.
The ecological sciences, by contrast, have deeper
cultural holes and communicate inefficiently with
neighboring ecological fields. The example of vole
research suggests a broader principle: these fields
are absorbed in many particularities that they do
not share with one another (I am concerned with
voles; you with bears; she with bats).17
The molecular biological sciences are surrounded
by deep cultural holes, but many communicate
efficiently with their immediate neighbors. This
reflects an orientation toward many shared partic-
ularities: the molecules and processes that form
the physical substrate of life. It is especially inter-
esting that the ecological sciences—popularly as-
sociated with holistic, anti–reductionist thinking—
are so balkanized in their matters of concern,
while the molecular sciences, despite dividing life
into so many distinct building blocks, neverthe-
less seem more integrated at the semantic level.
A deeper explanation of these patterns is beyond
the scope of this paper, but note that none of
this would be apparent from a pure citation or
semantic analysis alone. Note also that all as-
pects of our formalism are portable to any other
type of data involving structural relationships
and cultures revealed in language or signs.
17 It is possible that these cultural holes may be somewhat
attenuated by analogical, thesaurus-like mappings
(e.g., voles and bats are both small mammals), but these
mappings are likely to be limited in scope.
Information theory provides a simple but powerful
framework with which to model communication
and measure cultural holes. We demonstrated
its utility through the analysis of scientific
communication, a central project in the science of
science (de Solla Price, 1965). Qualitative stud-
ies have considered content of communication
in conjunction with citations (Ceccarelli, 2001),
but the vast majority of quantitative analyses
rely exclusively on citations to map the structure
and flow of scientific communication (Rosvall and
Bergstrom, 2008). In cases where content is ad-
dressed directly, it is viewed as a substitute for
citation analysis (Landauer et al., 2004; Gerrish
and Blei, 2010), or citation patterns are used
to construct a measure of semantic similarity
(Moody and Light, 2006), or citations are treated
as another type of content (Erosheva et al., 2004).
Livne et al. (2011) is an important exception,
although focused on a distinct literature, that is,
social media and politics.
In this paper, we have demonstrated that
scholarly semantic information is not a substitute
for citation analysis. Rather, the two sources of
information are complementary, together reveal-
ing patterns that otherwise remain invisible. We
introduced an easily understood measure of cul-
tural and semantic distance, grounded in a simple
model of communication. We then showed that
the social and biological sciences differ systemat-
ically in their use of jargon and the patterning of
associated cultural holes. We demonstrated that
science takes on a different shape when viewed
through citation or content alone and introduced
two procedures for combining such information:
first, an attractive and easily interpretable “to-
pographical map” of science, in which citation
structure establishes location and jargon estab-
lishes topography; and second, a rigorous proce-
dure for capturing how communication efficiency
scales with citation distance. Using these two
procedures, we made several surprising discov-
eries about the structure of the scholarly fields
in JSTOR: the coherence of the social sciences;
the balkanization of ecological sciences by jargon;
and the surprising coherence of molecular and cell
biology in the face of its isolation from the other
fields in our sample. Moreover, though we pilot
this approach by analyzing the structure and cul-
ture of scientific communication, we believe that
it can be straightforwardly applied to any system
involving a network of structural relationships
and a distribution of cultural symbols, signals, or
Our analysis of scientific communication has
marked limitations. Sampled trigrams (three-
word phrases) map imperfectly to the phrases
of most interest to scientists and scholars. Cur-
rently we do not distinguish different types of
phrases or mark out those of special scientific
importance. Note, however, that using a more
sophisticated language model, for example, one
taking syntactic information into account, will in
general refine but not substantially alter our cultural
hole measurements and subsequent conclusions.
Under our core assumption that frequently
encountered linguistic units are easy to understand
and infrequently encountered ones are hard
to understand, the primary contribution to cultural
holes between fields will still come from
phrases that are used frequently in one field and
rarely in another (capital asset pricing, Gibbs
sampler, microtine cycles, vesicular stomatitis
virus). A more sophisticated language model
might avoid splitting phrases that are longer than
three words, for example, Markov Chain Monte
Carlo, but this is essentially a difference in ac-
counting. Such a phrase will still contribute under
a trigram model, and because cultural holes are
computed as ratios, the actual values are unlikely
to change significantly. Likewise, a more sophisti-
cated language model might infer deeper cultural
holes between fields in which the same phrase is
used robustly in different grammatical roles, but
such situations are hard to imagine. When the
same phrase is used in different semantic contexts,
it might reasonably count as distinct and there-
fore deepen the cultural hole, but this argument
suggests that our measure is in general a lower
bound ripe for subsequent refinement.18
Thus we
are confident that our augmented trigram model
provides a robust estimate of the cultural holes
between two fields, given our general model of
culture and communication.
18 In principle, we might underestimate the cultural
holes between social science fields because we are failing
to capture substantial differences in the context in
which shared phrases are used. Such differences are unlikely
in immediate neighbors, however, and thus unlikely
to contribute to unmeasured balkanization in the social
sciences.
More importantly, we cannot currently say
why jargon is adopted: when it was introduced
to maximize communicative efficiency within a
field without regard to outside audiences (Kemp
and Regier, 2012; Fawcett and Higginson, 2012;
Zipf, 1935, 1949), and when it was used to dis-
tinguish the field and excavate a cultural hole
that limits oversight, association and loss of sta-
tus (Bourdieu and Thompson, 1991; Coleman,
1985; Holmes and Meyerhoff, 1999). Neverthe-
less, we believe that our investigation sheds sub-
stantial new light on jargon’s distribution and
its consequences for science. Future work that
tags phrases and citations with temporality, au-
thorship, institution, and semantic class (e.g.,
methods) may begin to disentangle the origins
of jargon, the role of cultural holes in sustaining
structural holes, and the circumstances in which
actors can fill in cultural holes through intercul-
tural communication. Similarly, research that
identifies and interrogates the emergence of new
symbols, signs, or phrases may be able to identify
when efficiency or distinction is a primary motive
for the creation of cultural holes.
More broadly, our findings point to a new
research program in the analysis of culture and
the science of science. On the empirical front,
scholars now have a lean machinery to identify
cultural holes and assess the efficiency of commu-
nication between communities as well as their rel-
ative insularity. On the methodological front, our
framework can be extended to deal with complex
syntactical rules and richer models of communica-
tion. Catalogs of technical terms can be expanded
and structured to include term redundancy, hier-
archy, and syntactic structure. Moreover, these
lists can be broadened to include signals and sym-
bols beyond written language. Communication
rules can also be altered to capture the real social
and cognitive processes through which scholars
read and assess documents and people in other
cultures evaluate messages of all types.19
19 On the theoretical front, we also note that our operationalization
of Pachucki and Breiger’s concept of cultural
holes suggests intriguing connections between information
theory, communication theory, and cultural sociology as
well as the ethnomethodological perspective.
Though our analysis highlights major patterns
in JSTOR and science at large, we have
barely scratched the surface of this landscape of
possibility. Cultural holes constantly evolve as
collaborative dynamics change. What happens to
field-specific jargon when two cultures merge? Do
changes in citation drive changes in jargon, or vice
versa? Do jargon and other semiotic markers follow
standard evolutionary birth–death dynamics,
producing the culture-level patterns we observe—
or are they subject to additional social processes?
Our results underline both the exploding oppor-
tunities in the large-scale structural analysis of
culture (Bail, 2014) and the importance of build-
ing and interconnecting new models and sources
of information as we seek to quantify, understand,
and shape behavior in science (Evans and Foster,
2011).
References
Bail, Christopher A. 2014. “The Cul-
tural Environment: Measuring Culture
with Big Data.” Theory and Society 43.
Bearman, Peter and Paolo Parigi. 2004. “Cloning
Headless Frogs and Other Important Matters:
Conversation Topics and Network Structure.”
Social Forces 83:535–57.
Bernstein, Basil. 1964. “Elaborated and Re-
stricted Codes: Their Social Origins and
Some Consequences.” American Anthropolo-
gist 66:55–69.
Bischof, Nicole and Martin J. Eppler. 2010. “Clar-
ity in Knowledge Communication.” In Pro-
ceedings of the Tenth International Knowledge
Management Conference IKnow, volume 10,
pp. 162–174. Verlag der Technischen Univer-
Borg, Ingwer and Patrick J. F. Groenen. 2005.
Modern Multidimensional Scaling: Theory and
Applications. New York: Springer.
Bourdieu, Pierre and John B. Thompson. 1991.
Language and Symbolic Power. Cambridge
MA: Harvard University Press.
Boyack, Kevin W., Richard Klavans, and Katy
Börner. 2005. “Mapping the Backbone of Sci-
ence.” Scientometrics 64:351–74.
Burt, Ronald S. 1992. Structural Holes: The
Social Structure of Competition. Cambridge
MA: Harvard University Press.
Ceccarelli, Leah. 2001. Shaping Science
with Rhetoric: The Cases of Dobzhansky,
Schrodinger, and Wilson. Chicago: Univer-
sity of Chicago Press.
Coleman, Hywel. 1985. “Talking Shop: An
Overview of Language and Work.” Inter-
national Journal of the Sociology of Lan-
guage 1985:105–30.
Cover, Thomas M. and Joy A. Thomas. 2006.
Elements of Information Theory. Hoboken:
de Condillac, E. B. 1782. Cours d’étude pour
l’instruction du Prince de Parme. Paris: Houel.
de Solla Price, D. J. 1965. “Networks of Scientific
Papers.” Science 149:510–15.
Erickson, Bonnie H. 1996. “Culture, Class,
and Connections.” American Journal of So-
ciology 102:217–51.
Erosheva, Elena, Stephen Fienberg, and John Lafferty. 2004. “Mixed-Membership Models of Scientific Publications.” Proceedings of the National Academy of Sciences 101:5220–27.
Evans, James A. and Jacob G. Foster. 2011. “Metaknowledge.” Science 331:721–25.
Fawcett, Tim W. and Andrew D. Higginson. 2012. “Heavy Use of Equations Impedes Communication among Biologists.” Proceedings of the National Academy of Sciences
Feldman, Robin. 2008. “Plain Language Patents.” SSRN Scholarly Paper ID 1731651, Social Science Research Network, Rochester, NY.
Fortunato, Santo. 2010. “Community Detection in Graphs.” Physics Reports
Friedland, Roger. 2009. “The Endless Fields of Pierre Bourdieu.” Organization 16:887–917.
Garfinkel, Harold. 1991. Studies in Ethnomethodology. Hoboken, NJ: John Wiley.
Gerrish, Sean and David M. Blei. 2010. “A Language-Based Approach to Measuring Scholarly Impact.” In Proceedings of the 26th International Conference on Machine Learning, June 21–24, pp. 375–382.
Han, Shin-Kap. 2003. “Unraveling the Brow: What and How of Choice in Musical Preference.” Sociological Perspectives 46:435–
Holmes, Janet and Miriam Meyerhoff. 1999. “The Community of Practice: Theories and Methodologies in Language and Gender Research.” Language in Society 28:173–83.
Homans, George Caspar. 1961. Social Behavior: Its Elementary Forms. New York: Harcourt, Brace and World, Inc.
Jelinek, Fred. 1991. “Up from Trigrams.” In Proceedings of Second European Conference on Speech Communication and Technology, EUROSPEECH, volume 91, pp. 1037–40. Genova, Italy: September 24–26.
Kemp, Charles and Terry Regier. 2012. “Kinship Categories across Languages Reflect General Communicative Principles.” Science
Knorr-Cetina, K. 1999. Epistemic Cultures: How the Sciences Make Knowledge. Cambridge, MA: Harvard University Press.
Kullback, Solomon and Richard A. Leibler. 1951. “On Information and Sufficiency.” Annals of Mathematical Statistics 22:79–86.
Lancichinetti, Andrea and Santo Fortunato. 2009. “Community Detection Algorithms: A Comparative Analysis.” Physical Review
Landauer, Thomas K., Darrell Laham, and Marcia Derr. 2004. “From Paragraph to Graph: Latent Semantic Analysis for Information Visualization.” Proceedings of the National Academy of Sciences of the United States of America 101:5214–19.
Leydesdorff, Loet and Ismael Rafols. 2008. “A Global Map of Science Based on the ISI Subject Categories.” Journal of the American Society for Information Science and Technology 60:348–62.
Livne, Avishay, Matthew P. Simmons, Eytan Adar, and Lada A. Adamic. 2011. “The Party Is Over Here: Structure and Content in the 2010 Election.” In Proceedings of 5th International AAAI Conference on Weblogs and Social Media. Barcelona, Spain.
Lizardo, Omar. 2006. “How Cultural Tastes Shape Personal Networks.” American Sociological Review 71:778–807.
Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval, volume 1. Cambridge: Cambridge University Press.
Moody, James and Ryan Light. 2006. “A View from Above: The Evolving Sociological Landscape.” American Sociologist 37:67–86. s12108-006-1006-8.
Pachucki, Mark A. and Ronald L. Breiger. 2010. “Cultural Holes: Beyond Relationality in Social Networks and Culture.” Annual Review of Sociology 36:205–24.
Reach, G. 2009. “Linguistic Barriers in Diabetes Care.” Diabetologia 52:1461–63.
Richardson, Matthew L. 2010. “Publishing Scientific Outreach Materials in Educational and Social Science Journals.” American Entomologist 56:11–13.
Rosvall, Martin and Carl T. Bergstrom. 2008. “Maps of Random Walks on Complex Networks Reveal Community Structure.” Proceedings of the National Academy of Sciences 105:1118–23.
Shannon, Claude E. 1948. “A Mathematical Theory of Communication.” Bell System Technical Journal 27:379–423, 623–56.
Small, Henry. 1999. “Visualizing Science by Citation Mapping.” Journal of the American Society for Information Science and Technology 50:799–813.
Snow, C. P. and Stefan Collini. 2012. The Two Cultures. Cambridge, MA: Cambridge University Press.
Sokal, Allan and Jean Bricmont. 1998. Fashionable Nonsense: Postmodern Intellectuals’ Abuse of Science. London: Picador.
Sokal, Robert R. 1958. “A Statistical Method for Evaluating Systematic Relationships.” University of Kansas Scientific Bulletin 38:1409–38.
Sonnett, John. 2004. “Musical Boundaries: Intersections of Form and Content.” Poetics 32:247–64.
Tavory, Iddo and Ann Swidler. 2009. “Condom Semiotics: Meaning and Condom Use in Rural Malawi.” American Sociological Review 74:171–89.
Vaisey, Stephen and Omar Lizardo. 2010. “Can Cultural Worldviews Influence Network Composition?” Social Forces 88:1595–1618.
Xiao, Zhixing and Anne S. Tsui. 2007. “When Brokers May Not Work: The Cultural Contingency of Social Capital in Chinese High-Tech Firms.” Administrative Science Quarterly 52:1–
Zipf, George K. 1935. The Psycho-biology of Language. Boston, MA: Houghton Mifflin.
Zipf, George K. 1949. Human Behavior and the Principle of Least Effort. Cambridge, MA:
This work was supported in part by NSF grant SBE-0915005 to CTB, a WRF-Hall research fellowship to DAV, and Swedish Research Council grant 2012-3729 to MR. We thank JSTOR for a generous gift and for processing the initial data for this project. We thank Jake Fisher, Mark Mizruchi, Jim Moody, Gabriel Rossman, and Lynne Zucker for their comments on earlier drafts. Direct correspondence to Jacob G. Foster.
Daril A. Vilhena: Department of Biology, University of Washington. E-mail:
Jacob G. Foster: Department of Sociology, University of California, Los Angeles. E-mail: fos-
Martin Rosvall: Department of Physics, University of Umeå. E-mail:
Jevin D. West: Information School, University of Washington. E-mail:
James Evans: Department of Sociology, University of Chicago. E-mail:
Carl T. Bergstrom: Department of Biology, University of Washington. E-mail: cbergst@