
Querying Google Books Ngram Viewer's Big Data Text Corpuses to Complement Research

Abstract

If qualitative and mixed methods researchers have a tradition of gleaning information from all possible sources, they may well find the Google Books Ngram Viewer and its repository of tens of millions of digitized books yet another promising data stream. This free cloud service enables easy access to big data by querying the word frequency counts of a range of terms and numerical sequences (in a range of languages) from 1500-2000, a 500-year span of book publishing, with new books being added continually. The data queries that may be made with this tool are virtually unanswerable otherwise. The word frequency counts provide a lagging indicator of both instances and trends related to language usage, cultural phenomena, popularity, technological innovations, and a wide range of other insights. The text corpuses contain de-contextualized words used by the educated literati of the day sharing their knowledge in formalized texts. The enablements of the Google Books Ngram Viewer provide complementary information sourcing for designed research questions as well as free-form discovery. This tool allows downloading of the "shadowed" (masked or de-identified) extracted data for further analyses and visualizations. This chapter provides both a basic and advanced look at how to extract information from the Google Books Ngram Viewer for light research.
Exploring the Google Books Ngram Viewer for “Big Data” Text Corpus Visualizations
SHALIN HAI-JEW
KANSAS STATE UNIVERSITY
SIDLIT 2014 (OF C2C)
JULY 31 - AUG. 1, 2014
Presentation Overview
As part of the Google Books digitization project, the Google Books Ngram Viewer
(https://books.google.com/ngrams) was released in late 2010 to enable public querying of a “shadow dataset”
created from the tens of millions of digitized books. The texts are from a 500-year span (1500-2000+), with new texts
added fairly continuously, and there are a range of datasets of different text corpuses (and in different languages,
like Italian, French, German, Spanish, Russian, Hebrew, and simplified Chinese). The name of the tool comes from a
computer science term referring to strings of alphanumeric terms in a particular order: a unigram (or one-gram)
consists of one entity, a bigram (or two-gram) consists of two entities, and so on. (Its precursor was a prototype
named “Bookworm.”) Users may acquire the (de-contextualized) word, phrase, or symbol frequency counts of
terms in books, which provide a lagging indicator of trends (over time), public opinion, and other phenomena.
The Ngram Viewer has been used to provide insights on diverse topics such as the phenomenon of fame (and the
fields which promote fame), collective forgetting, language usage, cultural phenomena, and technological
innovations. The data queries that may be made with this tool are virtually unanswerable
otherwise. The enablements of the Google Books Ngram Viewer provide complementary information sourcing for
designed research questions as well as free-form discovery. This tool is also used for witty data visualizations (such
as simultaneous queries of “chicken” and “egg” to see which came first) based on the resulting plotted line chart.
The tool also enables the download of raw dataset information of the respective ngrams, and the findings are
released under a generous intellectual property policy. This presentation will introduce this semi-controversial tool
and some of its creative applications in research and learning.
Welcome!
Hello! Who are you?
Any direct experiences with n-grams? Any research angles that may be
informed by n-grams?
What are your interests in terms of the Google Books Ngram Viewer? What
is your level of experience with this Viewer?
What is the Google Books Ngram Viewer?
A SIMPLE OVERVIEW
History
Google Books project (conceptualized from 1996, official secret launch in
2002, announcement of “Google Print” project in 2005, new user
interface in 2007, known also as Google Book Search)
Tens of millions of digitized books from the 1500s to the present
Book Search interface available in 35 languages
Over 10,000 publishers and authors from 100+ countries in the Book Search
Partner Program
Integrated with Google Web Search
Public domain works in full view, copyrighted works with snippets (“Google
Books”)
History (cont.)
Derived shadow dataset: Bookworm Ngrams -> Ngram Viewer
Based on a “bag of words” approach
Launched in late 2010
Google Books Ngram Viewer prototype (then known as “Bookworm”)
created by Jean-Baptiste Michel, Erez Aiden, and Yuan Shen…and then
engineered further by The Google Ngram Viewer Team (of Google
Research)
7
History (cont.)
Includes a number of corpuses across many languages (finer details of
each corpus are documented on the site)
Current corpuses
American English 2012, American
English 2009
British English 2012, British English 2009
Chinese 2012, Chinese 2009
English 2012, English 2009
English Fiction 2012, English Fiction 2009
English One Million
French 2012, French 2009
German 2012, German 2009
Hebrew 2012, Hebrew 2009
Spanish 2012, Spanish 2009
Russian 2012, Russian 2009
Italian 2012
Some Terminology
Digital humanities: Research at the intersection of computational methods
and disciplines in the humanities
Big data: Datasets with an “n of all,” large datasets with a large number of
records (such as in the millions)
N-gram: A contiguous sequence of n items from text or speech (unigram,
bigram / digram, trigram, four-gram, five-gram, and so on), often
representing a concept
Text corpuses: Collections of text such as manuscripts or microblogging
streams or other texts
Shadow dataset: Masked or de-identified extracted data
Some Terminology (cont.)
Frequency count: The number of times a particular n-gram appears in a
text corpus
Smoothing: The softening of spikes by averaging data from preceding and
following years to indicate a “moving average” (a smoothing of 1 means
a datapoint on the linegraph is the average of the counts for 1 year to
each side; a smoothing of 2 means the average of the counts for 2 years to
each side, etc.); see the sketch after this list
Data visualization: The image-based depiction of data for the
identification of patterns
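Smoothing is just a centered moving average over the yearly values. A minimal sketch in Python, assuming a dict mapping years to relative frequencies (hypothetical numbers); at the edges of a date range fewer neighboring years exist, so the averaging window simply truncates:

```python
# Minimal sketch of Ngram Viewer-style smoothing as a centered moving average.
def smooth(series, smoothing=3):
    """Average each year's value with up to `smoothing` years on each side."""
    years = sorted(series)
    smoothed = {}
    for y in years:
        window = [series[n] for n in years if abs(n - y) <= smoothing]
        smoothed[y] = sum(window) / len(window)
    return smoothed

# A smoothing of 1 averages each datapoint with 1 year to each side.
raw = {1999: 0.2, 2000: 0.8, 2001: 0.2}
print(smooth(raw, smoothing=1))  # {1999: 0.5, 2000: 0.4, 2001: 0.5}
```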
N-grams (examples comma-separated)
Unigram (or one-gram): time, $, Julia, pi, 3.14159265359
Bigram (or digram or two-gram): borrow money, return home, déjà vu, golden mean
Trigram: her own purse, the trip abroad
Four-gram: the time it took, the Merchant of Venice
Five-gram: when he left the store, after she fed the dog
Six-gram: I plan to travel very soon
Seven-gram: The president will visit the state tomorrow
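For intuition about what gets counted, here is a minimal sketch of extracting and counting n-grams from raw text, assuming simple whitespace tokenization (the actual Google Books pipeline also handles punctuation, part-of-speech tagging, and inclusion thresholds):

```python
from collections import Counter

def ngrams(text, n):
    """Yield contiguous n-token sequences from whitespace-tokenized text."""
    tokens = text.split()
    return zip(*(tokens[i:] for i in range(n)))

sample = "when he left the store he fed the dog at the store"
print(Counter(ngrams(sample, 2)).most_common(2))
# e.g. [(('the', 'store'), 2), ...]; 'the store' occurs twice in this sample
```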
<3.14159265359>
A “Shadow Dataset” of Google Books Collections
Shadowing the dataset to protect against privacy infringement or reverse
engineering of manuscripts
The machine extraction of n-grams from the de-contextualized texts
Pure frequency counts of the n-grams in various sequences (but only for
ngrams that appear in 40 or more books, a threshold that keeps the
processing manageable)
Tagging of parts of speech (POS) that structure language but do not hold
semantic value
The elimination of unique phrase sequences to avoid potential hacking and
reverse-engineering of particular texts
A “Shadow Dataset” of Google Books Collections (cont.)
Depiction of frequency counts over time (with defined and editable start-and-
end years) for broad-scale trending
Ability to compare multiple words and phrases
Value Added Capabilities
Downloadable n-gram datasets (for further analysis; see the sketch after this list)
Interactive visualizations from mouseovers
Machine-highlighted years of interest
Linkage to original texts (on Google)
Choices from dozens of different and multilingual corpuses (French, simplified
Chinese, Italian, Russian, Spanish, Hebrew, German, and others)
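As a sketch of working with a downloaded raw file: the 2012-edition unigram files are, to my reading of the public documentation, tab-separated with the fields ngram, year, match_count, and volume_count. The filename below is a hypothetical instance of the public naming pattern; substitute a file actually downloaded from the datasets page.

```python
import csv
import gzip

# Hypothetical example of the public file-naming pattern.
path = "googlebooks-eng-all-1gram-20120701-q.gz"

with gzip.open(path, mode="rt", encoding="utf-8") as f:
    for ngram, year, match_count, volume_count in csv.reader(f, delimiter="\t"):
        # The 40-or-more-books inclusion threshold is applied upstream,
        # so every row here has already met it.
        if ngram == "quantum" and int(year) >= 1950:
            print(year, match_count, volume_count)
```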
Anomalous Years of Interest; Links
<foot and mouth disease, FMD, hoof
and mouth disease>
Downloadable Experimental Datasets
Some Controversies with the Ngram
Viewer
Decontextualized machine “analysis” vs. contextualized reading and
human expertise
Machine “(non)reading” (in frequency counts) vs. human reading
(symbolic decoding), a quantitative vs. a qualitative focus, an
overbalance into computational understandings (a quantity of words
separated from conscious expressed meaning and the authorial hand)
Example from Karen Reimer’s “Legendary, Lexical, Loquacious Love” (1996), a
deconstructed book which consisted of lists of alphabetized contents (per
Uncharted…)
Inability to verify results outside of Google Books Ngram Viewer
Some degree of elusive “black box” lack of knowledge about functionality
Research Potential?
At first blush, what do you think can be learned from such data
extractions? Why?
What can be asserted from the linegraphs?
Would publishers ever accept a deconstructed book for publication, or do
you think these are mainly one-offs?
A Cursory Overview of Research Findings So Far from the Ngram Viewer
Fame: Who gets famous, and how? What sorts of professions lead to
fame?
Collective memory: In terms of human memories of events, how long do
people tend to remember? What is the trajectory of collective
consciousness from knowing to not knowing?
Adoption of innovations: What is the typical time length for people to
accept technological and other innovations? How do cultural
phenomena affect human populations over time?
Language evolution: How does language evolve over time? How do rules
of language become normalized?
A Cursory Overview of Research Findings So Far from the Ngram Viewer (cont.)
First-use of terms: When was a term first used? (such as terminology
linked to technological innovations) (a form of fact-checking)
Popularity: Between various artists / scientists / politicians, who was more
popular in his / her day?
Government Censorship: What was the role of Nazi censorship of certain
Jewish artists in terms of their reputations and mentions / non-mentions in
the literature?
Some Examples
USING GENERAL DATA EXTRACTIONS
Single Extraction
<dimsum>
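The same kind of extraction can be retrieved programmatically. This is a minimal sketch against the unofficial JSON endpoint that the Viewer's own web page calls; the URL, the corpus numbering (15 appeared to correspond to English 2012 at the time of observation), and the response fields are assumptions, since this is not a documented or supported API.

```python
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "content": "dimsum",
    "year_start": 1900,
    "year_end": 2000,
    "corpus": 15,      # assumed numbering for the English 2012 corpus
    "smoothing": 3,
})
url = "https://books.google.com/ngrams/json?" + params

with urllib.request.urlopen(url) as resp:
    data = json.load(resp)  # assumed: a list of {"ngram": ..., "timeseries": [...]} dicts

for series in data:
    print(series["ngram"], series["timeseries"][:5])
```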
Two Element Comparison
<dimsum,tapas>
Two Element Comparison (cont.)
<future,past>
Multi-Element Comparisons and Contrasts
<height,weight> <diet,exercise>
Year(s)
<1984, 1999, 2010>
Symbols
<&, ampersand>
Phrases: Multiword Strings
<crossing the Rubicon>
(Somewhat More) Advanced Searches
WILDCARD EXTRACTIONS
INFLECTION SEARCHES
CASE SENSITIVITY / CASE INSENSITIVITY
ACCESSING TAGGING (AS FOR PARTS OF SPEECH)
RICHER COMBINATIONS
SOME BOOLEAN-BASED QUERIES
STARTS AND ENDS OF SENTENCES
DEPENDENCY RELATIONS
ROOTS OF PARSE TREES
Wildcard Search *
The use of * to stand in for a word; the Ngram Viewer will display the top
ten most common substitutes for the asterisk (sketched below)
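Conceptually, the wildcard ranks the words that fill the starred slot by frequency. A minimal sketch over hypothetical 5-gram counts (the Viewer computes this over the whole corpus):

```python
from collections import Counter

# Hypothetical 5-gram counts; None marks the Viewer's * slot in the pattern.
fivegram_counts = Counter({
    ("the", "rest", "of", "his", "life"): 9000,
    ("the", "time", "of", "his", "life"): 1200,
    ("the", "love", "of", "his", "life"): 800,
})

def top_fillers(pattern, counts, k=10):
    """Return the k most frequent words standing in for the None slot."""
    slot = pattern.index(None)
    fillers = Counter()
    for gram, c in counts.items():
        if len(gram) == len(pattern) and all(
            p is None or p == g for p, g in zip(pattern, gram)
        ):
            fillers[gram[slot]] += c
    return fillers.most_common(k)

print(top_fillers(("the", None, "of", "his", "life"), fivegram_counts))
# [('rest', 9000), ('time', 1200), ('love', 800)]
```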
<the * of his life>
<the * of her life>
<a betrayal of the *>
Inflection Search with _INF
Differentiation of various word forms
(infinitive verb form)
-ed
-ing
-s
(irregular spellings)
<tell_INF>
Case Sensitive / Insensitive Searches
Case Sensitive: Capitalizations and lower cases matter
Case Insensitive: Capitalizations and lower cases do not matter
Case insensitivity will result in a variety of capitalization / lower-case mixes
and variations for a particular search term
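In effect, a case-insensitive query aggregates every capitalization variant of the search term. A minimal sketch with hypothetical unigram counts:

```python
# Hypothetical unigram counts for the capitalization variants of one term.
counts = {"RICO": 120, "Rico": 450, "rico": 30}

def case_insensitive_total(term, counts):
    """Sum the counts of all surface forms that match ignoring case."""
    return sum(v for k, v in counts.items() if k.lower() == term.lower())

print(case_insensitive_total("rico", counts))  # 600
```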
Case Sensitive
<RICO>
Case Insensitive
<RICO>
Part-of-Speech Tags
Disambiguation of a term by defining its usage as a part of speech, to
capture the conceptual usage
May be used as stand-alones (_VERB_) or appended to a word (play_VERB)
TAG       APPLICATION
_NOUN_    Noun
_VERB_    Verb
_ADJ_     Adjective
_ADV_     Adverb
_PRON_    Pronoun
_DET_     Determiner or article
_ADP_     Adposition (preposition or postposition)
_NUM_     Numeral
_CONJ_    Conjunction
_PRT_     Particle
_ROOT_    Root of the parse tree
_START_   Start of a sentence (sentence boundary)
_END_     End of a sentence (sentence boundary)
<play_NOUN, play_VERB>
Some Combinations
Inflection keyword with part-of-speech text
buy_INF, buy_VERB_INF (buy, buying, bought, buys)
Dependencies with wildcards
ride=>*_NOUN (ride car; ride bike; ride bus)
<buy_INF, buy_VERB_INF>
<ride=>*_NOUN>
Ngrams at the Starts and Ends of Sentences
Sentence Boundary Indicators
_START_
_END_
<_START_ The USSR,_START_ The US>
Dependency Relations with the => Operator
main noun => descriptor (the main noun dependent on the descriptor)
<home=>sweet>
<apocalypse=>zombie>
Root: _ROOT_
Stands for the root of the parse tree (syntax tree), to which the rest of the
sentence connects based on the syntax
Placeholder for “what the main verb of the sentence is modifying”
(“Google Books Ngram Viewer”)
Does not stand in for a word or a position in a sentence
<_ROOT_=>win,_ROOT_=>lose>
Ngram Compositions
USING OPERATORS ( )
Operators ( )
Operator   Function
+    sums the expressions on the left and the right, to combine multiple ngram time series into one line in the linegraph (can combine multiple added sequences)
-    subtracts the expression on the right from the one on the left (needs spaces on either side of the minus sign)
/    divides the expression on the left by the expression on the right
*    multiplies the expression on the left by the number on the right, to compare ngrams of widely different frequencies (enclose the entire ngram in parentheses so the asterisk is read as a multiplication sign rather than a wildcard)
:    applies the ngram on the left to the text corpus on the right, to enable comparisons against different corpuses
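Because these compositions are arithmetic over year-aligned time series, they can also be reproduced client-side on extracted data. A minimal sketch, assuming each list holds year-aligned relative frequencies (hypothetical numbers):

```python
# Hypothetical year-aligned relative frequencies for three ngrams.
breakfast = [0.0021, 0.0022, 0.0024]
lunch     = [0.0011, 0.0012, 0.0013]
dinner    = [0.0031, 0.0030, 0.0029]

# + : combine several ngram series into one line
meals = [b + l + d for b, l, d in zip(breakfast, lunch, dinner)]
# / : express one series as a share of another
lunch_share = [l / m for l, m in zip(lunch, meals)]
# * : rescale a rare term so it is visible against common ones
lunch_x1000 = [l * 1000 for l in lunch]

print(meals, lunch_share, lunch_x1000)
```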
+ operator
<Russian Federation, Russia, USSR, Russian Federation+Russia+USSR>
- operator (with ( + ) groups)
<breakfast, lunch, dinner, breakfast+lunch+dinner, (breakfast+lunch+dinner)-(snacks)>
/ operator
<traumatic brain injury, TBI, mTBI, (traumatic brain injury)/(traumatic brain injury+TBI+mTBI)>
* multiplication operator
<HUMINT,GEOINT,MASINT,OSINT,SIGINT,TECHINT,(CYBINT*1000),DNINT,(FINNT*1000)>
To explore visualizations between texts with widely varying frequencies
The Data Extractions
HUMINT,GEOINT,MASINT,OSINT,SIGINT,TECHINT,CYBINT,DNINT,FINNT
HUMINT,GEOINT,MASINT,OSINT,SIGINT,TECHINT,(CYBINT*1000),DNINT,(FINNT*1000) (to no avail, since “CYBINT” and “FINNT” do not have sufficient occurrences to count in the Ngram Viewer)
: operator
<Beijing:eng_gb_2012, Beijing:chi_sim_2012, Beijing:eng_us_2012>
Some definitions of the pointed-to datasets
eng_gb_2012: “Books predominantly in the English language that were
published in Great Britain”
chi_sim_2012: “Books predominantly in simplified Chinese script”
eng_us_2012: “Books predominantly in the English language that were
published in the United States”
Two Steps to Privacy: 隐私 (“privacy”)
Popularization of Term with Broader Interactions Globally
<隐私>
Any Ideas for Using the Ngram Viewer for Your Fun, Work, and Research?
Fun: Stories to tell friends
Work: Insights that may affect knowledge and decision-making
Research: Citable information from the world’s collective knowledge conveyed through books…
Ngram Viewer Applications for Visual Wordplay and Wit
A ONE-SCREEN TEXT-BASED VISUAL TO ACCENTUATE A WEBSITE, PRESENTATION, OR PUBLICATION?
<d’oh>
Chicken or the Egg?
<chicken,egg>
Five Elements
<metal,wood,water,fire,earth>
Good and Evil
<good,evil>
<Buddhism, Christianity, Hinduism, Islam, Judaism, Mormonism, Sikhism>
<Oriental, Asian>
Ivy League Institutions
<Brown University, Columbia University, Cornell University, Dartmouth College, Harvard University, Princeton University, the University of Pennsylvania, Yale University>
Caveat
The ease of accessing and understanding the visualization may lead to
misunderstanding of the underlying information… This is partly a
product of the cognitive bias known as the “availability heuristic,” in which
ideas that come to mind more easily and quickly are accepted as truth.
Visualizations like these are highly simplified as compared to the
underlying realities.
Researchers need to make sure to head off potential misunderstandings
with such Ngram Viewer linegraph visualizations when using these as
“accents” or as “invitations” to people to learn more.
Ngram Viewer Uses in Research
EARLY THOUGHTS
Some Possible Research Applications of the Ngram Viewer
Variations of the following:
Competition between languages and phrases (their origins and trajectories /
trends over time, word and phrase gists over time, multilingual queries, and
others)
Cultural understandings and cross-cultural insights; popular sentiment and
understandings
Analysis of research capabilities and understandings (historically and through
the present)
Population readiness for accepting particular ideas through big data text
corpus analysis
Literary terms of art and their uses over time
Some Possible Research Applications of the Ngram Viewer (cont.)
Effects of historical events (governance, social phenomena, wars, health issues,
and others) on language
Biographical insights on historical figures (particularly comparative insights)
Research lead creation; research source identification
and many others
Qualifiers and Clarifications
Words in books as a lagging (vs. leading) indicator
Changing authorship and access to literacy and publication over time
(with the changing roles of books from years of formalism to much less
formalism in the present day)
Word frequency counts as one information stream among many
Still a critical role for close readings of select publications
Ngram Viewer counts are much more effective and informative when
used with complementary streams of information and in-depth analysis
Particular Researcher Requirements
General Understandings
Language literacy (ideally multilingual literacy)
Digital (and computational) literacy
Understandings of history
Understandings of the changing roles of authors and books
Understandings of big data and big data analyses
Domain + Computational Understandings
Deep knowledge of particular domain fields and related fields
Understandings of language uses and publications in-the-field
Computational set thinking
Uses of the Ngram Viewer Tool
Openness to discovery learning
Knowing what to query and how (particularly with creative query setups, year adjustment parameters, links to documents)
Knowing what may be asserted and what may not be asserted (ability to qualify assertions)
Knowing when to conduct complementary and follow-on research (including close readings)
Initial Ngram Viewer Tips
Start simple. Once the basic extractions are acquired, try the more
complex ones using tagging and combinatorial approaches. Broaden out
to foreign languages.
Reload the Ngram Viewer if a “flatline” is attained because “under heavy
load, the Ngram Viewer will sometimes return a flatline”.
Text corpuses accessed by the Ngram Viewer are always changing, and
more data is added all the time. It may help to capture a sequence of
data extractions over time to see what changes there may be.
Tips on Research Approaches
Shape a data query both for need-to-know and to the limits of the massive
dataset.
Err on the side of making a number of various runs for a data query. Keep
good records of the data extractions.
Take time to actually analyze the results. Sometimes, because the
extractions occur in milliseconds, just taking a cursory look at the
linegraph seems sufficient…but much more can be learned by interacting
with the visual. (One can explore years, resources, and other aspects, for
example.)
Keep a research journal of observations and findings. Note your learning
about the tool as well.
Tips on Research Approaches (cont.)
Spend some time to discover the tool by making various runs purely for
discovery learning. Structure some of these explorations with thought
experiments.
Branch out beyond the Ngram Viewer by analyzing the extracted datasets
in other tools. One freeware and open-source one is the Ngram Statistics
Package.
Also, use other datasets (such as from social media platform-extracted big
data corpuses) for analysis. One such publicly available set is the Rovereto
Twitter N-Gram Corpus.
Tips on Research Approaches (cont.)
There is a broad and wide literature on the machine analysis of human
language: natural language processing, stylometry, computational
linguistics, sentiment analysis, personality analysis, speech recognition, and
others. There are automated text summaries (with efforts towards
accuracy and “grammaticality”). There are language models used for
speech recognition and machine translation between languages. A core
unit underlying these approaches is the n-gram. It may help to delve more
deeply, for certain types of research, to more fully contextualize research-
based approaches.
Creative Commons Release
Currently, datasets and graphs are released
through a Creative Commons Attribution 3.0
Unported License.
Graphs may be used “freely…for any purpose,” although
acknowledgment of the Google Books Ngram Viewer and a link to
http://books.google.com/ngrams are desirable.
References
Aiden, E. & Michel, J.-B. (2013). Uncharted: Big Data as a Lens on Human
Culture. New York: Riverhead Books.
Mayer-Schönberger, V. & Cukier, K. (2013). Big Data: A Revolution That Will
Transform How We Live, Work, and Think. New York: Houghton Mifflin
Harcourt Publishing Company.
Conclusion and Contact
Dr. Shalin Hai-Jew
Instructional Designer
Information Technology Assistance
Center
Kansas State University
212 Hale Library
785-532-5262
shalin@k-state.edu
One of our projects, funded by Google, gave us a higher level of access to their millions of scanned books, which we used to revisit Walter E. Houghton’s classic The Victorian Frame of Mind, 1830–1870 (1957). We wanted to know if the themes Houghton identified as emblematic of Victorian thought and culture—based on his close reading of some of the most famous works of literature—held up against Google’s nearly comprehensive collection of over a million Victorian books. We selected keywords from each chapter of Houghton’s study—loaded words like “hope,” “faith,” and “heroism” that he deemed central to the Victorian mindset and character—and queried them (and their Victorian synonyms, to avoid literalism) against a special data set of titles of nineteenth-century British printed works. A grid of search results showing the frequency of a hundred words in the titles of books and their change between 1789...