All content in this area was uploaded by Boyang Li on Mar 13, 2018
IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, MANUSCRIPT ID 1
Predicting Personality from Book Preferences
with User-Generated Content Labels
Ng Annalyn, Maarten W. Bos, Leonid Sigal, and Boyang Li
Abstract—Psychological studies have shown that personality traits are associated with book preferences. However, past
findings are based on questionnaires focusing on conventional book genres and are unrepresentative of niche content. For a
more comprehensive measure of book content, this study harnesses a massive archive of content labels, also known as ‘tags’,
created by users of an online book catalogue, Goodreads.com. Combined with data on preferences and personality scores
collected from Facebook users, the tag labels achieve high accuracy in personality prediction by psychological standards. We
also group tags into broader genres, to check their validity against past findings. Our results are robust across both tag and
genre levels of analyses, and consistent with existing literature. Moreover, user-generated tag labels reveal unexpected
insights, such as cultural differences, book reading behaviors, and other non-content factors affecting preferences. To our
knowledge, this is currently the largest study that explores the relationship between personality and book content preferences.
Index Terms—Personality Profiling, Narrative Preferences, Social Media, Behavioural Footprints
1 INTRODUCTION
“Histories make men wise;
poets witty;
the mathematics subtle;
natural philosophy deep;
moral grave;
logic and rhetoric able to contend.”
By Francis Bacon, Of Studies (1597)
Francis Bacon may have been the first to suggest a correlation
(perhaps even a causal relation) between book
preferences and the personality of readers. Indeed, re-
search has found that reading fiction leads to changes in
personality [1] and increased empathy [2]. While book
reading may influence personality, personality in turn may
affect book choice. This is supported by correlations found
between personality and book preferences [3]. Being able
to predict book preferences using readers’ personality has
many potential applications, such as personalizing prod-
ucts and services, improving recommender systems, and
enabling targeted advertising.
However, due to difficulties in data collection, research
on personality and book preferences typically focuses on a
few dozen book genres or fewer, such as four genres
for novels [4], 16 genres for books [5], or 34 genres for
books and magazines combined [6]. The largest study [3]
to our knowledge inspected 81 book topics and their corre-
lations with readers’ personality. Narrow categorizations
of book content can be problematic as preferences for niche
genres may be inaccurately inferred. Moreover, studies
measured book preferences using self-report question-
naires, which can be lengthy and thus vulnerable to errant
or null responses [7].
To improve both the quantity and quality of data for our
study, we combine two online data sources. For data on
book content, we use over 24,000 user-supplied tags from
a book catalogue website, GoodReads.com. For data on
reader personality and book preferences, we use a data-
base of Facebook profiles comprising more than 60,000 re-
spondents who had ‘liked’ book-themed Facebook pages
and who had also completed a personality survey on the
social networking site [8].
We adopt the Big Five personality model, also known as
the five-factor model, which consists of extraversion,
agreeableness, openness, neuroticism and conscientious-
ness. This set of five traits is known to predict a wide range
of behaviors and psychopathology [9]. We briefly review
their known associations with book preferences here:
Extraverts enjoy social activities and have high arousal
levels [9]. They prefer content related to social activities
such as parties [10], as well as arousing content such as hor-
ror [6]. Hence, we hypothesize that extraversion would be
associated with a preference for genres with socially ori-
ented themes, as well as genres which are stimulating.
Agreeable individuals are kind and considerate [9], and
tend to empathize with story characters [11]. They prefer
narratives on positive social relationships, such as romance
and family [6], hence they are likely to steer clear of violent
• N. Annalyn is with the Ministry of Defence (Singapore), #16-01 Defence
Technology Tower B, Singapore 109681.
E-mail: ng_li_ting_annalyn@defence.gov.sg.
• M. W. Bos, L. Sigal, and B. Li are with Disney Research Pittsburgh, 4720
Forbes Avenue, Lower Level, Suite 110, Pittsburgh, PA 15213.
E-mail: {mbos, lsigal, albert.li }@disneyresearch.com.
or disturbing themes. As agreeable people tend to evaluate
media content favorably in general [10], we might also ex-
pect them to ‘like’ more books on average.
Open individuals seek intellectual stimulation and are
comfortable with new ideas [9]. Openness predicts a pref-
erence for avant-garde genres [12], and has been a con-
sistent predictor of fiction exposure [2], [3], [4], [13]. Based
on existing literature, we hypothesize that individuals
with high openness would appreciate intellectually stimu-
lating and fiction genres. Since reading is an intellectual ac-
tivity, we would also expect open individuals to ‘like’ more
books in general.
Neurotic individuals are emotionally unstable. Being
prone to feeling lonely and depressed [14], they may use
media narratives as a means of escape from everyday life
[4]. Hence, we hypothesize that neurotic individuals
would prefer narratives about feel-good, alternative reali-
ties.
Conscientious individuals are achievement-striving
and self-disciplined, preferring deliberate planning over
spontaneity [9]. They have also been reported to like non-
fiction content such as news and politics [6], [15]. Hence,
we hypothesize that conscientiousness would be associ-
ated with a preference for books of informative and practi-
cal content.
Merging personality data (from Facebook surveys) with
book data (from Facebook ‘likes’ and GoodReads), we con-
duct three levels of analysis. The first is a tag-level analysis,
in which we correlate personality with book tags. As tags
were spontaneously generated by readers themselves, they
contain richer information on book content compared to
the usual, smaller set of genre categories. Second is a genre-
level analysis. To verify that our findings are consistent
with those from previous studies, tags are clustered into
broader book genres and then correlated with personality
again. Third, we examine whether personality has an over-
arching influence on one’s tendency to like books in gen-
eral.
2 DATA COLLECTION
2.1 Personality and Book Preferences
We use data collected from a Facebook app called myPer-
sonality [8], which allows users to measure their Big Five
personality traits with the International Personality Item
Pool questionnaire [16]. Because users received feedback
on their personality scores, they were likely to be moti-
vated to respond diligently.
Besides their personality scores, users of the myPerson-
ality app also shared which Facebook pages they had
‘liked’. Facebook pages can be dedicated to any entity, such
as a book, movie, or celebrity (see Fig. 1 for an example).
For our study, we focus on pages labeled as books. This en-
ables us to examine the correlation between users’ person-
ality scores and their book preferences.
Data collected via Facebook has been shown to be com-
parable to data collected via standalone websites [17].
Moreover, the Facebook personality dataset we use in this
study has successfully predicted a range of personal traits,
from web browsing habits [18] to language use [19].
Youyou et al. even suggested that personality inferences of
users made based on digital footprints such as Facebook
‘likes’ are more accurate than those made by users' friends
[20]. Due to its wide adoption, we deem the reliability of
this dataset to be satisfactory for our study.
We use book pages with at least 50 ‘likes’ from Facebook
users who also completed the personality question-
naire. This achieves more reliable personality profile esti-
mates of people who liked each book. In all, we analyze
479 books that were ‘liked’ by 61,662 users. For each of the
five personality dimensions, we took the median score of
all users who ‘liked’ the book as the aggregate personality
score for that book. Median scores were favored over mean
scores to reduce the influence of outliers.
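The aggregation step described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function name `book_personality_profiles`, the input layout (a mapping from book to user IDs and a mapping from user ID to trait scores), and the toy numbers are assumptions. Only the minimum-'likes' threshold and the use of the median come from the text.

```python
from statistics import median

TRAITS = ("extraversion", "agreeableness", "openness",
          "neuroticism", "conscientiousness")

def book_personality_profiles(likes, scores, min_likes=50):
    """Return {book: {trait: median score}} for sufficiently 'liked' books.

    likes  -- hypothetical layout: {book: [user_id, ...]}
    scores -- hypothetical layout: {user_id: {trait: score}}
    """
    profiles = {}
    for book, users in likes.items():
        # Only users who also completed the personality questionnaire count.
        rated = [u for u in users if u in scores]
        if len(rated) < min_likes:
            continue
        profiles[book] = {
            trait: median(scores[u][trait] for u in rated)
            for trait in TRAITS
        }
    return profiles

# Toy data (made up); the threshold is lowered to 3 so the example stays small.
scores = {
    "u1": dict.fromkeys(TRAITS, 2.0),
    "u2": dict.fromkeys(TRAITS, 3.0),
    "u3": dict.fromkeys(TRAITS, 9.0),  # an outlier that would distort a mean
}
likes = {"popular-book": ["u1", "u2", "u3"], "niche-book": ["u1"]}
profiles = book_personality_profiles(likes, scores, min_likes=3)
```

In the toy data the outlying user pulls the mean to about 4.67, while the median stays at 3.0, which is the robustness property the text relies on.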
2.2 Book Content
To extract a book’s content, we adopt a data-driven approach
by mining user-generated tags from GoodReads, an
online book catalogue. When this study was conducted in
Fig. 1. Facebook ‘likes’ for various pages.
ANNALYN ET AL.: PREDICTING PERSONALITY FROM BOOK PREFERENCES WITH ONLINE USER-GENERATED DATA 3
2016, the site had more than 40 million users and more than
1.3 billion books. Users can label books with descriptive
tags, which cover a broad range of concepts like genre (e.g.
children’s-literature), time of publication (e.g. 20th-
century-fiction), story characters (e.g. Dumbledore), au-
thor information (e.g. British-author), awards (e.g. or-
ange-prize) and reading behavior (e.g., back-burner).
Due to their richness, we chose these user-generated tags
as a proxy for book content. However, some tags, such as
upstairs-bookshelves, appear to make sense only to a
small group of users. This calls for robust analysis tech-
niques that can withstand noise.
With permission from Goodreads, we crawl their site for
the 479 books in our personality dataset, and then harvest
the tags which Goodreads users had associated with these
books. Besides accounting for the books present in the per-
sonality dataset, we also identify the top 50 books associ-
ated with each of the top 2000 most frequently-used tags
across the whole catalogue. We then crawl all tags associated
with these books.
Next, we match book titles from Goodreads to their re-
spective pages on Facebook, leading to a many-to-many re-
lationship. For example, the book Harry Potter and the Phi-
losopher’s Stone is matched not only to the Facebook page
of the same name, but also to a general page for the Harry
Potter series. At the same time, the general Facebook page
for Harry Potter is matched to all seven books in the series.
3 DATA PROCESSING
Goodreads users can create their own tags. While this pro-
vides a rich source of information, it also introduces noise
that poses several challenges for analysis. To overcome
these challenges, we employ several techniques.
First, we use a set of criteria to filter tags for analysis:
• For each book crawled, only tags that were applied 3 times
or more are recorded.
• Tags applied fewer than 50 times in total and tags applied
to fewer than 15 books are discarded.
• Tags must consist of at least 3 characters, at least 1
letter, and at most 2 non-English characters.
We use these filtering criteria because they were
deemed via manual inspection to be effective at eliminat-
ing non-informative tags. After crawling and filtering, our
dataset contained 14,731 unique books, 24,091 unique tags
and 193,498,469 total tags.
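The three filtering criteria listed above could be implemented roughly as follows. This is a sketch under stated assumptions: the function names and the `{book: {tag: count}}` input layout are hypothetical, "non-English character" is approximated as "non-ASCII character", and the order in which the criteria are applied is a guess, since the text does not specify it.

```python
def is_well_formed(tag):
    """At least 3 characters, at least 1 letter, at most 2 non-English chars."""
    letters = sum(ch.isalpha() for ch in tag)
    # Assumption: non-ASCII code points stand in for "non-English characters".
    non_english = sum(ord(ch) > 127 for ch in tag)
    return len(tag) >= 3 and letters >= 1 and non_english <= 2

def filter_tags(tag_counts, min_per_book=3, min_total=50, min_books=15):
    """tag_counts: {book: {tag: times applied}} -> filtered copy."""
    # Criterion 1: per book, record only tags applied often enough.
    recorded = {
        book: {t: c for t, c in tags.items() if c >= min_per_book}
        for book, tags in tag_counts.items()
    }
    # Criterion 2: drop tags that are rare overall or used on too few books.
    totals, n_books = {}, {}
    for tags in recorded.values():
        for t, c in tags.items():
            totals[t] = totals.get(t, 0) + c
            n_books[t] = n_books.get(t, 0) + 1
    keep = {
        t for t in totals
        if totals[t] >= min_total and n_books[t] >= min_books
        and is_well_formed(t)  # Criterion 3: tag must be well formed.
    }
    return {
        book: {t: c for t, c in tags.items() if t in keep}
        for book, tags in recorded.items()
    }

# Toy data with reduced thresholds so that the effect is visible.
data = {
    "b1": {"fantasy": 5, "xy": 9, "rare": 2},
    "b2": {"fantasy": 4, "xy": 9},
}
filtered = filter_tags(data, min_per_book=3, min_total=8, min_books=2)
```

Here `rare` fails the per-book threshold and `xy` is too short, so only `fantasy` survives.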
Next, we identify four challenges in analyzing the tags
data:
• Information Value. Common tags (e.g. fiction,
book-club, and favorites) appear frequently across
many titles, and thus are not useful in distinguishing
between books.
• Synonyms. Some tags have identical or similar
meanings (e.g. children and kids), and hence need
to be analysed as one.
• Idiosyncrasies. Some tags are used whimsically. For
example, Harry Potter and the Philosopher’s Stone was
tagged as science more than 20 times.
• Random Noise. We expect a baseline level of random
noise. If a tag is applied to one book 20 times, and to
another 21 times, this difference would likely be
due to random fluctuations rather than an actual
difference in content.
To distinguish informative tags, we use the term
frequency-inverse document frequency (tf-idf) measure. With
tf-idf, the frequency $f_{t,b}$ of a tag appearing in a book is
discounted by how common the tag $t$ generally is. In other
words, common tags such as fiction and favourites are
discounted heavily to indicate their low information value.
Letting $f_{t,b}$ denote the frequency of tag $t$ appearing in book
$b$, $N$ denote the total number of books, and $n_t$ denote the
number of unique books that tag $t$ is applied to, we have
$$\text{tf-idf}(t, b) = f_{t,b} \, \log\left(1 + \frac{N}{n_t}\right) \tag{1}$$
Using tf-idf, we can build a book-by-tag matrix, $M$. In $M$,
each row represents a book, each column represents a tag, and
each entry represents the corresponding tf-idf value.
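As a rough illustration of how the book-by-tag tf-idf values might be computed, the sketch below applies the weighting to a toy set of tag counts. The function name `tfidf_matrix` and the nested-dictionary layout are assumptions, and the natural logarithm is used because the text does not state the base.

```python
import math

def tfidf_matrix(tag_counts):
    """tag_counts: {book: {tag: frequency}} -> {book: {tag: tf-idf weight}}."""
    n_books = len(tag_counts)            # total number of books
    books_per_tag = {}                   # number of books each tag appears on
    for tags in tag_counts.values():
        for tag in tags:
            books_per_tag[tag] = books_per_tag.get(tag, 0) + 1
    return {
        book: {
            # frequency discounted by how widespread the tag is
            tag: freq * math.log(1 + n_books / books_per_tag[tag])
            for tag, freq in tags.items()
        }
        for book, tags in tag_counts.items()
    }

# Toy counts: "fiction" appears on every book, "dragons" on only one.
counts = {
    "A": {"fiction": 10, "dragons": 10},
    "B": {"fiction": 10},
}
weights = tfidf_matrix(counts)
```

With equal raw frequencies, the rarer tag `dragons` receives the larger weight, which is the discounting behaviour described in the text.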
Next, we group similar tags together. We do this by
combining results from two similarity measures.
The first similarity measure is derived from the co-oc-
currence of tags in books. That is, if two tags occur in sim-
ilar books, the tags are likely to share similar meanings and
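belong to the same genre. We compute a low-rank approximation
of the book-by-tag matrix $M$, denoted $\hat{M}$. Formally, we
minimize the following objective:
$$\hat{M} = \operatorname*{arg\,min}_{\hat{M}} \lVert M - \hat{M} \rVert_F \quad \text{s.t.} \quad \operatorname{rank}(\hat{M}) \le k \tag{2}$$
where $k$ is the desired rank of $\hat{M}$ and $\lVert \cdot \rVert_F$ is the
Frobenius norm. The minimization is achieved using singular
value decomposition. Each tag $t$ is represented as a column
vector $v_t$ in $\hat{M}$. The similarity between two tags $t$ and $t'$ is
then computed as the cosine of the angle in between:
$$\text{similarity}(t, t') = \frac{v_t \cdot v_{t'}}{\lVert v_t \rVert \, \lVert v_{t'} \rVert} \tag{3}$$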
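The cosine measure can be written directly from its definition. The sketch below is illustrative only: the vectors are made-up stand-ins for tag columns of the low-rank matrix (the decomposition itself is not shown), and returning 0.0 for an all-zero vector is a convention chosen here, not something stated in the text.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_u * norm_v)

# Hypothetical tag vectors: each entry is a tag's weight in one book.
children = [1.0, 2.0, 0.0]
kids = [2.0, 4.0, 0.0]      # same books, same proportions
cookbooks = [0.0, 0.0, 3.0]  # entirely different books

sim_related = cosine_similarity(children, kids)
sim_unrelated = cosine_similarity(children, cookbooks)
```

Tags that occur on the same books in the same proportions score close to 1, while tags with no books in common score 0.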
Although the above captures co-occurrence between
tags, we also want to directly capture lexical similarity.
Thus, we derive a second similarity measure based on
shared words between tags (e.g. between historical-
novel and historical-fiction). Each word in a tag is first
lemmatized using ClearNLP [21]. As in co-occurrence
similarity, we compute a tag-by-word matrix using tf-idf to
discount frequent words, followed by a low-rank
approximation of the matrix. Similarity between tags is then
computed as the cosine similarity between row vectors in this
matrix.
Overall similarity is computed as a weighted sum of co-
occurrence-based (95% weight) and word-lemma-based
(5% weight) similarities. Then, we use the OPTICS cluster-
ing algorithm [22] to cluster similar tags together and to
discard tags that do not fit into any cluster. A round of
manual coding is performed to correct any errors in the
clustering, resulting in a total of 396 tag clusters, where
each tag cluster corresponds to a single semantic meaning.
Each tag cluster is then labelled with a semantically repre-
sentative tag as a label for book content, and henceforth
treated as a single tag for analysis against personality. Spe-
cifically, to consolidate tags belonging to the same cluster,
the median of their tf-idf values is used.
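Two small pieces of this step lend themselves to a short sketch: the weighted combination of the two similarity measures and the consolidation of a cluster's tags into a single feature by taking the median. The function names and data layout below are hypothetical, the clustering itself (OPTICS plus manual coding) is not reproduced, and treating a tag that is absent from a book as having a tf-idf value of 0 is an assumption.

```python
from statistics import median

def combined_similarity(cooccurrence_sim, lemma_sim, cooccurrence_weight=0.95):
    """Weighted sum of the co-occurrence and word-lemma similarities."""
    return (cooccurrence_weight * cooccurrence_sim
            + (1.0 - cooccurrence_weight) * lemma_sim)

def consolidate_clusters(book_tfidf, clusters):
    """Collapse each tag cluster into one feature for a single book.

    book_tfidf -- {tag: tf-idf value} for one book
    clusters   -- {cluster label: [member tags]}
    """
    return {
        label: median(book_tfidf.get(tag, 0.0) for tag in members)
        for label, members in clusters.items()
    }

# Toy example with made-up values.
overall = combined_similarity(cooccurrence_sim=1.0, lemma_sim=0.0)
book = {"kids": 2.0, "children": 4.0, "juvenile": 3.0, "dragons": 7.0}
clusters = {"children": ["kids", "children", "juvenile"], "dragons": ["dragons"]}
features = consolidate_clusters(book, clusters)
```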
As a single Facebook page can contain multiple books
on Goodreads, we consolidate the book data by manually
mapping Goodreads books to Facebook pages. For each
book, we first normalize its feature vector comprising tf-
idf values to unit length. Next, feature vectors of books re-
ferred to by the same Facebook page are summed; the re-
sulting summed vector is then normalized to unit length
again.
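The normalize, sum, and re-normalize procedure can be sketched as follows. The helper names `unit` and `page_vector` are hypothetical, and leaving an all-zero vector unchanged is a choice made for this sketch.

```python
import math

def unit(vec):
    """Scale a vector to unit Euclidean length (zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def page_vector(book_vectors):
    """Combine the feature vectors of all books mapped to one Facebook page."""
    normalized = [unit(v) for v in book_vectors]        # step 1: unit length
    summed = [sum(col) for col in zip(*normalized)]     # step 2: element-wise sum
    return unit(summed)                                 # step 3: unit length again

# Two books on one page; the second has much larger raw tf-idf values.
combined = page_vector([[3.0, 0.0], [0.0, 10.0]])
combined_norm = math.sqrt(sum(x * x for x in combined))
```

Because each book is scaled to unit length before summing, a heavily tagged book does not dominate the page's vector: in the toy example both features end up equal.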
4 ANALYSIS
We conduct a two-level analysis to examine how personality
predicts book content preferences at the tag level and at
the genre level. We also analyze how personality could in-
fluence one’s general tendency to like books.
4.1 Tag-Level Analysis
We compute correlations between the tf-idf values of each
tag cluster and each of the Big Five personality dimensions.
Next, we perform lasso regression to predict personality
from tag cluster features. Unlike regular regression, lasso
regression maintains a higher prediction accuracy despite
correlations between features (i.e. multi-collinearity)
through regularization. For each personality trait, we use
the regularization coefficient yielding the lowest mean
squared test error from a 10-fold cross-validation.
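To make the role of the L1 penalty concrete, the following sketch fits a lasso model by cyclic coordinate descent on a toy problem. It is a didactic stand-in and not the authors' implementation, which the text does not describe: the function names are hypothetical, there is no intercept or feature standardization, the penalty `alpha` is fixed by hand, and the cross-validated choice of the regularization coefficient is omitted.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; small values become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimise (1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
            rho /= n
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
    return w

# Toy data: the target depends only on the first feature (y = 2 * x0).
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.0, -2.0, 2.0, -2.0]
w_mild = lasso_coordinate_descent(X, y, alpha=0.5)
w_strong = lasso_coordinate_descent(X, y, alpha=3.0)
```

With a mild penalty the irrelevant feature receives a weight of exactly zero and the relevant one is shrunk (from 2 to 1.5); with a penalty that is too strong, every coefficient is zeroed out. This is why the penalty strength has to be tuned, for example by cross-validation as in the text.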
We also perform the same prediction using a random
forest regression with 500 trees. The technique involves
simulating different combinations of features in multiple
decision trees to select the best combination of features that
predicts personality. As determined by cross validation,
each tree utilizes 132 variables selected randomly. With the
random forest regression, we compute the importance of
each tag cluster feature based on the increase in mean
squared prediction error when that feature is removed.
4.1.1 Results
Table 1 shows the tag clusters that are most strongly
correlated with each personality trait. All correlations shown are
statistically significant at $p < 0.05$ and most are significant
at $p < 0.001$. The most positive correlation is between the
back-burner tag cluster and the openness trait (H + EFIJ),
while the most negative correlation is between the light-
fantasy tag cluster and openness (H + 8EFIK).
Table 2 shows the tag clusters with the biggest absolute
coefficients in the five lasso regression analyses predicting
scores for each personality trait. Based on the $R^2$ values
TABLE 1
TAG CLUSTERS MOST CORRELATED WITH PERSONALITY
$r$ = correlation coefficient; * $p < 0.05$; ** $p < 0.01$; *** $p < 0.001$.
from each analysis, book content seems best at predicting
scores on the openness trait.
Fig. 2 shows the tag clusters that result in the largest de-
creases in mean squared error in the five random forest re-
gression analyses predicting scores for each personality
trait. Green and red colors represent positive and negative
correlations respectively between clusters and traits. Note
that most of the top predictive tag clusters for agreeable-
ness have positive correlations with the trait.
Results from correlation, lasso regression and random
forest regression analyses are largely consistent. For example,
fantasy-sci-fiction has a strong negative correlation
with extraversion (Table 1), and this is supported by
the lasso regression predicting extraversion, which shows
strong negative coefficients for fantasy settings such as
parallel-world and forgotten_realms (Table 2). This is
again supported by random forest regression findings that
the second most important variable in predicting
extraversion is fantasy-sci-fiction (Fig. 2). Differences between
Tables 1 and 2 may be attributed to the use of
L1-regularization in lasso. The regularization penalizes the sum
of the absolute values of the coefficients, which shrinks many
coefficients to exactly zero and leads the algorithm to assign
weights to tags that are not strongly correlated with each
other.
While results from lasso and random forest regressions
are consistent, their $R^2$ values for each personality trait
differ. For example, predictions of conscientiousness scores
have the lowest $R^2$ for lasso regression, but the second
highest $R^2$ for random forest regression. This difference
may be explained by the linearity constraint of lasso
regression: if the relationship between tag features and
personality scores is non-linear, the $R^2$ of the lasso regression
may be lower.
4.1.2 Discussion
Overall, we find that book preferences can potentially be
used to predict personality traits:
Extraversion. As expected, our findings suggest that
extraverts enjoy books with social themes, as described by
tags like relationships and chick lit. They also
seem interested in reading about the lives of others, from
memoirs to celebrity romance. Curiously, preference for
African American literature is also associated with being
extraverted. This may be explained by African Americans
themselves being more extraverted than white Americans
[23]. Since we did not record race in our study, we cannot
rule out this explanation. On the other hand, introverts
seem to prefer books with themes such as fantasy, science
fiction, and supernatural forces, exhibiting a tendency to
indulge in imagination. Appreciation of Japanese culture,
especially manga and comics, is also associated with intro-
version. In general, book preferences explain a substantial
amount of variation in the extraversion dimension, con-
sistent with the consensus that extraversion is typically a
more salient trait to measure.
Agreeableness. Our findings suggest that agreeable
people enjoy books with family and religious themes, both
of which promote positive social relationships. On the flip
side, disagreeable individuals seem attracted to dark-
themed content such as psychological dramas. Cult clas-
sics, known for their controversial narratives, also seem ap-
pealing to these individuals who may have fewer qualms
about resisting popular opinion. Books with content re-
lated to Japan, Italy, and Russia are also read by people
who are less agreeable, possibly because people from these
cultures tend to score lower on agreeableness compared to
Americans [23]. Interestingly, most of the top tags predict-
ing agreeableness are positively correlated with the trait.
The absence of consistent tags endorsed by disagreeable
people suggests that these people also tend to disagree on
what they ‘liked’.
Openness. Open individuals seem to enjoy intellectu-
ally challenging books that the average person may find
difficult to complete (e.g. back-burners). Their preference
for classic literature further reinforces this view, as books
of this genre usually take substantial effort to finish. This
is consistent with past studies that found openness to be
TABLE 2
TOP LASSO REGRESSION COEFFICIENTS IN PERSONALITY
PREDICTION
$R^2$ = coefficient of determination, or the proportion of variance explained by
the lasso regression model; $\beta$ = regression coefficient.
highly correlated with appreciation for art and literature
[13]. Our results also show that individuals scoring lower
on openness prefer mainstream content that is less
cognitively taxing and easier to digest, such as light-fantasy.
Content related to Christianity and India is also preferred
by readers with low openness, likely due to religious
individuals [24] and Indians [23] scoring low on this trait.
Neuroticism. Neurotic individuals seem to indulge in
narratives that reflect their own emotional states, such as
Fig. 2. Top tags that resulted in largest decreases in mean squared error (MSE) in random forest regression analysis for each personality
trait. Colors show the correlation between tag and personality trait (red is negative while green is positive). (a) extraversion (LM+ EFOG); (b)
agreeableness (LM+ EFIP); (c) openness (LM+ EFPQ); (d) neuroticism (LM+ EFO/); (e) conscientiousness (LM+ EFOQ).
sad endings and mental issues. They also appear to en-
joy books on alternative realities, in line with the hypothe-
sis that these genres provide a means of escape [4]. Inter-
estingly, neurotic individuals like books with pretty co-
vers, possibly due to a gender effect as females tend to
score higher in neuroticism than males [25]. On the other
hand, emotionally stable individuals prefer
self-improvement and other non-fiction content that better reflects
reality. In general, we found book preferences to be good
predictors of neuroticism, explaining as much as 59% of the
variance in this dimension.
Conscientiousness. Hardworking people appear to
prefer informative content that contributes to their profes-
sional development, or that simply boosts their knowledge
[6], [15]. On the other hand, people with low conscientious-
ness scores tend to like lighthearted content (e.g. humor)
and books aimed at youths (e.g., teenage-books). This can
be explained by how teenagers tend to score lower than the
middle-aged in conscientiousness [26].
In sum, our results show how book preferences can be
used to predict one’s personality. Besides personality, our
findings also reveal cultural differences in book prefer-
ences, further supporting the utility of online, user-gener-
ated data in deducing more comprehensive profiles of tar-
get audiences.
4.2 Genre-Level Analysis
Conclusions from our tag-level analyses are based on finer
descriptions of book content rather than traditional genres.
To test the integrity of tags as book content descriptors, we
further group tag clusters into broader genres, which are
then used to predict personality scores again.
To obtain genre clusters, we compute Pearson’s correlation
between the tf-idf values of books and use it to derive the
dissimilarity (i.e. distance) between books. Next, the books
are clustered using the Partitioning Around Medoids
(PAM) algorithm, a form of k-medoid clustering [27]. Like
k-means, PAM aims to minimize the distance between
cluster members and their respective cluster centers
through an iterative algorithm. Unlike k-means, however,
PAM assigns actual data points as cluster centers. Hence, it
is more robust to noise and outliers than k-means because
it minimizes the sum of pairwise dissimilarities rather than
the sum of squared Euclidean distances.
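The swap phase of PAM can be sketched on a precomputed distance matrix as shown below. This is a simplified illustration and not the authors' setup: the function names are hypothetical, the usual build phase is replaced by a naive initialization (the first k points), and the toy distances are absolute differences between points on a line instead of correlation-based dissimilarities between books.

```python
def total_cost(dist, medoids):
    """Sum over all points of the distance to their nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))

def pam(dist, k):
    """Greedy medoid swapping on a symmetric distance matrix."""
    n = len(dist)
    medoids = list(range(k))  # assumption: naive start instead of PAM's build phase
    best = total_cost(dist, medoids)
    improved = True
    while improved:
        improved = False
        for slot in range(k):
            for candidate in range(n):
                if candidate in medoids:
                    continue
                trial = medoids[:slot] + [candidate] + medoids[slot + 1:]
                cost = total_cost(dist, trial)
                if cost < best - 1e-12:   # accept only strict improvements
                    medoids, best, improved = trial, cost, True
    # Assign every point to the cluster of its nearest medoid.
    labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
    return medoids, labels

# Toy data: two well-separated groups of points on a line.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
medoids, labels = pam(dist, k=2)
```

Each cluster center is one of the original data points (here the middle point of each group), which is the property that distinguishes k-medoids from k-means.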
To determine the optimal number of clusters, we use
silhouette width, a measure for data points' similarity within
their assigned cluster against their similarity to points in
other clusters. For data point $i$, we let $a(i)$ denote the
average distance between $i$ and all other data points in the
cluster that $i$ is assigned to, and $b(i)$ be the lowest average
distance of $i$ to any other cluster. Silhouette $s(i)$ is defined as
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \tag{4}$$
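The silhouette definition translates almost line by line into code. In the sketch below the function name and the toy data are assumptions, at least two clusters are required, and assigning a width of 0 to a point that is alone in its cluster is a common convention adopted here, not something stated in the text.

```python
def silhouette_widths(dist, labels):
    """Silhouette s(i) for every point, given a distance matrix and labels."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    widths = []
    for i in range(len(dist)):
        own = [j for j in clusters[labels[i]] if j != i]
        if not own:
            widths.append(0.0)  # convention for singleton clusters
            continue
        # a(i): average distance to the other members of the point's own cluster
        a = sum(dist[i][j] for j in own) / len(own)
        # b(i): lowest average distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for lab, members in clusters.items() if lab != labels[i]
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths

# Toy data: points on a line, scored under a sensible and a poor clustering.
points = [0, 1, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]
good = silhouette_widths(dist, [0, 0, 1, 1])
poor = silhouette_widths(dist, [0, 1, 0, 1])
mean_good = sum(good) / len(good)
mean_poor = sum(poor) / len(poor)
```

Comparing the mean (or median) width across candidate numbers of clusters is the selection procedure the text describes; the sensible grouping scores close to 1 and the poor one scores below 0.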
We examined results for 4 to 30 clusters, and eventually
chose the 27-cluster solution as it yielded large mean and
median silhouette widths across all clusters. These clusters
also represented a diverse range of genres that enable com-
parison with past literature. The composition of each genre
in terms of tags, as well as the personality profile of each
genre, are presented in the following results section.
4.2.1 Results
Genre clusters are given labels that are representative of
their member tags. Top tags from example clusters are
shown in Table 3. These are the tags that appear most
frequently in a genre relative to the entire dataset.
For each genre, we took the median personality scores
of all books in that genre cluster, thus generating an overall
personality profile for that genre. Fig. 3 shows the
aggregated personality profiles for all 27 genres. Sizes of pie chart
slices are normalized to zero mean and unit variance.
A principal component analysis was performed on the
aggregated personality scores across genres, and we found
that the openness and conscientiousness traits captured
the most variation in genre profiles. Thus, for visualization
purposes, we plot book genres for these two dimensions in
Fig. 4.
TABLE 3
TOP TAGS IN EXAMPLE CLUSTERS
4.2.2 Discussion
Results from the genre-level analysis are consistent with
findings from both existing literature and our tag-level
analysis. For instance, people who like Self-improvement
books are more conscientious than those who like Comics,
and people who like Philosophy books are more open than
those who like Religious books.
There is one exception, however. While a previous
study [6] found that extraverts like horror, our findings
suggest the opposite: books with horror themes seem to
appeal more to introverts. This discrepancy may be due to
the mode of narrative: while our study focuses on books,
the other study had examined television shows in addition
to books. While horror TV shows may be highly stimulat-
ing, the arousal may be muted in books, explaining the
lower preference for horror books among our extraverted
respondents.
Apart from confirming our earlier results, genre clusters
also give new insights. For example, the Thriller cluster
contains detective and legal elements, which require criti-
cal thinking and perhaps even background knowledge on
law for a reader to fully appreciate the plot. This may ex-
plain why readers who like mystery books also score
higher on conscientiousness. Another interesting observa-
tion is that the Classics cluster has an average profile for all
five personality traits. This cluster contains time-honored
and household favorites, which would have appealed to
most people regardless of personality, thus resulting in a
profile that reflects the sample average.
Cultural differences are also apparent. People who like
Asian books are less open [23], consistent with results from
our tag-level analysis. However, people who like Asian
books are also relatively extraverted, which runs contrary
to claims that Asians are more introverted [28]. This dis-
crepancy may be due to Facebook being more attractive to
extraverted individuals in the first place [29], thus result-
ing in a more extraverted Asian user base.
We have shown how personality profiles of readers can
be inferred from their preferred content, at both the tag and
genre level. A detailed tag-level analysis can provide more
resolution on book content, while a broader genre-level
analysis can identify associations between tags.
4.3 General Reading Disposition
Since personality has been found to correlate with book
preferences, it may also correlate with the tendency to like
books in the first place. To examine this, we compute cor-
relations between users’ personality scores and the num-
ber of book pages they ‘liked’.
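A correlation of this kind can be computed with a few lines of Python. The sketch below implements Pearson's r from its definition; the function name and the toy numbers are made up for illustration and are not taken from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical users: an openness score and a count of book pages 'liked'.
openness = [2.0, 3.0, 4.0, 5.0]
books_liked = [1, 4, 3, 6]
r_toy = pearson_r(openness, books_liked)

# Sanity checks on perfectly linear data.
r_up = pearson_r([1, 2, 3], [2, 4, 6])
r_down = pearson_r([1, 2, 3], [3, 2, 1])
```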
It turns out that correlations are very weak (|r| < 0.06)
across four of the five traits: conscientiousness, extraver-
sion, neuroticism and, importantly, agreeableness. Alt-
hough previous studies found that agreeable people tend
to evaluate content favorably [10], our study finds a near-
zero correlation between agreeableness and the number of
books ‘liked’ on Facebook (r = -0.02). A possible explana-
tion may be that while agreeable people are less likely to
express dislike to avoid disagreements, they may nonethe-
less only ‘like’ a book when they genuinely enjoy the con-
tent.
The openness trait, on the other hand, is a comparatively
stronger and statistically significant predictor (r = 0.12,
p < 0.001) of the number of books ‘liked’. This result lends support to our earlier
hypothesis: Open individuals appreciate a wider variety of
books, and thus ‘like’ more books on average.
5 LIMITATIONS
We acknowledge that reliance on web data may lead to a
few limitations. First, only popular books that had a Face-
book page with sufficient ‘likes’ are included in the analy-
sis. Hence, newer or niche books on the heavy tail of a book
popularity distribution may be overlooked. Second, Face-
book users have been found to be more extraverted, more
narcissistic, and less conscientious than average, and hence
they may not be representative of the general population
[29]. Third, Facebook ‘likes’ may be driven by the need for
social acceptance or recognition [30], [31], and thus may
not be a faithful reflection of a person’s preferences.
However, because our findings are in line with existing
literature, the above concerns are unlikely to have been sig-
nificant enough to skew results. In fact, despite sources of
noise and idiosyncrasies, user-generated tags have proven
to be a rich well of information that not only enabled us to
dive deeper into sub-genre preferences, but also to explore
broader preference-related behaviors.
6 CONCLUSIONS
Findings from our study are consistent across both tag and
genre levels of analyses, and also in line with existing liter-
ature, thus demonstrating the utility of online user-gener-
ated data in profiling target audiences. Besides predicting
personality from book preferences, user tags allow us to
uncover unexpected insights, such as cultural differences,
book reading behaviors (e.g. ‘back-burner’), and other
non-content factors affecting preferences (e.g. ‘pretty covers’).
Fig. 4. Genres on the Openness-Conscientiousness dimensions.
Future research can incorporate additional dimensions
such as year of publication, which may allow us to track
the evolution of genres. For instance, vanilla love stories in
the romance genre seem to be increasingly overtaken by
vampire-related themes, with series such as Twilight (2005-2008)
and Vampire Academy (2007-2010). Trends like these
may be overlooked if books are analyzed by genre instead
of tags. Another possible avenue of research may be to ex-
amine popular combinations of tags within books. Find-
ings may help authors identify unique tag combinations to
spin fresh story plots.
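One simple way to begin examining tag combinations is to count how often pairs of tags co-occur on the same book. The sketch below assumes a hypothetical mapping from book identifiers to tag lists; the function name, the book identifiers, and the tags are illustrative and do not come from the Goodreads data.

```python
from collections import Counter
from itertools import combinations


def count_tag_pairs(book_tags):
    """Count how many books carry each unordered pair of tags."""
    pair_counts = Counter()
    for tags in book_tags.values():
        # Deduplicate and sort so each pair has a single canonical ordering.
        for pair in combinations(sorted(set(tags)), 2):
            pair_counts[pair] += 1
    return pair_counts


# Hypothetical toy catalogue; identifiers and tags are illustrative only.
book_tags = {
    "book_a": ["romance", "vampires", "young-adult"],
    "book_b": ["romance", "vampires"],
    "book_c": ["romance", "historical"],
}
pair_counts = count_tag_pairs(book_tags)
top_pair, top_count = pair_counts.most_common(1)[0]
```

Frequent pairs indicate well-trodden combinations, while pairs that are rare or absent may point to less explored ones.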
With growing online activity, we believe that large, user-
generated datasets, as well as the ability to parse them ef-
fectively, can play an important role in the study of arts
and social sciences fields, such as literature, psychology,
and marketing.
ACKNOWLEDGMENT
The authors thank Dr. Michal Kosinski for his valuable
feedback, and Goodreads.com for allowing us to crawl for
data.
REFERENCES
[1] M. Djikic, K. Oatley, S. Zoeterman, and J. B. Peterson, "On being
moved by art: How reading fiction transforms the self," Creativity Re-
search Journal, vol. 21, no. 1, pp. 24–29, Feb. 2009.
[2] R. A. Mar, K. Oatley, and J. B. Peterson, "Exploring the link between
reading fiction and empathy: Ruling out individual differences and ex-
amining outcomes," Communications, vol. 34, no. 4, Jan. 2009.
[3] W. C. Tirre and S. Dixit, "Reading interests: Their dimensionality and
correlation with personality and cognitive factors," Personality and In-
dividual Differences, vol. 18, no. 6, pp. 731–738, Jun. 1995.
[4] G. Kraaykamp and K. van Eijck, "Personality, media preferences, and
cultural participation," Personality and Individual Differences, vol. 38,
no. 7, pp. 1675–1688, May 2005.
[5] I. Cantador, I. Fernández-Tobías, and A. Bellogín, "Relating personal-
ity types with user preferences in multiple entertainment domains,"
CEUR Workshop Proceedings, vol. 997, 2013.
Fig. 3. Personality profiles of genres.
[6] P. Rentfrow, L. Goldberg, and R. Zilca, "Listening, watching, and read-
ing: The structure and correlates of entertainment preferences," Journal
of Personality, vol. 79, no. 2, pp. 223–258, Jul. 2010.
[7] M. Galesic and M. Bosnjak, "Effects of questionnaire length on partic-
ipation and indicators of response quality in a web survey," Public
Opinion Quarterly, vol. 73, no. 2, pp. 349–360, Jan. 2009.
[8] M. Kosinski, D. Stillwell, and T. Graepel, "Private traits and attributes are
predictable from digital records of human behavior," Proceedings of the
National Academy of Sciences, vol. 110, no. 15, pp. 5802–5805, Sep.
2013.
[9] P. T. Costa and R. R. McCrae, "The revised NEO personality inventory
(NEO-PI-R)," The SAGE Handbook of Personality Theory and Assessment,
vol. 2, pp. 179–198, 2008.
[10] J. B. Weaver, H.-B. Brosius, and N. Mundorf, "Personality and movie
preferences: A comparison of American and German audiences," Per-
sonality and Individual Differences, vol. 14, no. 2, pp. 307–315, Feb.
1993.
[11] M. T. Soto-Sanfiel, L. Aymerich Franch, and E. Romero, "Personality
in interaction: How the Big Five relate to the reception of interactive
narratives," Comunicación y sociedad = Communication & Society,
vol. 27, no. 3, pp. 151–186, 2014.
[12] A. Furnham and J. Walker, "Personality and judgements of abstract,
pop art, and representational paintings," European Journal of Personal-
ity, vol. 15, no. 1, pp. 57–72, Jan. 2001.
[13] I. C. McManus and A. Furnham, "Aesthetic activities and aesthetic at-
titudes: Influences of education, background and personality on interest
and involvement in the arts," British Journal of Psychology, vol. 97, no.
4, pp. 555–587, Nov. 2006.
[14] J. C. Conway and A. M. Rubin, "Psychological predictors of television
viewing motivation," Communication Research, vol. 18, no. 4, pp. 443–
463, Aug. 1991.
[15] A. S. Gerber, G. A. Huber, D. Doherty, and C. M. Dowling, "Personal-
ity traits and the consumption of political information," American Pol-
itics Research, vol. 39, no. 1, pp. 32–84, Sep. 2010.
[16] L. R. Goldberg et al., "The international personality item pool and the
future of public-domain personality measures," Journal of Research in
Personality, vol. 40, no. 1, pp. 84–96, Feb. 2006.
[17] S. C. Rife, K. L. Cate, M. Kosinski, and D. Stillwell, "Participant re-
cruitment and data collection through Facebook: The role of personality
factors," International Journal of Social Research Methodology, pp. 1–
15, Sep. 2014.
[18] M. Kosinski, Y. Bachrach, P. Kohli, D. Stillwell, and T. Graepel, "Man-
ifestations of user personality in website choice and behaviour on online
social networks," Machine Learning, vol. 95, no. 3, pp. 357–380, Jan.
2014.
[19] G. Park et al., "Automatic personality assessment through social media
language," Journal of Personality and Social Psychology, vol. 108, no.
6, pp. 934–952, 2015.
[20] W. Youyou, M. Kosinski, and D. Stillwell, "Computer-based personal-
ity judgments are more accurate than those made by humans," Proceed-
ings of the National Academy of Sciences, vol. 112, no. 4, pp. 1036–
1040, Jan. 2015.
[21] J. D. Choi and M. Palmer, "Fast and robust part-of-speech tagging using
dynamic model selection," Association for Computational Linguistics,
2012, pp. 363–367.
[22] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander, "OPTICS:
ordering points to identify the clustering structure," ACM SIGMOD
Record, vol. 28, no. 2, pp. 49–60, Jan. 1999.
[23] J. Allik and R. R. McCrae, "Toward a geography of personality traits:
Patterns of profiles across 36 cultures," Journal of Cross-Cultural Psy-
chology, vol. 35, no. 1, pp. 13–28, Jan. 2004.
[24] V. Saroglou, "Religion and the five factors of personality: A meta-ana-
lytic review," Personality and Individual Differences, vol. 32, no. 1, pp.
15–25, Jan. 2002.
[25] D. P. Schmitt, A. Realo, M. Voracek, and J. Allik, "Why can’t a man
be more like a woman? Sex differences in Big Five personality traits
across 55 cultures," Journal of Personality and Social Psychology, vol.
94, no. 1, pp. 168–182, 2008.
[26] M. B. Donnellan and R. E. Lucas, "Age differences in the Big Five
across the life span: Evidence from two national samples," Psychology
and Aging, vol. 23, no. 3, pp. 558–566, 2008.
[27] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Intro-
duction to Cluster Analysis (Wiley Series in Probability and Statistics).
John Wiley & Sons, 2008, ch. 2, pp. 68–125.
[28] D. P. Schmitt, J. Allik, R. R. McCrae, and V. Benet-Martinez, "The
geographic distribution of Big Five personality traits: Patterns and profiles
of human self-description across 56 nations," Journal of Cross-
Cultural Psychology, vol. 38, no. 2, pp. 173–212, Mar. 2007.
[29] T. Ryan and S. Xenos, "Who uses Facebook? An investigation into the
relationship between the Big Five, shyness, narcissism, loneliness, and
Facebook usage," Computers in Human Behavior, vol. 27, no. 5, pp.
1658–1664, Sep. 2011.
[30] H. Gangadharbatla, "Facebook me: Collective self-esteem, need to belong,
and Internet self-efficacy as predictors of the iGeneration’s attitudes
toward social networking sites," Journal of Interactive Advertising,
vol. 8, no. 2, pp. 5–15, Mar. 2008.
[31] L. E. Buffardi and W. K. Campbell, "Narcissism and social networking
web sites," Personality and Social Psychology Bulletin, vol. 34, no. 10,
pp. 1303–1314, Jul. 2008.
Ng Annalyn was a research associate with Disney Research Pitts-
burgh and is currently employed by Singapore’s Ministry of Defence.
She received her M.Phil. degree in Psychology from the University of
Cambridge, where she mined consumer data for targeted advertising
and programmed cognitive tests for job recruitment with the Cam-
bridge Psychometrics Centre. She has a B.Sc. in Psychology and
Economics from the University of Michigan (Ann Arbor), where she
was also an undergraduate statistics tutor. Her research interests in-
clude machine learning applications in social sciences. She is the au-
thor of the book: “Numsense! Data Science for the Layman”.
Maarten W. Bos received his MS degree in Social Psychology from
the University of Amsterdam, and his PhD degree from the Radboud
University in The Netherlands. After a postdoctoral fellowship at the
Harvard Business School, he is currently a Research Scientist at Dis-
ney Research. His research interests include decision science and
behavioral economics. He has high-impact publications in decision
science, and he is a member of the Society for Personality and Social
Psychology, the Association for Psychological Science, and the Soci-
ety for Judgment and Decision Making.
Leonid Sigal is a Senior Research Scientist at Disney Research Pitts-
burgh and an adjunct faculty at Carnegie Mellon University. Prior to
this he was a postdoctoral fellow in the Department of Computer Science
at University of Toronto. He completed his Ph.D. at Brown University
in 2008; he received his B.Sc. degrees in Computer Science
and Mathematics from Boston University (1999), his M.A. from Boston
University (1999), and his M.S. from Brown University (2003). From
1999 to 2001, he worked as a senior vision engineer at Cognex Cor-
poration, where he developed industrial vision applications for pattern
analysis and verification. Leonid’s research interests mainly lie in the
areas of computer vision, machine learning, and computer graphics.
He has published more than 50 peer-reviewed papers in venues and
journals in these fields (including publications in PAMI, IJCV, CVPR,
ICCV, ECCV, NIPS, UAI, and ACM SIGGRAPH). His work received
the Best Paper Awards at the AMDO conference in 2006 and 2012 and
at WACV in 2014. He has also coedited the book Guide to Visual An-
alytics of Humans: Looking at People (Springer, 2011).
Boyang "Albert" Li is a Research Scientist at Disney Research,
where he directs the Narrative Intelligence group. He obtained his
Ph.D. in Computer Science from Georgia Institute of Technology in
2014, and his B. Eng. from Nanyang Technological University, Singa-
pore in 2008. His research interests include computational narrative
intelligence, or the creation of Artificial Intelligence that can under-
stand, craft, tell, direct, and respond appropriately to narratives, and
understanding how human cognition comprehends narratives and
produces narrative-related affects. He has authored and co-authored
more than 30 peer-reviewed papers in international journals and con-
ferences.