DLATK: Differential Language Analysis ToolKit
H. Andrew Schwartz (Stony Brook University), Salvatore Giorgi (University of Pennsylvania), Maarten Sap (University of Washington), Patrick Crutchley (Qntfy), Johannes C. Eichstaedt (University of Pennsylvania), Lyle Ungar (University of Pennsylvania)
has@cs.stonybrook.edu, sgiorgi@sas.upenn.edu
Abstract
We present the Differential Language Analysis ToolKit (DLATK), an open-source Python package and command-line tool developed for conducting social-scientific language analyses. While DLATK provides standard NLP pipeline steps such as tokenization or SVM classification, its novel strengths lie in analyses useful for psychological, health, and social science: (1) incorporation of extra-linguistic structured information, (2) specified levels and units of analysis (e.g. document, user, community), (3) statistical metrics for continuous outcomes, and (4) robust, proven, and accurate pipelines for social-scientific prediction problems. DLATK integrates multiple popular packages (SKLearn, Mallet), enables interactive usage (Jupyter Notebooks), and generally follows object-oriented principles to make it easy to tie in additional libraries or storage technologies.
1 Introduction
The growth of NLP for social and medical sci-
ences has shifted attention in NLP research from
understanding language itself (e.g. syntactic pars-
ing or characterizing morphology) to understand-
ing how language use characterizes people (e.g.
by correlating language use characteristics with
traits of the person producing the language). Much
of this work has been done using Facebook and
Twitter (Coppersmith et al., 2014).
Analyzing language for social science applica-
tions requires different tools and techniques than
conventional NLP. Structured data are often ben-
eficial to facilitate the use of the extensive extra-
linguistic information such as the time and loca-
tion of the post and author demographics (or even
health or school records). Models can be made at
multiple levels of analysis: documents, users, and
different geographic (zip code, state or country) or
temporal resolutions. Many of the outcomes (or
dependent variables) are continuous (e.g. scores
on personality tests), and researchers are often as interested in interpretable insights as they are in predictive accuracy (Kern et al., 2014a).
There are small “tricks” to obtaining accurate predictive models or high correlations between language features and outcomes. Emoticon-aware tokenizers are needed, as are robust methods for creating LDA topics (different packages produce clusters of strikingly different quality), and subtle issues of regularization arise when combining demographic and language features in models. When these
choices are combined with the complexity of the
structured data, even NLP and data scientists can
fail to produce high quality models. We there-
fore built a platform that integrates a variety of
open-sourced tools, alongside our “tricks” and op-
timizations, to provide a well-documented, easy-
to-use program for undertaking reproducible re-
search in the area of NLP for the social sciences.
This software, which has been used for the data analysis behind 32 papers in psychology, health care, and NLP, is now available under a GPLv3 software license.1
2 Overall Framework
The core of DLATK is a Python library depicted
in Figure 1. The base class, DLAWorker, sits on
top of a data engine (e.g. MySQL, HDFS/Spark)
and is used to track corpus basics (corpus loca-
tion, unit-of-analysis). The next level of classes
acts on either messages, features, or outcomes.

1http://dlatk.wwbp.org or http://github.com/dlatk/
[Figure 1: Basic DLATK package class structure. The Data Engine stores the corpus, extra-linguistic information (outcomes, controls), and the unit of analysis. DLAWorker is the base generic class, working with a corpus given a unit of analysis (e.g. user_id, tweet_id). MessageAnnotator filters the corpus (language filtering, removing duplicates); MessageTransformer acts on the corpus’ text (parsing, sentence segmentation, ...). FeatureExtractor extracts continuous or discrete variables from the corpus; SemanticsExtractor extracts features from semantic annotations (semantic roles, named entities, ...); TopicsExtractor performs topic modeling or works with packages (Mallet) to do so; FeatureGetter works with language features; FeatureRefiner filters sets of language features (PMI, tf-idf, ...). OutcomeGetter works with extra-linguistic information; OutcomeAnalyzer analyzes extra-linguistic information jointly with features (DLA, mediation analysis, ...). For classification and prediction, ClassifyPredictor performs classification of binary outcomes given language features and controls, RegressionPredictor performs prediction of continuous outcomes given language features and controls, and DimensionReducer performs dimensionality reduction on features and outcomes (PCA, CCA, ...). FeatureStar instantiates dlatk objects for interactive use (e.g. creates pandas data frames).]
MessageAnnotator filters messages (removing duplicate tweets, language filtering, etc.) while MessageTransformer acts on message text (tokenizing, part-of-speech tagging, etc.). FeatureExtractor converts document text to features (ngrams, character ngrams, etc.) and is responsible for writing them, while FeatureGetters read features for downstream analysis or further refinement via FeatureRefiner.
OutcomeGetter reads outcome tables (i.e., extra-
linguistic information). Its child class Outcome-
Analyzer works with both linguistic and extra-
linguistic information for statistical analyses (cor-
relation, logistic regression, etc.) and various out-
puts (wordclouds, correlation matrices, etc.).
The bottom classes do not inherit from DLAWorker but instead make use of FeatureGetters and OutcomeGetters. These include two classes for prediction, RegressionPredictor and ClassifyPredictor, which carry out machine learning tasks: cross-validation, feature selection, training models, building data-driven lexica, etc., while DimensionReducer provides unsupervised transformations of language and outcomes. Finally, the FeatureStar class (“star” for wildcard) is used to interact with the other classes and transform important information into convenient data structures (e.g. Pandas dataframes).
3 Differential Language Analyses
The prototypical use of DLATK is to perform differential language analysis: the identification of linguistic features which either (a) independently explain the most variance in continuous outcomes or (b) are individually most predictive of discrete outcomes (Schwartz et al., 2013b). Unlike predictive techniques, where one seeks to produce outcome(s) given language (discussed next), here the goal is to produce the language that is most related to, or independently discriminant of, outcomes.2 DLATK supports several metrics for performing differential language analysis.
Continuous DLA Metrics. We support a vari-
ety of metrics for comparing language to contin-
uous outcomes (e.g. age, degree of depression,
personality factor scores, income). Primary metrics are based on the Pearson product-moment correlation coefficient (Agresti and Finlay, 2008). When one requests control variables (e.g. finding the relationship with degree of depression, controlling for age and gender), ordinary least squares linear regression is used (Rao, 2009), wherein the control variables are included alongside the linguistic variable as covariates and the outcome is the dependent variable.
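For illustration, that setup can be sketched with statsmodels (which DLATK builds on); the column names and data below are illustrative assumptions, not DLATK's internal code:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Illustrative user-level data: one linguistic feature (normalized usage of a
# topic or n-gram), two controls, and a continuous outcome (assumed names).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feat_topic42": rng.random(500),
    "age": rng.integers(18, 65, 500),
    "gender": rng.integers(0, 2, 500),
    "depression_score": rng.random(500) * 20,
})

# OLS with the linguistic feature and controls as covariates; the coefficient
# and significance of the feature are what a DLA-style analysis reports.
X = sm.add_constant(df[["feat_topic42", "age", "gender"]])
model = sm.OLS(df["depression_score"], X).fit()
print(model.params["feat_topic42"], model.pvalues["feat_topic42"])
```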
Discrete DLA Metrics. While linear regression produces meaningful results in most situations, it is often preferable to use other metrics for discrete or Bernoulli outcomes. Logistic regression can be used in place of linear regression where, by assuming a dichotomous outcome, statistical significance tests are usually more accurate (Menard, 2002). Where controls are not needed, there are many other options, often less computationally complex, such as TF-IDF, the Informative Dirichlet Prior,3 or classification accuracy metrics like the Area Under the ROC Curve (Fawcett, 2006).

2Even in basic prediction methods, like linear regression, the relationship between each linguistic feature and the outcome is complex, depending on the covariance structure among all the variables. DLA works in a univariate, per-feature fashion or with a limited set of control variables (e.g. age and gender when discriminating personality).
3Bayesian approach to log-odds (Monroe et al., 2008).
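As an illustration of a per-feature discrimination metric for a Bernoulli outcome, the following scikit-learn sketch computes one AUC per feature (simulated data; this is not DLATK's implementation):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative data: a binary outcome for 500 users and a matrix of
# normalized feature usage (rows = users, columns = language features).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
feats = rng.random((500, 2000))

# One AUC per feature: how well does that feature alone rank users by outcome?
aucs = np.array([roc_auc_score(y, feats[:, j]) for j in range(feats.shape[1])])
top = np.argsort(-np.abs(aucs - 0.5))[:20]  # features farthest from chance
print(top, aucs[top])
```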
Multiple Hypothesis Testing. Most of the met-
rics have a corresponding standard significance
test (e.g. Student’s t-test for Pearson correlation
and OLS regression), and most output confidence
intervals by default. Permutation testing has been implemented for many of the metrics without standard significance tests, such as AUC-ROC, with the linguistic feature vector shuffled relative to the outcome (and controls, if applicable) multiple times to create a null distribution. Standard practice in differential language analysis (Schwartz et al., 2013b) is to correlate each of potentially thousands of single features (e.g., normalized usage of one single- or multi-word expression) with a given outcome. Thus, correcting for multiple comparisons is critical. When used through the interface script, DLATK by default corrects for multiple comparisons using the Benjamini-Hochberg method of FDR correction (Benjamini and Hochberg, 1995). Other options, such as the more conservative Bonferroni correction (Dunn, 1961), are also available.
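A minimal sketch of the per-feature correlation plus Benjamini-Hochberg workflow, using scipy and statsmodels (simulated data; names and thresholds are illustrative, not DLATK internals):

```python
import numpy as np
from scipy.stats import pearsonr
from statsmodels.stats.multitest import multipletests

# Illustrative data: 500 users x 2000 language features, one continuous outcome.
rng = np.random.default_rng(0)
feats = rng.random((500, 2000))
outcome = rng.random(500)

# Correlate every feature with the outcome (univariate DLA).
rs, ps = zip(*(pearsonr(feats[:, j], outcome) for j in range(feats.shape[1])))

# Benjamini-Hochberg FDR correction across the thousands of tests.
reject, p_adj, _, _ = multipletests(ps, alpha=0.05, method="fdr_bh")
print(int(reject.sum()), "features survive FDR correction")
```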
4 Predictive Methods
As with traditional NLP, many social-scientific
research objectives can be framed as prediction
tasks, in which a model is fit to language features
to predict an outcome. DLATK implements many
available regression and classification tools, sup-
plemented with feature selection functions for re-
fining the feature space. A wide range of feature
selection techniques have been empirically refined
for accurate use in regression problems.
Feature selection. DLATK’s ClassifyPredictor
and RegressionPredictor classes include methods
for feature selection, which is critical given what
may be a very large space of linguistic features,
e.g., 100s of thousands of 1- to 3-grams in a cor-
pus. Both classes allow for pass-through of scikit-
learn Pipelines (e.g. univariate feature selection
based on feature correlation with outcome and
family-wise error) and dimensionality reduction
methods (e.g., PCA on feature matrix), including
combination methods where FS and DR steps are
applied to the original data in a serial manner.
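As an example of the kind of scikit-learn Pipeline that can be passed through, the following applies univariate selection (a family-wise error criterion on per-feature tests) followed by PCA; the hyperparameters and data are illustrative assumptions, not DLATK defaults:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFwe, f_regression
from sklearn.decomposition import PCA

# Illustrative data: 500 users x 2000 n-gram features; the outcome is made to
# depend on a handful of features so that some survive selection.
rng = np.random.default_rng(0)
feats = rng.random((500, 2000))
outcome = feats[:, :10].sum(axis=1) + rng.normal(scale=0.5, size=500)

# Feature selection and dimensionality reduction applied serially.
fs_dr = Pipeline([
    ("select", SelectFwe(f_regression, alpha=0.05)),
    ("pca", PCA(n_components=0.95)),  # keep 95% of remaining variance
])
reduced = fs_dr.fit_transform(feats, outcome)
print(reduced.shape)
```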
Regression Models. DLATK supports a variety of regression models that take in features as well as extra-linguistic information and output continuous-valued predictions. These include variants on penalized linear regression (Ridge, Lasso, Elastic-Net) as well as non-linear techniques such as Extremely Random Forests. A common pipeline, referred to as the “magic sauce”, applies univariate feature selection and PCA to the linguistic features independent of controls, and then uses ridge regression to fit a linear model from the combined reduced space to the outcomes.
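The sketch below approximates such a pipeline with scikit-learn components: selection and PCA applied only to the language columns, controls passed through, and ridge fit on the combined space. It is an illustration of the idea, not DLATK's exact "magic sauce" implementation, and all hyperparameters are assumptions:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

# Illustrative design matrix: columns 0-1999 are language features,
# columns 2000-2001 are extra-linguistic controls (e.g. age, gender).
rng = np.random.default_rng(1)
X = rng.random((500, 2002))
y = X[:, :10].sum(axis=1) + 0.5 * X[:, 2000] + rng.normal(scale=0.5, size=500)

# Reduce the language features (univariate selection + PCA) while passing the
# controls through untouched, then fit ridge regression on the combined space.
pipeline = Pipeline([
    ("reduce", ColumnTransformer(
        [("lang", Pipeline([
            ("select", SelectKBest(f_regression, k=100)),
            ("pca", PCA(n_components=0.95)),
        ]), list(range(2000)))],
        remainder="passthrough")),
    ("ridge", Ridge(alpha=1.0)),
])
pipeline.fit(X, y)
print(pipeline.score(X, y))
```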
Classification Models. DLATK implements a rich variety of classifiers, including Logistic Regression and Support Vector Classifiers with L1 and L2 regularization, as well as ensemble and gradient boosting techniques such as Extremely Randomized Trees. As with regression, these techniques have been set up to leverage extra-linguistic information effectively, either as additional predictors or as controls one attempts to “out-predict”.
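A hedged sketch of the "out-predict" comparison with scikit-learn, contrasting a controls-only baseline with a language-plus-controls model (simulated data; not DLATK's implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Illustrative data: language features, two controls, and a binary outcome.
rng = np.random.default_rng(0)
lang = rng.random((500, 1000))
controls = rng.random((500, 2))  # e.g. age, gender
y = rng.integers(0, 2, 500)

# Baseline: controls only; then language + controls, to see whether language
# adds predictive power beyond the extra-linguistic baseline.
base = cross_val_score(LogisticRegression(max_iter=1000), controls, y,
                       cv=5, scoring="roc_auc").mean()
full = cross_val_score(LogisticRegression(max_iter=1000),
                       np.hstack([lang, controls]), y,
                       cv=5, scoring="roc_auc").mean()
print(f"controls-only AUC: {base:.2f}, language+controls AUC: {full:.2f}")
```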
5 Notable Functionality
Linguistic information. Because DLATK was designed to exploit the full power of social media, a special emoticon-aware tokenizer is used, and Python’s unicode capabilities are leveraged throughout. Though not specifically designed to be language independent, DLATK has been used in one non-English study (Smith et al., 2016).
Extra-linguistic information. Most functionality in DLATK is designed with extra-linguistic information, also referred to as “outcomes”, in mind. Such information ranges from meta-information of social media posts, such as time or location, to user attributes such as demographics or strong baselines one may wish to out-predict. For DLA, this means that one not only distinguishes target extra-linguistic information, but that controls are available. For prediction, extra-linguistic information can be incorporated as input to a model, taking into account the fact that such attributes are often less sparse and more reliable descriptors of people than individual linguistic features.
Multiple Levels of Analysis. DLATK allows
one to work with a single corpus at multiple lev-
els of analysis, simply as a parameter to any ac-
tion. For example, one may choose to analyze
tweets themselves or group them by user id, lo-
cation, or even a combination of user and date.
Extra-linguistic information often dictates partic-
ular levels of analyses (e.g. community level mor-
tality rates or user-level personality questionnaire
responses). Analysis setups are flexible across levels of analysis; for example, one can dynamically threshold which units of analysis are included (e.g. only include users with at least 1,000 words or counties with 50,000 words).
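For illustration, switching the unit of analysis and thresholding can be sketched in pandas as follows (column names and the toy threshold are assumptions, not DLATK's schema):

```python
import pandas as pd

# Illustrative message-level table (column names are assumptions).
msgs = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "message": ["hello world", "good morning", "hi", "a b c", "d e", "f"],
})
msgs["n_words"] = msgs["message"].str.split().str.len()

# Switch the unit of analysis from message to user, then dynamically threshold
# which users are included (DLATK's example uses 1,000 words; here, 3).
words_per_user = msgs.groupby("user_id")["n_words"].sum()
kept_users = words_per_user[words_per_user >= 3].index
user_level = msgs[msgs["user_id"].isin(kept_users)]
print(user_level)
```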
Integration of Popular Packages. DLATK sits on top of many popular open-source packages used for data analysis and machine learning (scikit-learn (Pedregosa et al., 2011) and statsmodels (Seabold and Perktold, 2010)) as well as NLP-specific packages (the Stanford parser (Chen and Manning, 2014), TweetNLP (Gimpel et al., 2011), and NLTK (Loper and Bird, 2002)). LDA topics can be created with the Mallet (McCallum, 2002) interface. After creation, these topics can be used downstream in any standard DLATK analysis pipeline. The pip and conda package management systems control Python library dependencies.
Interactive Usage. The standard way to interact
with DLATK is with the interface script through
the command line. Often users will only see the two end points (the document input and the analysis output), and as a result the package is used as a “black box”. To encourage data exploration, the FeatureStar class converts the language features and extra-linguistic information into Pandas dataframes (McKinney, 2011), allowing users to import our methods into existing code. Sample
use cases include opening up predictive models to
explore feature coefficients and easily reading lin-
guistic data into standard data visualization tools.
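As an illustration of the kind of dataframe this produces, a long-format feature table can be pivoted to a wide units-by-features dataframe with pandas (the exact schema shown here is an assumption, not DLATK's storage format):

```python
import pandas as pd

# A long-format feature table (one row per unit-of-analysis x feature),
# similar in spirit to DLATK's feature tables; the schema is illustrative.
long_feats = pd.DataFrame({
    "group_id": [101, 101, 102, 102],
    "feat": ["happy", "work", "happy", "sad"],
    "value": [3, 1, 2, 4],
})

# Pivot to a wide units-by-features dataframe for exploration or modeling.
wide = long_feats.pivot_table(index="group_id", columns="feat",
                              values="value", fill_value=0)
print(wide)
```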
Visualization. When running DLA we often compute separate correlations over tens of thousands of language features. While a single word might not give us considerable insight into our extra-linguistic information, groups of words taken together can often tell a compelling story. To this end, DLATK offers wordcloud output in the form of n-gram and topic cloud images. Figure 2 shows 1- to 3-grams significantly correlated with (a) age (positive; higher age), (b) age (negative; lower age), (c) educator occupation, and (d) technology occupation. This was run over the Blog Authorship Corpus (Schler et al., 2006) packaged with DLATK. Here color represents a word’s frequency in the corpus (grey to red for infrequent to frequent) and size represents correlation strength.
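As a rough illustration of this kind of rendering (DLATK produces its clouds itself; this sketch uses the third-party wordcloud package, and the scaling and coloring choices are assumptions):

```python
from wordcloud import WordCloud  # third-party package, assumed installed

# Size words by correlation strength with the outcome; color by corpus
# frequency (grey = infrequent, red = frequent), roughly as in Figure 2.
correlations = {"grandson": 0.30, "apartment": 0.22, "haha": 0.28, "homework": 0.25}
frequencies = {"grandson": 0.1, "apartment": 0.4, "haha": 0.9, "homework": 0.6}

def freq_color(word, **kwargs):
    # Interpolate from grey (128,128,128) to red (255,0,0) by frequency.
    f = frequencies.get(word, 0.0)
    return f"rgb({int(128 + 127 * f)}, {int(128 * (1 - f))}, {int(128 * (1 - f))})"

cloud = WordCloud(width=600, height=400, background_color="white",
                  color_func=freq_color).generate_from_frequencies(correlations)
cloud.to_file("age_pos_cloud.png")
```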
[Figure 2: 1- to 3-grams correlated with age and occupation class. Panels: (a) Age (positive), (b) Age (negative), (c) Educator, (d) Technology Worker.]

Comparison to social-scientific tools. Traditional programs for text analysis in the social sciences are based on dictionaries (lists of words associated with a particular psychological ‘construct’
or language categories, such as ‘positive emotion’
or references to work and occupational terms).
According to citations, the most popular tool is
Linguistic Inquiry and Word Count (LIWC) (Pen-
nebaker et al.,2015), followed by DICTION
(Hart,1984) and the General Inquirer (Stone et al.,
1966). For a given document, these programs pro-
vide the relative frequency of occurrence of terms
from the dictionaries. The use of dictionaries has the advantage of providing a relatively parsimonious characterization of the language in a given text sample, and the results are in principle comparable across studies. DLATK also reproduces the functionality of these dictionary-based approaches. Dictionaries, however, are often opaque units of analysis, as their overall frequency counts are determined by a few highly frequent words. If these words are ambiguous, interpretations of dictionary-based results can be misleading (Schwartz et al., 2013a). DLATK allows for the determination of the words which drive a given dictionary category, it can produce data-driven lexica based on predictive models over ngrams, and it can even find, within a given dictionary category, the words most associated with a given outcome.
DLATK can provide researchers with enough information to generate hypotheses and clarify the “nomological net” of a construct (Cronbach and Meehl, 1955); that is, to help identify the psychological and social processes and constructs that relate to (are sufficiently correlated with) the outcome under investigation. Further, the fact that DLATK incorporates language features and controls in prediction tests allows the researcher to gauge how much construct-related variance is captured in language compared to meaningful demographic or socioeconomic baselines.
6 Evaluations
DLATK has been used as a data analysis plat-
form in over 30 peer-reviewed publications, with
venues ranging from general-interest journals (PLoS ONE: Schwartz et al., 2013b) to computer science methods proceedings (EMNLP: Sap et al., 2014) to psychology journals (JPSP: Park et al., 2015).
The most straightforward use for DLATK is to provide insight on linguistic features associated with a given outcome, i.e., the differential language analyses presented in Schwartz et al. (2013b). Other works that primarily use DLATK for correlation-type analyses examine outcomes such as age (Kern et al., 2014b), gendered language and stereotypes (Park et al., 2016; Carpenter et al., 2016b), and the efficacy of app-based well-being interventions (Carpenter et al., 2016a).
Another area in which one can evaluate the utility of DLATK is in building predictive models. Table 1 summarizes some predictive models reported in peer-reviewed publications. DLATK can create models at multiple scales, i.e., for predicting aspects of single messages (e.g., tweet-wise temporal orientation; Schwartz et al., 2015), user-level attributes (e.g., severity of depression; Schwartz et al., 2014), or community-level health outcomes (e.g., heart disease mortality; Eichstaedt et al., 2015).
7 Conclusion
DLATK has been under development for over five
years. We have discussed some of its core func-
tionality, including support for extra-linguistic
features, multiple levels of analysis, and contin-
uous variables. However, its biggest benefits may
be flexibility and reliability due to many years of
refinement over dozens of projects. We aspire for
DLATK to serve as a multipurpose Swiss Army
Knife for the researcher who is trying to under-
stand the manifestations of social, psychological
and health factors in the lives of language users.
Outcome                              Score        Source
Demographic (user-level)
  Age                                R = 0.83     Sap et al. (2014)
  Gender                             Acc = 0.92   Sap et al. (2014)
Big-Five Personality (user-level)
  Openness                           R = 0.43     Park et al. (2015)
  Conscientiousness                  R = 0.37     Park et al. (2015)
  Extraversion                       R = 0.42     Park et al. (2015)
  Agreeableness                      R = 0.35     Park et al. (2015)
  Neuroticism                        R = 0.35     Park et al. (2015)
Temporal orientation (message-level)
  3-way classification               Acc = 0.72   Schwartz et al. (2015)
Intensity & affect (message-level)
  Intensity                          R = 0.85     Preoţiuc-Pietro et al. (2016)
  Affect                             R = 0.65     Preoţiuc-Pietro et al. (2016)
Mental health (user-level)
  PTSD                               AUC = 0.86   Preoţiuc-Pietro et al. (2015)
  Depression                         AUC = 0.87   Preoţiuc-Pietro et al. (2015)
  Degree of depression               R = 0.39     Schwartz et al. (2014)
Physical health (US county-level)
  Heart disease mortality            R = 0.42     Eichstaedt et al. (2015)

Table 1: Survey of predictive model scores trained using DLATK in peer-reviewed publications. Scores reported are: R: Pearson correlation; Acc: accuracy; AUC: area under the ROC curve.

Acknowledgments

This work was supported, in part, by the Templeton Religion Trust (grant TRT-0048). DLATK is an open-source project out of the University of Pennsylvania and Stony Brook University. We wish to thank all those who have contributed to its development, including, but not limited to: Youngseo Son, Mohammadzaman Zamani, Sneha Jha, Megha Agrawal, Margaret Kern, Gregory Park, Lukasz Dziuzinski, Phillip Lu, Thomas Apicella, Masoud Rouhizadeh, Daniel Rieman, Selah Lynch and Daniel Preoţiuc-Pietro.
References
Alan Agresti and Barbara Finlay. 2008. Statistical
Methods for the Social Sciences. Allyn & Bacon,
Incorporated.
Yoav Benjamini and Yosef Hochberg. 1995. Control-
ling the false discovery rate: a practical and power-
ful approach to multiple testing. Journal of the royal
statistical society. Series B (Methodological), pages
289–300.
Jordan Carpenter, P. Crutchley, R. D. Zilca, H. A.
Schwartz, L. K. Smith, A. M. Cobb, and A. C. Parks.
2016a. Seeing the “big” picture: Big data methods
for exploring relationships between usage, language,
and outcome in internet intervention data. Journal
of Medical Internet Research, 18(8).
Jordan Carpenter, D. Preot¸iuc-Pietro, L. Flekova,
S. Giorgi, C. Hagan, M. Kern, A. Buffone, L. Ungar,
and M. Seligman. 2016b. Real Men don’t say ’cute’:
Using Automatic Language Analysis to Isolate Inac-
curate Aspects of Stereotypes. Social Psychological
and Personality Science.
Danqi Chen and Christopher D Manning. 2014. A fast
and accurate dependency parser using neural net-
works. In EMNLP.
Glen Coppersmith, Mark Dredze, and Craig Harman.
2014. Quantifying mental health signals in twitter.
ACL 2014, 51.
Lee J Cronbach and Paul E Meehl. 1955. Construct va-
lidity in psychological tests. Psychological bulletin,
52(4):281.
Olive Jean Dunn. 1961. Multiple comparisons among means. Journal of the American Statistical Association, 56(293):52–64.
Johannes C Eichstaedt, H. A. Schwartz, M. L. Kern,
G. Park, D. R. Labarthe, R. M. Merchant, S. Jha,
M. Agrawal, L. A. Dziurzynski, M. Sap, C. Weeg,
E. E. Larson, L. H. Ungar, and M. Seligman. 2015.
Psychological language on Twitter predicts county-
level heart disease mortality. Psychological Science,
26:159–169.
Tom Fawcett. 2006. An introduction to ROC analysis.
Pattern recognition letters, 27(8):861–874.
Kevin Gimpel, Nathan Schneider, Brendan O’Connor,
Dipanjan Das, Daniel Mills, Jacob Eisenstein,
Michael Heilman, Dani Yogatama, Jeffrey Flanigan,
and Noah A Smith. 2011. Part-of-speech tagging
for twitter: Annotation, features, and experiments.
In ACL, pages 42–47.
Roderick P Hart. 1984. Verbal style and the presi-
dency: A computer-based analysis. Academic Pr.
Margaret L Kern, J. C. Eichstaedt, H. A. Schwartz,
L. Dziurzynski, L. H. Ungar, D. J. Stillwell,
M. Kosinski, S. M. Ramones, and M. Seligman.
2014a. The online social self: An open vocabulary
approach to personality. Assessment, 21:158–169.
Margaret L Kern, J. C. Eichstaedt, H. A. Schwartz,
G. Park, L. H. Ungar, D. J. Stillwell, M. Kosinski,
L. Dziurzynski, and M. Seligman. 2014b. From
“sooo excited!!!” to “so proud”: Using language
to study development. Developmental Psychology,
50:178–188.
Edward Loper and Steven Bird. 2002. NLTK: The
natural language toolkit. In Proc. of the ACL-02
Workshop on Effective Tools and Methodologies for
Teaching Natural Language Processing and Com-
putational Linguistics - Volume 1, ETMTNLP ’02,
pages 63–70, Stroudsburg, PA, USA. Association
for Computational Linguistics.
Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.
Wes McKinney. 2011. pandas: a foundational python
library for data analysis and statistics.
Scott Menard. 2002. Applied logistic regression analy-
sis. 106. Sage.
Burt L Monroe, Michael P Colaresi, and Kevin M
Quinn. 2008. Fightin’ words: Lexical feature se-
lection and evaluation for identifying the content of
political conflict. Polit. Anal., 16(4):372–403.
Greg Park, H. A. Schwartz, J. C. Eichstaedt, M. L.
Kern, D. J. Stillwell, M. Kosinski, L. H. Ungar, and
M. Seligman. 2015. Automatic personality assess-
ment through social media language. Journal of Per-
sonality and Social Psychology, 108:934–952.
Gregory Park, D. B. Yaden, H. A. Schwartz, M. L.
Kern, J. C. Eichstaedt, M. Kosinski, D. Stillwell,
L. H. Ungar, and M. Seligman. 2016. Women are
warmer but no less assertive than men: Gender and
language on facebook. PloS one, 11(5):e0155885.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel,
B. Thirion, O. Grisel, M. Blondel, P. Pretten-
hofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Pas-
sos, D. Cournapeau, M. Brucher, M. Perrot, and
E. Duchesnay. 2011. Scikit-learn: Machine learn-
ing in Python. JMLR, 12:2825–2830.
James W Pennebaker, Ryan L Boyd, Kayla Jordan, and
Kate Blackburn. 2015. The development and psy-
chometric properties of LIWC2015. Technical re-
port.
D. Preoţiuc-Pietro, H. A. Schwartz, G. Park, J. Eich-
staedt, M. Kern, L. Ungar, and E. P. Shulman. 2016.
Modelling valence and arousal in facebook posts.
In Proc. of the Workshop on Computational Ap-
proaches to Subjectivity, Sentiment and Social Me-
dia Analysis (WASSA), NAACL.
Daniel Preoţiuc-Pietro, M. Sap, H. A. Schwartz, and
L. H. Ungar. 2015. Mental illness detection at the
World Well-Being Project for the CLPsych 2015
Shared Task. In Proc. of the Workshop on Compu-
tational Linguistics and Clinical Psychology: From
Linguistic Signal to Clinical Reality, NAACL.
C Radhakrishna Rao. 2009. Linear statistical infer-
ence and its applications, volume 22. John Wiley
& Sons.
Maarten Sap, G. Park, J. C. Eichstaedt, M. L. Kern,
D. J. Stillwell, M. Kosinski, L. H. Ungar, and H. A.
Schwartz. 2014. Developing age and gender predic-
tive lexica over social media. In EMNLP.
Jonathan Schler, Moshe Koppel, Shlomo Argamon,
and James W Pennebaker. 2006. Effects of age and
gender on blogging. In AAAI spring symposium.
H Andrew Schwartz, J. Eichstaedt, M. L. Kern,
G. Park, M. Sap, D. Stillwell, M. Kosinski, and
L. Ungar. 2014. Towards assessing changes in de-
gree of depression through Facebook. In Proc. of the
Workshop on Computational Linguistics and Clini-
cal Psychology: From Linguistic Signal to Clinical
Reality, ACL, pages 118–125.
H Andrew Schwartz, J. C. Eichstaedt, L. Dziurzynski,
M. L. Kern, E. Blanco, S. Ramones, M. Seligman,
and L. H. Ungar. 2013a. Choosing the right words:
Characterizing and reducing error of the word count
approach. In *SEM: Conf on Lex and Comp Seman-
tics, pages 296–305.
H Andrew Schwartz, J. C. Eichstaedt, M. L. Kern,
L. Dziurzynski, S. M. Ramones, M. Agrawal,
A. Shah, M. Kosinski, D. Stillwell, M. Seligman,
and L. H. Ungar. 2013b. Personality, gender, and
age in the language of social media: The Open-
Vocabulary approach. PLoS ONE.
H Andrew Schwartz, G. Park, M. Sap, E. Weingarten,
J. Eichstaedt, M. Kern, D. Stillwell, M. Kosinski,
J. Berger, M. Seligman, and L. Ungar. 2015. Ex-
tracting human temporal orientation from Facebook
language. In NAACL.
Skipper Seabold and Josef Perktold. 2010. Statsmod-
els: Econometric and statistical modeling with
python. In 9th Python in Science Conference.
Laura Smith, S. Giorgi, R. Solanki, J. C. Eichstaedt,
H. A. Schwartz, M. Abdul-Mageed, A. Buffone, and
Lyle H. Ungar. 2016. Does ’well-being’ translate on
twitter? In EMNLP.
Philip J Stone, Dexter C Dunphy, and Marshall S
Smith. 1966. The general inquirer: A computer ap-
proach to content analysis.