
Bennett Kleinberg, PhD
Tilburg University | UVT · Department of Methodology and Statistics
86 Publications
129,104 Reads
7,408 Citations
Additional affiliations
September 2018 - present
November 2015 - August 2018
Publications (86)
Counterfeits harm consumers, governments, and intellectual property holders. They accounted for 3.3% of worldwide trades in 2016, having an estimated value of $509 billion in the same year. Estimations in the literature are mostly based on border seizures, but in this paper, we examined openly labeled counterfeits on darknet markets, which allowed...
Spurred by the recent rapid increase in the development and distribution of large language models (LLMs) across industry and academia, much recent work has drawn attention to safety- and security-related threats and vulnerabilities of LLMs, including in the context of potentially criminal activities. Specifically, it has been shown that LLMs can be...
Besides far-reaching public health consequences, the COVID-19 pandemic had a significant psychological impact on people around the world. To gain further insight into this matter, we introduce the Real World Worry Waves Dataset (RW3D). The dataset combines rich open-ended free-text responses with survey data on emotions, significant life events, an...
Large Language Models (LLMs) could enhance access to the legal system. However, empirical research on their effectiveness in conducting legal tasks is scant. We study securities cases involving cryptocurrencies as one of numerous contexts where AI could support the legal process, studying LLMs' legal reasoning and drafting capabilities. We examine...
Two studies tested the hypothesis that a Large Language Model (LLM) can be used to model psychological change following exposure to influential input. The first study tested a generic mode of influence - the Illusory Truth Effect (ITE) - where earlier exposure to a statement (through, for example, rating its interest) boosts a later truthfulness te...
Fraud across the decentralized finance (DeFi) ecosystem is growing, with victims losing billions to DeFi scams every year. However, there is a disconnect between the reported value of these scams and associated legal prosecutions. We use open-source investigative tools to (1) triage Ethereum tokens extracted from the Ethereum blockchain for further...
Large-scale linguistic analyses are increasingly applied to the study of extremism, terrorism, and other threats of violence. At the same time, practitioners working in the field of counterterrorism and security are confronted with large-scale linguistic data, and may benefit from computational methods. This article highlights the challenges and op...
The popularity of online shopping is steadily increasing. At the same time, fake product reviews are published widely and have the potential to affect consumer purchasing behavior. In response, previous work has developed automated methods utilizing natural language processing approaches to detect fake product reviews. However, studies vary conside...
Counterfeits harm consumers, governments, and intellectual property holders. They accounted for 3.3% of worldwide trades in 2016, having an estimated value of $509 billion in the same year. While estimations are mostly based on border seizures, we examined openly labeled counterfeits on darknet markets, which allowed us to gather and analyze inform...
Adversarial examples in NLP are receiving increasing research attention. One line of investigation is the generation of word-level adversarial examples against fine-tuned Transformer models that preserve naturalness and grammaticality. Previous work found that human- and machine-generated adversarial examples are comparable in their naturalness and...
People are regularly confronted with potentially deceptive statements (e.g., fake news, misleading product reviews, or lies about activities). Only a few works on automated text-based deception detection have exploited the potential of deep learning approaches. A critique of deep-learning methods is their lack of interpretability, preventing us from...
The increased use of text data in social science research has benefited from easy-to-access data (e.g., Twitter). That trend comes at the cost of research requiring sensitive but hard-to-share data (e.g., interview data, police reports, electronic health records). We introduce a solution to that stalemate with the open-source text anonymisation sof...
The relation between religiosity and well-being is one of the most researched topics in the psychology of religion, yet the directionality and robustness of the effect remain debated. Here, we adopted a many-analysts approach to assess the robustness of this relation based on a new cross-cultural dataset (N = 10,535 participants from 24 countries...
This chapter seeks to understand lone-actor terrorists through their use of language. Studies examining terrorist and extremist online postings, pre-attack threats, and manifestos are described. The authors specifically focus on efforts in which linguistic analysis is performed automatically, for example, to measure potential warning signs of viole...
Anonymously written threats constitute a special form of worrying behavior, in which the author of a threat decides to hide their identity. Importantly, anonymous threats are an increasingly common issue exacerbated by online communication. Anonymity raises additional challenges for threat assessors, but little is known about how practitioners appr...
The problem of online threats and abuse directed at public figures could potentially be mitigated with a computational approach, where sources of abusive language are better understood or identified through author profiling. However, abusive language constitutes a specific domain of language that is untested on whether differences emerge based on p...
Background: Cryptocurrency fraud has become a growing global concern, with various governments reporting an increase in the frequency of and losses from cryptocurrency scams. Despite increasing fraudulent activity involving cryptocurrencies, the potential of cryptocurrencies for fraud has not been examined in a systematic study. This rev...
The introduction of COVID-19 lockdown measures and an outlook on return to normality are demanding societal changes. Among the most pressing questions is how individuals adjust to the pandemic. This paper examines the emotional responses to the pandemic in a repeated-measures design. Data (n = 1698) were collected in April 2020 (during strict lock...
In the original publication of the article, two errors were found.
Research shows that natural language processing models are generally considered to be vulnerable to adversarial attacks; but recent work has drawn attention to the issue of validating these adversarial inputs against certain criteria (e.g., the preservation of semantics and grammaticality). Enforcing constraints to uphold such criteria may render a...
The introduction of COVID-19 lockdown measures and an outlook on return to normality are demanding societal changes. Among the most pressing questions is how individuals adjust to the pandemic. This paper examines the emotional responses to the pandemic in a repeated-measures design. Data (n=1698) were collected in April 2020 (during strict lockdow...
In this crowdsourced initiative, independent analysts used the same dataset to test two hypotheses regarding the effects of scientists' gender and professional status on verbosity during group meetings. Not only the analytic approach but also the operationalizations of key variables were left unconstrained and up to individual analysts. For instanc...
Purpose: Truthful statements are theorized to be richer in perceptual and contextual detail than deceptive statements. The level of detail can be coded by humans or computers, with human coding argued to be superior. Direct comparisons of human and automated coding, however, are rare. Methods: We applied automatic identification of details with the...
The increased threat of right-wing extremist violence necessitates a better understanding of online extremism. Radical message boards, small-scale social media platforms, and other internet fringes have been reported to fuel hatred. The current paper examines data from the right-wing forum Stormfront between 2001 and 2015. We specifically aim to un...
The media frequently describes the 2017 Charlottesville ‘Unite the Right’ rally as a turning point for the alt-right and white supremacist movements. Social movement theory suggests that the media attention and public discourse concerning the rally may have engendered changes in social identity performance and visibility of the alt-right, but this...
This paper introduces the Grievance Dictionary, a psycholinguistic dictionary that can be used to automatically understand language use in the context of grievance-fueled violence threat assessment. We describe the development of the dictionary, which was informed by suggestions from experienced threat assessment practitioners. These suggestions an...
For sensitive text data to be shared among NLP researchers and practitioners, shared documents need to comply with data protection and privacy laws. There is hence a growing interest in automated approaches for text anonymization. However, measuring such methods' performance is challenging: missing a single identifying attribute can reveal an indiv...
Background: Deception detection is a prevalent problem for security practitioners. With a need for more large-scale approaches, automated methods using machine learning have gained traction. However, detection performance still implies considerable error rates. Findings from different domains suggest that hybrid human-machine integrations could offe...
Among the critical challenges around the COVID-19 pandemic is dealing with the potentially detrimental effects on people’s mental health. Designing appropriate interventions and identifying the concerns of those most at risk requires methods that can extract worries, concerns and emotional responses from text data. We examine gender differences and...
This paper introduces the Grievance Dictionary, a psycholinguistic dictionary which can be used to automatically understand language use in the context of grievance-fuelled violence threat assessment. We describe the development of the dictionary, which was informed by suggestions from experienced threat assessment practitioners. These suggestions and...
The problem of online threats and abuse could potentially be mitigated with a computational approach, where sources of abuse are better understood or identified through author profiling. However, abusive language constitutes a specific domain of language for which it has not yet been tested whether differences emerge based on a text author's person...
Text data are being used as a lens through which human cognition can be studied at a large scale. Methods like emotion analysis are now in the standard toolkit of computational social scientists but typically rely on third-person annotation with unknown validity. As an alternative, this paper introduces online emotion induction techniques from expe...
Among the critical challenges around the COVID-19 pandemic is dealing with potentially detrimental effects on people's mental health. Designing appropriate interventions and identifying the concerns of those most at risk requires methods that can extract worries, concerns and emotional responses from text data. We examine gender differences and the...
While recent efforts have shown that neural text processing models are vulnerable to adversarial examples, comparatively little attention has been paid to explicitly characterize their effectiveness. To overcome this, we present analytical insights into the word frequency characteristics of word-level adversarial examples for neural text classifica...
The current policy of removing drill music videos from social media platforms such as YouTube remains controversial because it risks conflating the co-occurrence of drill rap and violence with a causal chain of the two. Empirically, we revisit the question of whether there is evidence to support the conjecture that drill music and gang violence are...
The COVID-19 pandemic is having a dramatic impact on societies and economies around the world. With various measures of lockdowns and social distancing in place, it becomes important to understand emotional responses on a large scale. In this paper, we present the first ground truth dataset of emotional responses to COVID-19. We asked participants...
Background: Deception detection is a prevalent problem for security practitioners. With a need for more large-scale approaches, automated methods using machine learning have gained traction. However, detection performance still implies considerable error rates. Findings from other domains suggest that hybrid human-machine integrations could offer a...
Despite considerable concern about how human trafficking offenders may use the Internet to recruit their victims, arrange logistics or advertise services, the Internet-trafficking nexus remains unclear. This study explored the prevalence and correlates of a set of commonly-used indicators of labour trafficking in online job advertisements. Taking a...
The Response Time-Based Concealed Information Test (RT-CIT) can reveal when a person recognizes a relevant (probe) item among other, irrelevant items, based on comparatively slower responding to the probe item. Thereby, if a person is concealing the knowledge about the relevance of this item (e.g., recognizing it as a murder weapon), this deception...
The Response Time-Based Concealed Information Test (RT-CIT) can reveal when a person recognizes a relevant ('probe') item among other, irrelevant items, based on comparatively slower responding to the probe item. Thereby, if a person is concealing the knowledge about the relevance of this item (e.g., recognizing it as a murder weapon), this decepti...
This paper presents how techniques from natural language processing can be used to examine the sentiment trajectories of gang-related drill music in the United Kingdom (UK). This work is important because key public figures are loosely making controversial linkages between drill music and recent escalations in youth violence in London. Thus, this p...
The media frequently describes the 2017 Charlottesville 'Unite the Right' rally as a turning point for the alt-right and white supremacist movements. Related research into social movements also suggests that the media attention and public discourse concerning the rally may have influenced the alt-right. Empirical evidence for these claims is largel...
Purpose: Verbal credibility assessments examine language differences to tell truthful from deceptive statements (e.g., of allegations of child sexual abuse). The dominant approach in psycholegal deception research to date (used in 81% of recent studies that report on accuracy) to estimate the accuracy of a method is to find the optimal statistical...
Social media and tech companies face the challenge of identifying and removing terrorist and extremist content from their platforms. This paper presents the findings of a series of interviews with Global Internet Forum to Counter Terrorism (GIFCT) partner companies and law enforcement Internet Referral Units (IRUs). It offers a unique view on curre...
News consumption exhibits an increasing shift towards online sources, which bring platforms such as YouTube more into focus. Thus, the distribution of politically loaded news is easier, receives more attention, but also raises the concern of forming isolated ideological communities. Understanding how such news is communicated and received is becom...
Purpose: Verbal credibility assessments examine language to discern lies from the truth. These tests are used for the scientific study of the language of lies in US Presidential candidates and fraudulent scientists, but also in criminal proceedings for evaluating allegations of child sexual abuse. The dominant approach in psycholegal deception resea...
Several research lines attempted to tell truthful from deceptive texts by looking at the concreteness in language as an indicator of truthfulness. We identified eight different operationalizations of concreteness for computer-automated analysis and validated these operationalizations on six diverse datasets containing truthful and deceptive texts (...
The proliferation of misleading information in everyday access media outlets such as social media feeds, news blogs, and online newspapers has made it challenging to identify trustworthy news sources, thus increasing the need for computational tools able to provide insights into the reliability of online content. In this paper, we focus on the aut...
Pump-and-dump schemes are fraudulent price manipulations through the spread of misinformation and have been around in economic settings since at least the 1700s. With new technologies around cryptocurrency trading, the problem has intensified to a shorter time scale and broader scope. The scientific literature on cryptocurrency pump-and-du...
Vlogs provide a rich public source of data in a novel setting. This paper examined the continuous sentiment styles employed in 27,333 vlogs using a dynamic intra-textual approach to sentiment analysis. Using unsupervised clustering, we identified seven distinct continuous sentiment trajectories characterized by fluctuations of sentiment throughout...
Verbal deception detection has gained momentum as a technique to tell truth‐tellers from liars. At the same time, researchers' degrees of freedom make it hard to assess the robustness of effects. Replication research can help evaluate how reproducible an effect is. We present the first replication in verbal deception research whereby ferry passenge...
Recently, verbal credibility assessment has been extended to the detection of deceptive intentions, the use of a model statement, and predictive modeling. The current investigation combines these 3 elements to detect deceptive intentions on a large scale. Participants read a model statement and wrote a truthful or deceptive statement about their p...
There is an increasing demand for deception detection at scale. In situations in which larger numbers of people need to be tested, traditional deception-detection methods are limited because they often require extensive testing sessions or are limited in their flexibility to novel contexts. The aim of this chapter is to discuss the potential for la...
When embedded among a number of plausible irrelevant options, the presentation of critical (e.g., crime-related or autobiographical) information is associated with a marked increase in response time (RT). This RT effect crucially depends on the inclusion of a target/non-target discrimination task with targets being a dedicated set of items that req...
Background: Academic research on deception detection has largely focused on the detection of past events. For many applied purposes, however, the detection of false reports about someone’s intention merits attention. Based on the verbal deception detection paradigm, we explored whether true statements on intentions were more detailed and more specif...
There is an increasing demand for automated verbal deception detection systems. We propose named entity recognition (NER; i.e., the automatic identification and extraction of information from text) to model three established theoretical principles: (i) truth tellers provide accounts that are richer in detail, (ii) contain more contextual references...
Background: The shift towards open science implies that researchers should share their data. Often there is a dilemma between publicly sharing data and protecting their subjects' confidentiality. Moreover, the case of unstructured text data (e.g. stories) poses an additional dilemma: anonymizing texts without deteriorating their content for second...
General Audience Summary: The Concealed Information Test (CIT) assesses recognition of concealed information, for instance about a crime or about one’s true identity. For this purpose, the CIT initially relied upon physiological responses recorded with a polygraph. Nowadays, administration can be done more easily, through the recording of reaction t...
There is an increasing demand for automated verbal deception detection systems. We propose named entity recognition (NER; i.e., the automatic identification and extraction of information from text) based on three established theoretical principles: (i) truth-tellers provide accounts that are richer in detail, (ii) contain more contextual references...
Background: Academic research on deception detection has largely focused on the detection of past events. For many applied purposes, however, the detection of false reports about someone's intention merits attention. Based on the verbal deception detection paradigm, we explored whether true statements on intentions were more detailed and more speci...