ArticlePDF Available

GPT-4V(ision) as A Social Media Analysis Engine

Authors:

Abstract

Recent research has shed light on the capabilities of Large Multimodal Models (LMMs) across various general vision and language tasks. The performance of LMMs in specialized domains, such as social media, which integrates text, images, videos, and sometimes audio, remains an area of active interest. Effective analysis of such content requires models to interpret the complex interactions between different communication modalities and their influence on the conveyed message. This paper explores GPT-4V(ision)’s performance in social multimedia analysis. We evaluate GPT-4V across five representative tasks: sentiment analysis, hate speech detection, fake news identification, demographic inference, and political ideology detection. Our approach includes a preliminary quantitative analysis for each task using existing benchmark datasets, followed by a review of the results and a selection of qualitative samples to demonstrate GPT-4V’s performance in multimodal social media content analysis. GPT-4V shows effectiveness in these tasks, exhibiting capabilities like joint image-text understanding, contextual and cultural awareness, and commonsense knowledge application. However, challenges persist, including struggles with multilingual social multimedia comprehension and difficulty in adapting to the latest social media trends. It also sometimes generates incorrect information about evolving knowledge of celebrities and politicians. This preliminary study aims to inform further research across disciplines, particularly in computational social science and social media studies. The findings highlight the potential of LMMs to enhance our understanding of social media content and its users through multimodal analysis. All images and prompts used in this study will be available at https://github.com/VIStA-H/GPT-4V_Social_Media . Disclaimer: This paper contains some examples of offensive social media content. Reader discretion is advised.
GPT-4V(ision) as A Social Media Analysis Engine
Hanjia Lyu1,Jinfa Huang1,Daoan Zhang1,Yongsheng Yu1,Xinyi Mou2,
Jinsheng Pan1Zhengyuan Yang1Zhongyu Wei2Jiebo Luo1
1University of Rochester 2Fudan University
Core Contributor
Abstract
Recent research has offered insights into the extraordinary capabilities of Large
Multimodal Models (LMMs) in various general vision and language tasks. There
is growing interest in how LMMs perform in more specialized domains. Social
media content, inherently multimodal, blends text, images, videos, and sometimes
audio. To effectively understand such content, models need to interpret the intricate
interactions between these diverse communication modalities and their impact
on the conveyed message. Understanding social multimedia content remains a
challenging problem for contemporary machine learning frameworks. In this paper,
we explore GPT-4V(ision)’s capabilities for social multimedia analysis. We select
five representative tasks, including sentiment analysis, hate speech detection, fake
news identification, demographic inference, and political ideology detection, to
evaluate GPT-4V. Our investigation begins with a preliminary quantitative analysis
for each task using existing benchmark datasets, followed by a careful review of
the results and a selection of qualitative samples that illustrate GPT-4V’s potential
in understanding multimodal social media content. GPT-4V demonstrates remark-
able efficacy in these tasks, showcasing strengths such as joint understanding of
image-text pairs, contextual and cultural awareness, and extensive commonsense
knowledge. Despite the overall impressive capacity of GPT-4V in the social media
domain, there remain notable challenges. GPT-4V struggles with tasks involving
multilingual social multimedia comprehension and has difficulties in generaliz-
ing to the latest trends in social media. Additionally, it exhibits a tendency to
generate erroneous information in the context of evolving celebrity and politician
knowledge, reflecting the known hallucination problem. Our hope is that this
preliminary study will provide insights into further research across disciplines,
particularly in computational social science and social media-related studies. The
insights gleaned from our findings underscore a promising future for LMMs in
enhancing our comprehension of social media content and its users through the
analysis of multimodal information. All images and prompts used in this report
will be available at https://github.com/VIStA-H/GPT-4V_Social_Media.1
Disclaimer: This paper contains some examples of offensive social media
content. Reader discretion is advised.
1
Note that we conducted our experiments and analysis before November 6
th
, 2023 via the official ChatGPT
webpage.
arXiv:2311.07547v1 [cs.CV] 13 Nov 2023
Contents
1 Introduction 4
2 Experiments for Selected Social Multimedia Analysis Tasks 6
2.1 SentimentAnalysis .................................. 6
2.1.1 Task Setting and Preliminary Quantitative Results . . . . . . . . . . . . . . 6
2.1.2 Sentiment-Infused Caption Generation and Intepretation . . . . . . . . . . 7
2.1.3 Image-Text Sentiment Correlation . . . . . . . . . . . . . . . . . . . . . . 7
2.1.4 Contextual Acuity in Sentiment Understanding . . . . . . . . . . . . . . . 7
2.1.5 Fine-grained Sentiment Assessment . . . . . . . . . . . . . . . . . . . . . 7
2.2 HateSpeechDetection ................................ 12
2.2.1 Task Setting and Preliminary Quantitative Results . . . . . . . . . . . . . . 12
2.2.2 Hate Speech Detection with Cultural Insights . . . . . . . . . . . . . . . . 12
2.2.3 Hatred Decoding in Seemingly Neutral Image-Text Pairs . . . . . . . . . . 13
2.2.4 Nuanced Assessment of Potential Hate Speech . . . . . . . . . . . . . . . 13
2.2.5 Slang and Subtext Recognition . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 FakeNewsIdentication ............................... 21
2.3.1 Task Setting and Preliminary Quantitative Results . . . . . . . . . . . . . . 21
2.3.2 Tone and Language Analysis for Authenticity . . . . . . . . . . . . . . . . 21
2.3.3 Celebrity Knowledge Inquiry . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.4 Source Credibility Profiling . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.5 Cross-Referencing from Broad Information Spectrum . . . . . . . . . . . . 21
2.4 DemographicInference................................ 27
2.4.1 Task Setting and Preliminary Quantitative Results . . . . . . . . . . . . . . 27
2.4.2 Gender Clue Analysis in Textual Narratives . . . . . . . . . . . . . . . . . 27
2.4.3 Ambiguous Signal Interpretation for Gender Prediction . . . . . . . . . . . 27
2.4.4 Acknowledgment for Diversity and Complexity of Gender Identity . . . . . 27
2.5 IdeologyDetection .................................. 32
2.5.1 Task Setting and Preliminary Quantitative Results . . . . . . . . . . . . . . 32
2.5.2 Text-Centric Political Ideology Assessment . . . . . . . . . . . . . . . . . 32
2.5.3 Comprehensive Political Domain Knowledge . . . . . . . . . . . . . . . . 32
2.5.4 Ideological Deductions from Visual Subtleties . . . . . . . . . . . . . . . . 32
3 Challenges and Opportunities 42
3.1 Multilingual Social Multimedia Understanding . . . . . . . . . . . . . . . . . . . 42
3.2 Generalization of GPT-4V for Emerging Trends . . . . . . . . . . . . . . . . . . . 42
3.3 Hallucination from Out-dated Knowledge . . . . . . . . . . . . . . . . . . . . . . 42
3.4 Advocating for Novel Benchmark Datasets for LMMs . . . . . . . . . . . . . . . . 50
4 Conclusions 50
2
List of Figures
1 Section 1: social multimedia analysis task description. . . . . . . . . . . . . . . . 5
2 Section 2: overview of emerging properties of social multimedia analysis tasks. . . 6
3 Section 2.1: sentiment-infused caption generation and interpretation. . . . . . . . . 8
4 Section 2.1: image-text sentiment correlation. . . . . . . . . . . . . . . . . . . . . 9
5 Section 2.1: contextual acuity in sentiment understandng. . . . . . . . . . . . . . . 10
6 Section 2.1: fine-grained sentiment assessment. . . . . . . . . . . . . . . . . . . . 11
7 Section 2.2: prompt example for the quantitative hate speech detection experiment. 12
8 Section 2.2: hate speech detection with cultural insights. . . . . . . . . . . . . . . 14
9 Section 2.2: hatred decoding in seemingly neutral image-text pairs example 1. . . . 15
10 Section 2.2: hatred decoding in seemingly neutral image-text pairs example 2. . . . 16
11 Section 2.2: hatred decoding in seemingly neutral image-text pairs example 3. . . . 17
12 Section 2.2: nuanced assessment of potential hate speech. . . . . . . . . . . . . . . 18
13 Section 2.2: slang and subtext recognition example 1. . . . . . . . . . . . . . . . . 19
14 Section 2.2: slang and subtext recognition example 2. . . . . . . . . . . . . . . . . 20
15 Section 2.3: tone and language analysis. . . . . . . . . . . . . . . . . . . . . . . . 22
16 Section 2.3: celebrity knowledge inquiry. . . . . . . . . . . . . . . . . . . . . . . 23
17 Section 2.3: source credibility profiling. . . . . . . . . . . . . . . . . . . . . . . . 24
18 Section 2.3: cross-referencing from broad information spectrum example 1. . . . . 25
19 Section 2.3: cross-referencing from broad information spectrum example 2. . . . . 26
20 Section 2.4: gender clue analysis in textual narratives example 1. . . . . . . . . . . 28
21 Section 2.4: gender clue analysis in textual narratives example 2. . . . . . . . . . . 29
22 Section 2.4: ambiguous signal interpretation. . . . . . . . . . . . . . . . . . . . . 30
23 Section 2.4: acknowledgment for diversity and complexity. . . . . . . . . . . . . . 31
24 Section 2.5: text-centric assessment. . . . . . . . . . . . . . . . . . . . . . . . . . 34
25 Section 2.5: comprehensive political domain knowledge example 1. . . . . . . . . 35
26 Section 2.5: comprehensive political domain knowledge example 2. . . . . . . . . 36
27 Section 2.5: comprehensive political domain knowledge example 3. . . . . . . . . 37
28 Section 2.5: comprehensive political domain knowledge example 4. . . . . . . . . 38
29 Section 2.5: ideological deductions from vision subtleties example 1. . . . . . . . . 39
30 Section 2.5: ideological deductions from vision subtleties example 2. . . . . . . . . 40
31 Section 2.5: ideological deductions from vision subtleties example 3. . . . . . . . . 41
32 Section 3.1: multilingual social multimedia understanding example 1. . . . . . . . 43
33 Section 3.1: multilingual social multimedia understanding example 2. . . . . . . . 44
34 Section 3.2: generalization for emerging trends example 1. . . . . . . . . . . . . . 45
35 Section 3.2: generalization for emerging trends example 2. . . . . . . . . . . . . . 46
36 Section 3.2: generalization for emerging trends example 3. . . . . . . . . . . . . . 47
37 Section 3.2: generalization for emerging trends example 4. . . . . . . . . . . . . . 48
38 Section 3.3: hallucination assessment. . . . . . . . . . . . . . . . . . . . . . . . . 49
3
1 Introduction
Motivation and Overview. Social multimedia analysis encompasses a diverse range of tasks, each
characterized by its inherent complexity and nuanced characteristics [
4
,
93
,
32
,
74
,
69
,
15
,
38
,
64
,
58
,
88
]. This field is distinguished by its multimodal nature, where text, images, videos, and other
media forms intertwine to convey rich, multifaceted messages. The tasks within this domain, such as
sentiment analysis, hate speech detection, and fake news identification, are not only about parsing
textual content but also about deciphering the subtleties conveyed through visual cues. The interplay
of various forms of data adds layers of context, requiring an advanced level of comprehension that
transcends traditional unimodal analysis. Additionally, the rapid pace of social media evolution
introduces constant variability, with new slang, symbols, and trends emerging continuously, making
the analysis ever more challenging.
In this intricate milieu, large multimodal models like GPT-4V(ision) [
60
,
86
,
99
] represent a leap for-
ward in artificial intelligence. Their ability to process and interpret complex, multimodal information
makes them particularly suited for the demands of social multimedia analysis. Recent research has
undertaken an extensive evaluation of GPT-4V’s capabilities across a spectrum of vision and language
tasks [
86
,
81
,
98
,
99
,
16
,
71
,
39
,
42
,
40
,
87
,
37
,
79
,
7
,
103
,
27
]. These investigations have shed
valuable light on the potential of Large Multimodal Models (LMMs) like GPT-4V in the domains
of both vision and language. However, the broader application of these techniques for societal
benefit remains a question. Social media-related studies have the potential to enhance social good
by addressing harmful content, fostering positive behavior, and bolstering public safety. They can
also enable community building, improved communication, and the empowerment of marginalized
groups, leading to a more inclusive and informed society. Assessing the effectiveness of LMMs like
GPT-4V in comprehending and responding to social media nuances is pivotal for several reasons.
First, it helps assess the model’s capability to handle the diversity and complexity of real-world data,
providing insights into its practical applicability. Second, such evaluation sheds light on the model’s
understanding of cultural and contextual nuances, which are vital for accurate analysis in a globally
connected, digital world. Lastly, it aids in identifying areas where these models excel and where they
need improvement, guiding future advancements in AI and machine learning.
Our study aims to contribute to the understanding of how effectively GPT-4V can navigate the
nuanced domain of social multimedia analysis, highlighting its strengths and areas for development.
Considering the unique characteristics of multimodal social media content, our exploration of GPT-
4V’s capabilities for social multimedia analysis is guided by the following questions:
1.
How effectively does GPT-4V understand the multimodal content on social media plat-
forms? Social media data is inherently multimodal, which requires sophisticated models
that can process and relate information from these different modalities. GPT-4V exhibits a
remarkable capability to understand the visual and textual elements jointly. It demonstrates
its aptitude for explaining the connections between the images and text frequently found
in social media posts, and how these elements can collaboratively serve a multitude of
analytical purposes.
2.
Does GPT-4V demonstrate contextual awareness in its interpretation of social media mul-
timedia content? The meaning of content on social media is often context-dependent.
Effective analysis requires understanding the context in which content is produced and
consumed. Our observations suggest that GPT-4V possesses the capability to understand
context and subtleties within multimodal social media content such as memes, puns, and
misspellings, etc. This ability paves the way for its application across various domains,
including but not limited to, the automatic detection of sarcasm [
31
] and the study of visual
persuasion [30].
3.
What is GPT-4V’s performance when analyzing novel content? Social media language
and content trends are constantly changing. Analysis methods should be able to adapt to
new knowledge, slang, memes, and communication styles. GPT-4V, while adept in various
analytical tasks, still faces challenges in accurately assessing areas like ideology detection
and fake news identification, particularly when these require insights into knowledge or
contexts beyond its training data.
4.
How does GPT-4V navigate the complexities of language and cultural diversity in its analysis
of multimodal social media content? Social media platforms are used globally, necessitating
4
Hate speech? Ideology?
GPT-4V:
… mimic a colloquial or pejorative pronunciation
of "immigrants," which could be interpreted as
mocking or disparaging.
…, implying a more left-leaning perspective.
GPT-4V:
… evoke emotions such as warmth, happiness, or
compassion in viewers.
… This statement could be false or used
satirically…
Sentiment? Fake News?
okok
okok
(a) GPT-4V for Social Media in Daily life.
(b) The Pipeline of GPT-4V for Five Social Media Analysis Tasks.
Figure 1: In this study, we carefully select five typical social multimedia analysis tasks including
sentiment analysis, hateful speech detection, fake news identification, demographic inference, and
ideology detection. We adopt GPT-4V as a unified framework with different prompts (e.g., What
sentiment does this combination convey?) to explore the GPT-4V’s ability for social multimedia.
models that can handle multiple languages and cultural contexts. GPT-4V, in its current state,
demonstrates limitations when it comes to managing the intricacies involved in language and
cultural heterogeneity in its evaluations of multimodal social media content. Its proficiency
is notably centered around English, with less effective understanding of other languages.
Furthermore, the model exhibits a lack of comprehensive cultural sensitivity, which is
essential for navigating the complexities of global social media communications effectively.
Our Approach in Exploring GPT-4V(ison). Social multimedia analysis includes various tasks
that involve the extraction, interpretation, and classification of information from content shared
on social media platforms. In this report, we focus on tasks that are pivotal in this area, each of
which is associated with its own set of quantitative benchmark datasets. These tasks encompass
sentiment analysis, hate speech detection, fake news identification, demographic inference, and
ideology detection (summarized in Figure 1). For each of these tasks, we draw samples from two to
three well-established benchmark datasets to present preliminary quantitative results and qualitative
insights into the performance of GPT-4V in the context of social multimedia analysis. Figure 2
provides an overview of the emerging properties identified in GPT-4V within the social multimedia
analysis area. As highlighted by Yang et al. [
86
], it has become apparent that some of the existing
benchmarks may no longer effectively evaluate LMMs. In the course of our exploration, we have
encountered similar challenges such as GPT-4V’s propensity for memorization during training within
5
GPT-4V(ison) as A Social
Media Analysis Engine
Sentiment Analysis
Sentiment-Infused Caption Generation and Interpretation
Image-Text Sentiment Correlation
Contextual Acuity in Sentiment Understanding
Fine-Grained Sentiment Assessment
Hate Speech Detection
Hate Speech Detection with Cultural Insights
Hatred Decoding in Seemingly Neutral Image-Text Pairs
Nuanced Assessment of Potential Hate Speech
Slang and Subtext Recognition
Fake News Identification
Tone and Language Analysis for Authenticity
Celebrity Knowledge Inquiry
Source Credibility Profiling
Cross-Referencing from Broad Information Spectrum
Demographic Inference
Gender Clue Analysis in Textual Narratives
Ambiguous Signal Interpretation for Gender Prediction
Acknowledgment for Diversity and Complexity of Gender Identity
Ideology Detection
Text-centric Political Ideology Assessment
Comprehensive Political Domain Knowledge
Ideological Deductions from Visual Subtleties
Figure 2: Overview of the emerging properties of GPT-4V for social multimedia analysis tasks.
the context of social media analysis. Therefore, in alignment with Yang et al. ’s approach, we
primarily rely on qualitative results to provide preliminary insights into the potential capabilities of
GPT-4V in the domain of social multimedia analysis. Moreover, we have intentionally included a mix
of examples, drawing from both existing datasets and newly designed challenging scenarios aimed at
evaluating LMMs’ performance.
In Section 2, we present our exploration of GPT-4V on the five tasks. In particular, analysis on
sentiment analysis, hate speech detection, fake news identification, demographic inference, and
ideology detection are discussed in Sections 2.1, 2.2, 2.3, 2.4, and 2.5, respectively. In Section 3, we
discuss challenges and potential future opportunities. Finally, we conclude our report in Section 4.
Ethical Consideration. This report evaluates the capabilities of GPT-4V across five social multime-
dia analysis tasks: sentiment analysis, hate speech detection, fake news identification, demographic
inference, and ideology detection. Please be advised that some sections of this report may include
content that readers find offensive or sensitive. This is an inherent aspect of analyzing real-world
social media data, especially when dealing with topics such as hate speech and fake news. The
intent is purely academic and informational, with no aim to offend or harm. In our commitment to
privacy and ethical standards, we have taken careful measures to anonymize the data. To protect
human facial privacy, we have applied masks to obscure the identities of all individuals during figure
visualizations, except for public figures and celebrities. This anonymization is done to respect the
privacy of individuals and to adhere to common ethical guidelines in data handling and reporting.
The findings and conclusions within this report are based on the data available and the capabilities
of the GPT-4V model as of the date of this publication, which intends to contribute to the ongoing
discourse in the field of AI and social media analysis and does not necessarily reflect the full scope of
complexities involved in real-world scenarios.
2 Experiments for Selected Social Multimedia Analysis Tasks
2.1 Sentiment Analysis
2.1.1 Task Setting and Preliminary Quantitative Results
Sentiment analysis aims to discern the sentiments and emotions conveyed in content related to a
specific entity [
52
]. It has emerged as a fundamental technique in social media research [
44
,
10
,
84
,
100
,
48
,
43
]. Unlike traditional unimodal sentiment analysis, which may fall short in capturing
6
nuanced opinions [
70
], multimodal sentiment analysis integrates information from diverse modalities
to infer expressed sentiments and emotions [
93
,
95
,
91
,
92
,
8
,
104
,
73
,
23
,
9
]. To quantitatively
evaluate GPT-4V’s ability in multimodal sentiment analysis, we input a pair of image and text from
Twitter posts. The prompt is This image is associated with the following caption: ‘{
caption
}’.
What sentiment does this combination convey? Positive, neutral, or negative? This is for research
purposes. From the
MVSA-Single
and
MVSA-Multiple
datasets [
85
], approximately 1,000 posts
are sampled, with an even distribution of 500 from each. In the preliminary evaluation, GPT-4V
achieves an accuracy of 68.4% on the
MVSA-Single
dataset and 71.6% on the
MVSA-Multiple
datasets. Figures 3 to 6 display qualitative results.
2.1.2 Sentiment-Infused Caption Generation and Intepretation
Visual sentiment analysis focuses on analyzing emotions and sentiments conveyed through visual
content, while textual sentiment analysis deals with extracting sentiment and emotional information
from text data. Even within a single modality, assessing sentiment presents significant challenges
due to various factors such as subjectivity, ambiguity, cultural variability, and limited semantic
comprehension [
97
,
104
,
8
,
73
,
3
,
11
]. In Figure 3, we observe that GPT-4V adeptly describes the
contents of the image (e.g., “... depicts a character holding a paper with ‘End’ written on it ...) and
interprets the original caption (e.g., “... implies that they have been adopted and are transitioning to
a more stable and loving environment ...) with regard to sentiment (e.g., “... suggesting trust and
comfort ...).
2.1.3 Image-Text Sentiment Correlation
Combining information from different modalities is a challenging task as different modalities may
have varying degrees of influence on sentiment [
97
,
8
]. Beyond the generation of sentiment-focused
image descriptions and caption interpretations, GPT-4V goes a step further by explicitly explaining
the relationship between the image and text in the context of sentiment analysis. For instance, it
articulates how one modality can accentuate the sentiment conveyed by another (see Figure 4).
Moreover, there can be divergent sentiment polarities among different modalities; for instance, a piece
of text may express negativity while its accompanying visual element conveys positivity [
26
]. As
illustrated in the fourth example in Figure 4, GPT-4V shows its capability to identify such contrasting
sentiments in image-text social media posts.
2.1.4 Contextual Acuity in Sentiment Understanding
Sentiments and emotions are profoundly influenced by culture. What might be regarded as a positive
or negative emotion in one culture could be interpreted differently in another. Additionally, the
sentiment expressed in social media posts often hinges on the broader contextual backdrop. However,
many studies in sentiment analysis tend to focus primarily on the content of social media while
overlooking its contextual and cultural dimensions. Neglecting these crucial aspects can complicate
the determination of accurate polarity [
8
]. Impressively, we have observed that GPT-4V has the
capacity to generate prompts that exhibit cultural awareness (e.g., “... the thematic elements of
heartbreak and revenge often found in country songs ...) and maintain contextual relevance within
the context of social media (e.g., “The use of ‘D’awe’ further emphasizes a feeling of warmth
and gratitude.). This finding is consistent with prior research, which has demonstrated that large
language models have the capacity to enhance contextual information [45].
2.1.5 Fine-grained Sentiment Assessment
While basic sentiment analysis typically categorizes image-text pairs into broad sentiments like
positive, negative, or neutral, exploring more nuanced sentiment distinctions (such as differentiating
between varying degrees of positivity or negativity) is both challenging and often indispensable for
both textual and visual content [
94
]. More importantly, social media posts also often employ sarcasm
and irony for sentiment expression. These modes of expression often rely on intricate context and
subtleties that machines find challenging to grasp [
8
,
104
,
73
,
34
]. However, our findings indicate
that GPT-4V possesses an impressive capability to discern sarcasm and predict nuanced levels of
positivity and negativity.
7
Prompt:
This image is associated with the following caption: “RT
@USER: Shelter Dogs on Their Way Home For The First
Time -”. What sentiment does this combination convey?
Positive, neutral or negative?
GPT-4V:
The sentiment conveyed by the combination of the image
and the caption is positive. The image shows a dog, seemly
in a car, placing its paw in a human’s hand, suggesting trust
and comfort. The caption indicates that shelter dogs are on
their way home for the first time, which implies that they have
been adopted and are transitioning to a more stable and
loving environment.
Prompt:
This image is associated with the following caption: “RT
@USER: 2 hours down , 6 to go ! Learning to ski is
knackering but so much fun ? @USER #ski #olympia2024
#aching #smiles htt …”. What sentiment does this
combination convey? Positive, neutral or negative?
GPT-4V:
The sentiment conveyed by the combination of the image
and the caption is positive. The caption indicates
enthusiasm and excitement about learning to ski, even
though it is physically tiring. The hashtags like “#smiles”
further reinforce the positive sentiment. The image shows
ski boots, suggesting that the person is actively engaging
in the skiing activity.
Prompt:
This image is associated with the following caption: “My
excited Saturday work face. Is it 5 o’clock yet?”. What
sentiment does this combination convey? Positive,
neutral or negative?
GPT-4V:
The sentiment conveyed by the combination of the image
and the caption is negative. The person in the image
appears to have a serious expression, and the caption
expresses a desire for the workday to end, suggesting a
lack of enthusiasm for working on a Saturday.
Prompt:
This image is associated with the following caption: “RT
@USER 88infinito888: not finished ? #art #arte #artwork
#artcolor #artist #boligrafo #boceto #colour #creative …
HTTPURL …”. What sentiment does this combination
convey? Positive, neutral or negative?
GPT-4V:
The sentiment conveyed by the combination of the image
and the caption is neutral to positive. The image depicts a
character holding a paper with “End” written on it,
possibly indicating the end of a story or project. The
caption, however, mentions “not finished,” which
contrasts with the image’s message. The hashtags
emphasize the artistic nature of the work and suggest that
it’s a creative project. The overall sentiment seems to lean
more towards an appreciation or showcase of art, with a
hint of incompletion or work in progress.
Sentiment-Infused Caption Generation and Interpretation
Figure 3: Qualitative results on sentiment-infused caption generation and interpretation. GPT-4V
is able to describe the image contents and interpret the original caption, with a particular emphasis
on capturing the conveyed sentiment. The
image descriptions
and
caption interpretations
are
highlighted. Refer to Section 2.1 for detailed discussions.
8
Prompt:
This image is associated with the following caption: “RT
@USER: RT @USER: Anyone missing a cat in #Stokey?
This bloke running around #Benthal, seems a little
distressed. HTTPURL:/ …”. What sentiment does this
combination convey? Positive, neutral or negative?
GPT-4V:
The sentiment conveyed by the combination of the image
and the caption is negative. The caption mentions that cat
seeming “a little distressed” and asks if anyone is missing
a cat, implying concern for the well-being of the animal.
The image shows a white cat that appears to be alone,
further emphasizing the sentiment of concern.
Prompt:
This image is associated with the following caption:
“#nowplaying ELISA - ebullient future / rainbow pulsation
~ THE BEST OF ELISA ~”. What sentiment does this
combination convey? Positive, neutral or negative?
GPT-4V:
The sentiment conveyed by the combination of the image
and the caption is positive. The caption indicates that a
song (“ebullient future”) by ELISA is currently being
played, suggesting appreciation or enjoyment of the
music. The image itself, featuring bright colors, rainbows,
and a sky background, further promotes a positive and
uplifting vibe. The title “rainbow pulsation ~ THE BEST OF
ELISA ~” on the album cover suggests a collection of the
artist’s best songs, which also adds to the positive
sentiment.
Prompt:
This image is associated with the following caption: “RT
@USER: #PakvInd Imran ’s message to team: Abandon
fear of losing via @USER”. What sentiment does this
combination convey? Positive, neutral or negative?
GPT-4V:
The sentiment conveyed by the combination of the image
and the caption is primarily positive. The caption
suggests that Imran’s message to the team is to let go of
the fear of losing, which implies encouragement and a
focus on giving one’s best without the burden of potential
failure. The image shows two individuals in sunglasses,
appearing thoughtful and determined, reinforcing the
serious and motivational tone of the message. Overall, the
sentiment can be considered positive, emphasizing
confidence and resilience.
Image-text Sentiment Correlation
Prompt:
This image is associated with the following caption:
“Pakistan players ecstatic after PCB credited 20 rupees
bonus in their accounts #IndvsPak”. What sentiment does
this combination convey? Positive, neutral or negative?
GPT-4V:
The sentiment conveyed by the combination of the image
and the caption is mixed. The image shows Pakistan
cricket players in a jubilant mood, which is positive.
However, the caption humorously suggests that their
happiness is because of a trivial monetary bonus, which
could be perceived as mocking or sarcastic, thus adding
a negative sentiment. Overall, the sentiment could be
seen as neutral to negative due to the sarcastic nature of
the caption.
Figure 4: Qualitative results on image-text sentiment correlation. GPT-4V explicitly explains the
interplay between the image and text. The
image-text relation interpretations
are highlighted. Refer
to Section 2.1 for detailed discussions.
9
Prompt:
This image is associated with the following caption: “RT
@USER: The NSA just captured the evil side of Valentine’s
Day in this perfect tweet”. What sentiment does this
combination convey? Positive, neutral or negative?
GPT-4V:
The sentiment conveyed by the combination of the image
and the caption is negative. The mention of the “evil side
of Valentine’s Day” in conjunction with the NSA logo
suggests a critical or skeptical viewpoint regarding the
NSA’s actions or communications.
Prompt:
This image is associated with the following caption:
“EXPOSED: free fortune cookie SCAM at uwindsor
marketplace #thatsnotafortune #appalled #fortuneless
#cookiewasgoodtho”. What sentiment does this
combination convey? Positive, neutral or negative?
GPT-4V:
The sentiment conveyed by the combination of the image
and the caption is negative, with a touch of humor
towards the end. The user expresses disappointment with
the non-traditional fortune message, but adds a light-
hearted note about the cookie tasting good.
Prompt:
This image is associated with the following caption: “I
literally just finished eating all the chocolate in my house,
and my dad comes by with this .? D’awe.”. What sentiment
does this combination convey? Positive, neutral or
negative?
GPT-4V:
The sentiment conveyed by the combination of the image
and the caption is positive. The image shows a box of
Ferrero Collection chocolates and a card, suggesting a
gift or treat. The caption conveys a sense of pleasant
surprise and appreciation for the unexpected gift,
especially after the mention of having just finished all the
chocolate. The use of “D’awe” further emphasizes a
feeling of warmth and gratitude.
Prompt:
This image is associated with the following caption: “RT
@USER:"@USER: An ex-girlfriend scorned is capable of
anything . #LNSM bahahha”. What sentiment does this
combination convey? Positive, neutral or negative?
GPT-4V:
The combination conveys a humorous sentiment, playing
on stereotypes associated with country music themes.
The image itself presents a joke that ties a real-life event
with the thematic elements of heartbreak and revenge
often found in country songs. The caption reinforces the
humor, suggesting that an "ex-girlfriend scorned is
capable of anything." The use of "bahahha" emphasizes
the comedic intention. Overall, it's a playfully positive
sentiment built around a comedic perspective on the
situation described.
Contextual Acuity in Sentiment Understanding
Figure 5: Qualitative results on contextual acuity in sentiment understandng. GPT-4V generates
responses that demonstrate cultural awareness and are contextually relevant within the context of
social media. The components that are related to
contextual understanding
are highlighted. Refer
to Section 2.1 for detailed discussions.
10
Prompt:
This image is associated with the following caption: “A
#boyhood like no other @USER #thrilling #dark #mystery
#horror #depraved #faurtale”. What sentiment does this
combination convey? Positive, neutral or negative?
GPT-4V:
The sentiment conveyed by the combination of the image
and the caption is positive towards the content (as it
praises or promotes the book/movie/show) but the
themes or subject matter are dark and intense, as
suggested by terms like “thrilling,” “dark,” “mystery,”
“horror,” “depraved,” and “fairytale.” The overall mood
implied by the themes is negative, but the sentiment
towards the work itself is positive.
Prompt:
This image is associated with the following caption: “RT
@USER: @USER Thought you might like my image of a
wolf I recently finished”. What sentiment does this
combination convey? Positive, neutral or negative?
GPT-4V:
The sentiment conveyed by the combination of the image
and the caption is primarily neutral to positive. The image
showcases a wolf with a focused and intense gaze, which
can be interpreted as a display of beauty and majesty of
nature. The caption indicates that someone is sharing
their work with another person, suggesting pride in their
artwork and a desire to share it. The context is not overtly
positive or negative but leans towards a positive
sentiment due to the sharing and appreciation of art.
Prompt:
This image is associated with the following caption: “RT
@USER: Barclays Egypt pumps EGP 250m for small,
medium projects through 2015 -”. What sentiment does
this combination convey? Positive, neutral or negative?
GPT-4V:
The combination conveys a neutral sentiment with a
slightly positive undertone. The caption indicates that
Barclays Egypt has invested in small and medium
projects, which can be seen as a positive action for
economic development. However, without additional
context or emotive language, the overall sentiment
remains relatively neutral.
Fine-grained Sentiment Assessment
Prompt:
This image is associated with the following caption:
“Cheeky close up #tattoo #sad #blue #bleak #robtallon
#tattooer #cheltenham #uk #bearded #nofilter
#traditionaltattoo”. What sentiment does this combination
convey? Positive, neutral or negative?
GPT-4V:
The combination conveys a neutral to negative sentiment.
While the caption mentions "cheeky close up" suggesting
a light-hearted approach, terms like "sad," "blue," and
"bleak" imply negative emotions. The image of the tattoo
also depicts a sad and melancholic expression, reinforcing
this sentiment.
Figure 6: Qualitative results on fine-grained sentiment assessment. GPT-4V showcases a higher level
of sophistication in sentiment analysis, extending beyond the basic categorization of positive, neutral,
and negative sentiments. The components that are related to
fine-grained sentiment
are highlighted.
Refer to Section 2.1 for detailed discussions.
11
2.2 Hate Speech Detection
2.2.1 Task Setting and Preliminary Quantitative Results
Hate speech refers to speech or statements, that discriminate against, threaten, or incite violence
or prejudicial actions against individuals or groups based on attributes such as their race, religion,
ethnicity, sexual orientation, gender, or disease [
32
]. Multimodal hate speech detection involves
identifying hate speech or offensive content in multiple types of media such as text, images, and
videos [
50
,
51
,
33
,
13
]. Multimodal data often present challenges due to the inherent variability
in annotations stemming from subjective interpretation. The task of drawing inferences from a
combination of images and text introduces a level of complexity that surpasses text-only analysis [
21
].
Prompt:
You are a hate speech detection
expert system, in order to protect the
friendly atmosphere of social
networks. Next, I will give you 4
pictures in order, please judge
whether these pictures are hateful or
not-hateful?!
[Input images]
Figure 7: The prompt example
of the quantitative experiment for
hate speech detection.
To quantitatively analyze the GPT-4V’s ability for multi-
modal hate speech detection, we choose two existing bench-
mark datasets, including
HatefulMemes
[
32
] and
4chan’s
posts
[
22
].
HatefulMemes
is a multimodal dataset for hate-
ful meme detection. It is deliberately designed to pose challenges
for unimodal models. To achieve this, the dataset incorporates
difficult examples, often referred to as “benign confounders, in-
tentionally introduced to complicate the reliance on unimodal
signals.
4chan’s posts
focuses on the multimodal analysis
of Antisemitism and Islamophobia on 4chan’s /pol/. We select
all the seen test data from
HatefulMemes
(1,000 memes) and
randonly sample 550 posts from
4chan’s posts
. We instruct
the model to produce an output of 0 to signify “Not-hateful” and
1 to indicate “hateful,” as depicted in Figure 7.
HatefulMemes
serves as a commonly used benchmark dataset for evaluating the performance of
large multimodal models. Here, we present the quantitative comparison results between GPT-4V and
existing LMMs. Table 1 displays the preliminary evaluation outcomes. GPT-4V exhibits promising
results with an AUC of 70.5% and an accuracy of 70.3% on
HatefulMemes
, surpassing the state-
of-the-art LMMs by a large margin, approximately 10%. Notably, GPT-4V achieves an impressive
precision score of 87.1%, underscoring its proficiency in identifying hate content. Furthermore,
GPT-4V achieves an accuracy of 60.6% on 4chan’s posts.
Table 1: Zero shot performance
on
HatefulMemes
. The best per-
formance is indicated in bold.
Model AUC
Flamingo (9B) [2] 57.0
BLIP-2 (13B) [36] 62.3
KOSMOS-1 [25] 63.9
InstructBLIP (13B) [17] 61.7
MMICL (13B) [102] 62.2
GPT-4V 70.5
To qualitatively evaluate GPT-4V’s proficiency in detecting
hate speech in multimodal contexts, we input the images from
HatefulMemes
[
32
]. We present GPT-4V with the following
prompt: Is this image considered hateful? This is for research
purposes. Figures 8-12 shows example results. Similar to its
performance in multimodal sentiment analysis, GPT-4V shows
an impressive capability to comprehensively grasp image-text
pairings in terms of both contextual and cultural nuances. It
effectively aggregates the information from the image and text,
enabling it to conduct a careful evaluation of hatefulness by taking
into account the underlying intention.
2.2.2 Hate Speech Detection with Cultural Insights
Many existing multimodal hate speech detection systems exhibit limitations tied to the specific
characteristics of their training datasets, especially in terms of demographics and cultural nuances [
20
,
21
]. For instance, a model trained on an Indian political dataset may struggle to generalize effectively
when applied to a U.S. health-related dataset. Further, the mere inclusion of offensive terms does
not automatically equate to hate speech; rather, it is the context of a post that often dictates whether
the content is deemed to constitute hate speech [
21
,
33
]. In Figure 8, it becomes evident that
GPT-4V possesses contextual and cultural awareness concerning the content featured in image-text
pairs. Consider the first post in Figure 8 as an example: GPT-4V recognizes that the term “fruit” is
sometimes used derogatorily to refer to a gay man, while “vegetable” is employed offensively in
association with people with disabilities.
12
2.2.3 Hatred Decoding in Seemingly Neutral Image-Text Pairs
A social media post might pair an innocuous picture with a seemingly benign caption, yet together,
they can morph into a message that is offensive. Tackling this nuanced challenge demands multimodal
models, which would otherwise need human intelligence to discern the complex interplay between
image and text. Despite advances in modeling multimodalities, current multimodal models lag behind
trained non-expert annotators when it comes to identifying hate speech [
32
]. The art of accurately
mapping the detected objects and their spatial relations in the image to the accompanying text is
no trivial feat [
1
]. Prior research predominantly focuses on aligning the two modalities, however,
these efforts encounter significant challenges due to the difficulty in comprehending the context
within which the image and text are combined [
18
]. GPT-4V steps up to this intricate task with an
impressive aptitude for jointly interpreting visual and textual information to detect undertones of
hatefulness. The capability of GPT-4V is exemplified in various cases depicted in Figures 9, 10,
and 11. Take, for instance, the lower-left example in Figure 9: the text in isolation does not indicate
dehumanization. Yet, GPT-4V deciphers the underlying contempt, interpreting the phrase as one
that dehumanizes an individual in the photo by likening them to farm equipment. Furthermore, in
Figures 10 and 11, we delve deeper by presenting GPT-4V with targeted follow-up prompts that
query how the visual elements contribute to the overall impression of hatefulness. This exercise
demonstrates GPT-4V’s sophisticated analysis; it recognizes the transformative impact of text on an
image that is otherwise neutral. Even when the image alone does not convey hate, GPT-4V detects
how the added text might shift its context, revealing the potential for offensive connotations. While it
is not infallible in detecting every subtle correlation, GPT-4V’s advanced multimodal comprehension
marks a significant step toward understanding the nuanced dynamics of image-text interactions.
2.2.4 Nuanced Assessment of Potential Hate Speech
Assessing the intention behind a post is critical in evaluating its potential for hatefulness, as the intent
shapes the context and intrinsic meaning of the deployed words and images. These elements can
adopt various interpretations based on their usage context. Correctly identifying this context is where
intention plays a pivotal role. A seemingly offensive phrase could, in fact, be a satirical, educational, or
commentary excerpt from a movie or a book when framed within its appropriate context. As shown in
Figure 12, not all image-text pairings exhibit clear signs of hatefulness or offensiveness in the absence
of additional context. GPT-4V adeptly recognizes such neutrality. Significantly, it goes further to
elucidate that, under more rigorous scrutiny of their underlying intent, these combinations may indeed
convey offensive or hateful undertones through subtly demeaning or derogatory expressions. In this
light, GPT-4V underscores the imperative of integrating broader dimensions into the analysis of hate
speech, a trajectory that is critical for future research on augmenting the evaluation process with
supplementary information [68].
2.2.5 Slang and Subtext Recognition
Misspelling is widely existing in social media and it plays a critical role in expressing the meaning of
a sentence. A subtle difference could cause the semantic deviation [
6
]. In case of hatefulness, mis-
spelling can potentially reduce the hateful information because key phrases conveying disrespectful,
racial, and other negative messages aren’t detected. However, even though such a difference is subtle,
GPT-4V could still capture it. In Figure 13 and Figure 14, phrases such as “nigger, “immigrant,”
“Muslims,” and “white” are misspelled with “nigero, “emmi-grunts,” “Muslyns, and “wite”, respec-
tively. The misspellings have similar pronunciations but look different than original phrases. GPT-4V
can recognize them as the original phrases and understand the whole message again with the original
phrases. Note that it is possible that background information helps GPT-4V capture the misspellings.
For example, in Figure 13, GPT-4V analyzes the background information such as “policy” and “stole
land,” which is associated with the topic of misspelled phrases.
13
Prompt:
Is this image considered hateful content?
GPT-4V:
The image shows a group of men with a caption "the
great train robbers." It appears to be a reference to a
notorious criminal event, likely the Great Train Robbery
that occurred in the UK in 1963. The image itself is not
inherently hateful; it depicts a group of men, likely meant
to be a historical reference or perhaps a part of a
reenactment or promotion for a book or film about the
event. The term "hateful content" typically refers to
material that promotes hate or violence against
individuals or groups based on characteristics such as
race, ethnicity, religion, disability, sexual orientation, or
gender. Without additional context, this image does not
appear to promote hate or violence against any such
group and does not contain hateful content as per the
general definition. However, context is key in these
matters. If the image were being used to glorify criminal
activity or to promote violence, that could change the
interpretation of whether the content is hateful. From the
image provided and the information within it, there is no
direct indication of such an intent.
Prompt:
Is this image considered hateful content?
GPT-4V:
The image provided shows a person at what appears to
be a pride event, identifiable by the rainbow colors
typically associated with LGBTQ+ pride. However, the
text overlay contains the phrases "if I had a gone" and "i
could convert this fruit to vegetable," which do not make
clear sense in English and could be interpreted as
nonsensical or as a poor attempt at humor. If the text is
attempting to use the word "fruit" in a derogatory way to
refer to a gay person, which is a dated and oensive
slang, and "convert" in the sense of changing someone's
sexual orientation, it would be considered oensive and
inappropriate. The term "vegetable" can also be
considered insensitive if it's used to refer to someone in a
derogatory way, especially in contexts referring to people
with disabilities. Without clear context, it is dicult to
provide a definitive assessment, but if the intent is to
ridicule or demean based on sexual orientation or to
suggest the possibility of conversion therapy, which is
widely discredited and considered harmful, it would be
considered oensive and inappropriate. It is crucial to
communicate respectfully and avoid language that can be
harmful or disrespectful to individuals and their identities. Prompt:
Is this image considered hateful content?
GPT-4V:
The image shows a person with a joyful or possibly
intense expression, typically associated with celebration
or strong emotion, such as one might exhibit when a
favored sports team scores a goal. The caption "when
your world cup team scores a goal" reflects a common
experience shared by many sports fans around the world
during significant matches, such as those in the FIFA
World Cup. This image, without additional context, does
not appear to contain hateful content. It seems to be
intended to represent a moment of excitement or
happiness in sports. Hateful content typically involves
promoting hatred, violence, or discrimination against
individuals or groups, and there's no indication of that in
this image based on the information provided.
Prompt:
Is this image considered hateful content?
GPT-4V:
The image you've shared features the Twin Towers with
text overlay that seems to reference the September 11
attacks and implies a group of people, not specified,
should not be voted into government. This message
could be interpreted as promoting a political stance
based on the association of a group of people with a
traumatic event, which can be divisive or perceived as
targeting or excluding specific groups based on
nationality, ethnicity, or religion. It can potentially be seen
as promoting xenophobia or collective blame, which can
be harmful and oensive. The interpretation of whether it
is hateful may depend on the context and the policies of
the platform it's shared on, but it certainly has the
potential to be inflammatory.
Hate Speech Detection with Cultural Insights
Figure 8: Qualitative results on hate speech detection with cultural insights. GPT-4V’s responses
exhibit a profound cultural awareness and a strong grasp of context when it comes to addressing
potential hate speech-related content. The components that are related to
cultural insights
are
highlighted. Refer to Section 2.2 for detailed discussions.
14
Prompt:
Is this image considered hateful content?
Prompt:
Is this image considered hateful content?
Hatred Decoding in Seemingly Neutral Image-Text Pairs (1)
GPT-4V:
The image itself is not inherently hateful; it shows a
diverse group of women smiling and embracing each
other, suggesting a positive message of unity and
friendship. However, the overlaid text "yellow black red
and normal" is inappropriate because it uses racial
descriptors for some individuals while implying that one
group is "normal," suggesting a standard from which
other races deviate. This kind of language is oensive and
divisive, promoting a skewed and prejudiced view that
one race is the baseline of normality. Using colors to
describe races is considered outdated and insensitive,
and doing so while contrasting them with "normal" to
implicitly refer to white individuals as a default standard is
racially insensitive and would be considered oensive by
contemporary standards. This kind of message can
perpetuate harmful stereotypes and contribute to racial
discrimination.
GPT-4V:
Yes, the content of the image you provided could be
considered hateful. It appears to contain an antisemitic
message, which is a form of hate speech directed against
Jewish people. Such messages can perpetuate harmful
stereotypes and contribute to a culture of discrimination
and intolerance. It's important to address and challenge
such content in a responsible and constructive manner.
Follow-up Prompt:
How do you know that it is targeting Jewish people?
GPT-4V:
The image contains a caricature of a person with features
that are commonly associated with stereotypical
depictions of Jewish individuals: a beard, a hat that
resembles a style sometimes worn by Jewish men, and a
Star of David, which is a symbol of Judaism. Additionally,
the figure is holding a menorah, which is a candelabrum
used in Jewish worship, particularly during the holiday of
Hanukkah. The text accompanying the image makes a
derogatory statement implying that Jewish people are
naturally disliked and that laws preventing hate against
them are necessary only due to their inherent
characteristics, which is a clear example of antisemitism.
This type of message promotes harmful stereotypes and
discrimination against Jewish people, which is why it can
be classified as hateful content.
Prompt:
Is this image considered hateful content?
GPT-4V:
Yes, the image and caption you've shared can be
considered hateful content. The caption uses a phrase
that dehumanizes the individual in the photo by referring
to a person as "farm equipment," which is inappropriate
and oensive. This type of language has historical
connotations of racism and dehumanization. Such
content is generally against the policies of many platforms
and is disrespectful and harmful.
Figure 9: Qualitative results on hatred decoding in seemingly neutral image-text pairs. GPT-4V
leverages both images and text to assess whether the content is indicative of hatefulness. The
components that are related to the
joint understanding of image and text
are highlighted. Refer to
Section 2.2 for detailed discussions.
15
GPT-4V:
The image you've presented depicts a snowy landscape
and is overlaid with text that states, "just admit it white
America was better." The landscape itself is not inherently
oensive; it's a neutral winter scene. However, the text
overlaid on this image is likely to be considered oensive
and potentially hateful because it appears to suggest that
America was better when it was dominated by white
people, which can imply a preference for a time when
racial inequalities were more pronounced and
institutionalized. This message can be seen as
disregarding the historical context of racial discrimination
and the progress made toward equality and civil rights.
It's important to be aware that such statements can be
deeply hurtful and divisive, perpetuating a sentiment that
is considered racist and exclusionary.
Follow-up Prompt:
Does the visual content contribute to the hatefulness?
GPT-4V:
In the context of the text provided, the visual content of a
snowy landscape does not contribute to hatefulness. The
image alone is neutral, featuring a typical winter scene
without any inherently oensive or provocative elements.
It is the text that carries a message that could be
interpreted as promoting racial division or expressing
nostalgia for a time of unequal racial dynamics, which
could be considered oensive or hateful. The landscape
in this case does not have a direct correlation with the
text's message, unlike scenarios where the image itself
might depict symbols or scenes that are directly
associated with hate or discrimination.
GPT-4V:
The text in the image is suggesting a racially charged and
exclusionary viewpoint, which could be considered
oensive and divisive. It seems to glorify a past time
period in America that was marked by segregation and
systemic racism, implying that this was a better time.
Such statements can be hurtful and dismissive of the
experiences of people who were marginalized and
discriminated against during those times. It is crucial to
be mindful of how statements like these can perpetuate
harmful biases and be disrespectful to the diverse society
that makes up modern America.
Follow-up Prompt:
Does the visual content contribute to the hatefulness?
GPT-4V:
The visual content of the image, showing a seemingly
idyllic family picnic scene from an era that looks to be the
mid-20th century, when considered alone, is not
inherently hateful. It is a representation of a family
enjoying time together. However, the text overlaid on the
image changes the context and implies a message that
the time period depicted—a time when racial segregation
and discrimination were legally sanctioned in many parts
of the United States—was "better." The juxtaposition of
the image with the text can contribute to the harmful
message by idealizing a past that was fraught with
inequality and injustice for many individuals, particularly
people of color. This can be oensive as it seems to
dismiss or overlook the significant struggles and
injustices faced by minorities during that time. The harm
in the content comes from this combination of text and
image that together suggest an exclusionary or racially
insensitive viewpoint.
Prompt:
Is this image considered hateful content?
Prompt:
Is this image considered hateful content?
Hatred Decoding in Seemingly Neutral Image-Text Pairs (2)
Figure 10: Qualitative results on hatred decoding in seemingly neutral image-text pairs. These two
examples illustrate that even in cases where the image itself is not hateful, GPT-4V demonstrates
a deep understanding of the interplay between images and text, enabling it to identify whether the
text could alter the overall context, although it may not always detect potential correlations. The
components that are related to the
joint understanding of image and text
are highlighted. Refer to
Section 2.2 for detailed discussions.
16
GPT-4V:
The image shows a meme with a man, who appears to be
a chef or a television personality, animatedly speaking to
another individual, with an overlaid caption that makes a
humorous statement about the quality of a pizza. The
content does not appear to be hateful; it seems to be
intended as a joke about the standards of Italian cuisine
and the reputation Italy has for its food, particularly pizza.
However, humor can be subjective, and while the image is
not inherently hateful, it's always important to consider
context and audience when sharing such content.
Follow-up Prompt:
What does the goat contribute to the hatefulness, if any?
GPT-4V:
The goat itself in the image is neutral; it is simply an
animal and does not inherently contribute to hatefulness.
However, the context created by the overlaid text can be
problematic if it is perceived as ridiculing a figure or name
held sacred by a particular group of people. In this case, if
the name "Mohammed" is intended to reference the
Islamic prophet, then using an image of a goat along with
such text can be seen as an attempt to demean or insult
a figure revered in Islam. This could be considered
disrespectful and oensive by those who hold the name
and the religious figure it represents in high esteem. The
combination of the image and the text creates a context
that can be interpreted as disrespectful or mocking,
which could be perceived as hateful or incendiary.
GPT-4V:
The image depicts three camels with the text "trust me...
they're all terrorists." This image contains a statement
that could be considered oensive and promoting harmful
stereotypes, as it seems to be making a sweeping
generalization that could be interpreted as associating
camels with terrorism. The image and the text together
appear to play on negative and prejudicial associations
and could be deemed as promoting hate or intolerance.
It's important to be aware of the impact that such content
can have, as it might contribute to spreading harmful
stereotypes or encouraging discrimination.
Follow-up Prompt:
What do camels contribute to the hatefulness, if any?
GPT-4V:
In the image you've presented, the camels are being
anthropomorphized — that is, they are depicted as if they
have human characteristics — with the added text
implying they are saying "trust me... they're all terrorists."
The use of camels in this context likely draws upon
harmful stereotypes that link certain animals commonly
found in the Middle East with terrorism, a stereotype that
broadly and inaccurately associates Middle Eastern
people and cultures with violence and extremism. By
using camels in this manner, the image may be
attempting to satirize or mock those stereotypes, but it
also runs the risk of reinforcing the very prejudices it
might be trying to critique. The impact of such content
can vary greatly among dierent audiences, but it has the
potential to contribute to the spread of hateful sentiments
by using a stereotype that has been historically used to
discriminate against or dehumanize a group of people.
Prompt:
Is this image considered hateful content?
Prompt:
Is this image considered hateful content?
Hatred Decoding in Seemingly Neutral Image-Text Pairs (3)
Figure 11: Qualitative results on hatred decoding in seemingly neutral image-text pairs. The
components that are related to the
joint understanding of image and text
are highlighted. Refer to
Section 2.2 for detailed discussions.
17
Prompt:
Is this image considered hateful content?
Nuanced Assessment of Potential Hate Speech
Prompt:
Is this image considered hateful content?
GPT-4V:
The image you've provided shows two police ocers, a
man and a woman, and includes a caption that reads
"how black women fight crime". On its own, without any
additional context, it's a straightforward depiction of
police ocers, who could be representative of those
working in law enforcement. The caption could be
interpreted in various ways depending on context—it
could be intended to be factual, stating that black women
fight crime by serving as police ocers, or it could be
seen as a setup for a stereotype or joke, which could be
potentially oensive depending on the implied or stated
punchline. However, if the intention behind the caption is
to stereotype, diminish, or mock the role of black women
in law enforcement, it could be considered oensive and
potentially hateful. It is important to consider the context
in which such an image and caption are used to fully
assess the intent and impact it may have.
GPT-4V:
The image itself is of a woman smiling and jogging,
holding a water bottle, with a natural green backdrop.
There is nothing inherently hateful about the image. The
caption "things I love to hunt" does not immediately seem
to connect with the image in a literal or usual context.
Without additional context, the caption could imply many
things, and its interpretation can vary widely. It may imply
that the woman enjoys the challenge of a hunt in a
metaphorical sense, perhaps in the pursuit of personal
goals, fitness achievements, or other non-literal "hunts". #
If the caption were intended to imply something else,
especially something derogatory or demeaning, then it
could be considered oensive. Context is key to
determine the appropriateness or intent of the message.
Prompt:
Is this image considered hateful content?
GPT-4V:
The image shows a person laughing with the text overlay
indicating they heard a funny joke about dishwashers.
This image, as it is presented, is not inherently hateful; it
depicts someone reacting to humor. Without additional
context, it is just an expression of amusement. However,
if the implied joke were to rely on stereotypes or
derogatory themes, it could be considered oensive or
inappropriate by some. The content of the actual joke,
which is not included in the image, would be necessary to
make a determination about whether it could be
considered hateful or not.
Prompt:
Is this image considered hateful content?
GPT-4V:
The image depicts a person peeling o a facial mask, with
a caption that reads "when you've decided to start a life
of crime." This appears to be a humorous take on the
application of a skincare mask, likening it to a
stereotypical 'burglar's mask'. The humor attempts to
play on the visual similarity between the black facial
skincare mask and the kind of mask that might be worn in
popular culture's depiction of a thief or criminal. #
This image isn't considered hateful content as it does not
target or discriminate against any individual or group, nor
does it promote hate or violence. It seems to be intended
as a light-hearted joke, although humor can be subjective
and sometimes misunderstood.
Figure 12: Qualitative results on nuanced assessment of potential hate speech. GPT-4V exhibits a
careful evaluation of posts that may not initially appear hateful but could potentially contain offensive
or hateful content based on their underlying intent, which may be demeaning or derogatory. The
components that are related to the
nuanced assessment
are highlighted. Refer to Section 2.2 for
detailed discussions.
18
Slang and Subtext Recognition (1)
Prompt:
The caption for this image is: Calling you nigero now! !
Help me determine if the image and caption together
express hateful information.
GPT-4V:
The image you've provided appears to show a community
event where a police ocer and some young people are
engaging in a dance. Such scenes can reflect positive
community-police interactions and activities meant to
foster good relationships and community spirit.!
The caption, however, contains language that is not
appropriate. Specifically, the word "nigero" seems to be a
misspelling or a variation of a racial slur that is deeply
oensive and discriminatory. When paired with the
caption, the image could be misinterpreted or used to
convey a message that is hateful or harmful, which is not
acceptable.!
It's crucial to be mindful of the context and the potential
impact of pairing images with text, especially when the
language used could be considered oensive or
disrespectful. It's important to foster respectful and
positive communication, particularly in the context of
community events and interactions between law
enforcement and community members.
Prompt:
The caption for this image is: Stop illegal emmi-grunts
they're taking the land we stole. !
Help me determine if the image and caption together
express hateful information.
GPT-4V:
The caption uses language that could be considered a
satirical or ironic comment on the history of land
acquisition and current immigration debates. It seems to
be making a point about the contradictions in the rhetoric
used in some debates on illegal immigration. The use of
the word "stole" in the context of land may refer to
historical conquests or colonialism, which is a sensitive
subject.!
The term "emmi-grunts" is a nonstandard spelling of
"immigrants" and could be viewed as pejorative or
demeaning, depending on the intent and context in which
it is used. Using derogatory language when referring to
immigrants can be oensive and is often considered
inappropriate.!
While the image is just a photo of an individual and does
not contain hateful content on its own, when combined
with the caption, it may be seen as perpetuating a
dismissive or irreverent stance on serious and complex
issues. It's essential to discuss such matters with
understanding and respect for the profound impact they
have on people's lives.!
Figure 13: Qualitative results on slang and subtext recognition. The misspelled phrase on the left is
nigger
, and on the right is
immigrant
. How GPT-4V captures the misspelling and the position of
the misspelling are highlighted. Refer to Section 2.2 for detailed discussions.
19
Prompt:
The caption for this image is: I'm this pretty because both
my parents are wite. !
Help me determine if the image and caption together
express hateful information.
Prompt:
The caption for this image is: Muslyns moved in... but
they're not moving out.. !
Help me determine if the image and caption together
express hateful information.
GPT-4V:
The caption provided with the image seems to be a play
on words, possibly a misspelling or a fictitious variation of
"Muslims." If that is the case, and it is intended to
reference the Muslim community, then the statement
could be considered inflammatory and oensive, as it
suggests a negative stereotype associating a group with a
destructive event. Such insinuations can contribute to
discrimination and are considered inappropriate.!
However, without further context to confirm the intent
behind the caption, it's dicult to determine the exact
nature of the message. Nonetheless, it's essential to
communicate with respect and avoid language that could
be interpreted as singling out or blaming any community
or group for negative incidents, as this can perpetuate
harmful biases and stereotypes.
GPT-4V:
The caption provided with the image makes a statement
about the person's appearance based on their parents'
race, which can be interpreted as a racially insensitive
remark. It suggests that the individual's attractiveness is
due to having parents of a specific race, which can be
oensive as it implies a superiority based on racial
characteristics. This kind of racial commentary is
generally considered inappropriate and can contribute to
harmful stereotypes or biases.!
The image itself is a portrait of a person posing and
smiling at the camera, and does not contain any hateful
content on its own. However, when combined with the
caption, it can be seen as perpetuating a problematic
message about race and beauty standards. It's important
to be mindful of the implications of attributing qualities
such as beauty to race and to avoid making such
associations.
Slang and Subtext Recognition (2)
Figure 14: Qualitative results on slang and subtext recognition. The misspelled phrase on the left is
Muslims
, and on the right is
White
. How GPT-4V captures the misspelling and the position of the
misspelling are highlighted. Refer to Section 2.2 for detailed discussions.
20
2.3 Fake News Identification
2.3.1 Task Setting and Preliminary Quantitative Results
The ability to identify whether a social media post is authentic or fabricated provides valuable
assistance to fact-checkers [
53
,
14
,
49
,
24
]. However, acquiring more nuanced information
categorizing content as true, satire/parody, misleading, manipulated, falsely connected, or imposter
can be significantly more beneficial [
59
]. Additionally, the issue of cross-modal inconsistency in
social media content presents a considerable challenge in multimodal environments, necessitating
advanced detection strategies [
28
,
29
,
74
]. The dataset we use here is
FakeNewsNet
[
72
], which
contains two categories of news articles: “gossip” and “political news”. The number of test samples
used in the preliminary evaluation is 104 and 500, respectively. GPT-4V achieves an accuracy of
57.2% on the “gossip” dataset and 60.6% on the “political news” dataset. Based on our experiments,
we have found that GPT-4V can assess the authenticity of news from various perspectives. Example
qualitative results are outlined in the subsequent sections.
2.3.2 Tone and Language Analysis for Authenticity
GPT-4V has the ability to construct a chain of thought based on tone as a basis for assessing fake
news. Previous research [
63
] finds that some fake news stories are written in a specific linguistic tone,
but it is inconclusive to say which one is negative, positive, or neutral, not to mention the writing
style. However, as shown in Figure 15, GPT-4V naturally demonstrates the ability to determine the
authenticity of news content based on tone and language style. We are uncertain if there were similar
tasks in the pre-training corpus, but this capability exhibited by GPT-4V seems to address previously
seemingly insurmountable challenges in fake news identification. At the same time, GPT-4V is also
capable of providing high-level summaries, categorization, and discernment of input content, not
solely relying on the story itself, to assess the similarity between the input corpus and real news
content. This demonstrates GPT-4V’s multifaceted understanding of the definition of “news.”
2.3.3 Celebrity Knowledge Inquiry
GPT-4V has the capability to make factual inferences based on specific entities it has learned, which
may not only be abstract concepts (e.g., “news”) but also concrete individuals or actual events that
have occurred. As shown in Figure 16, GPT-4V can judge the authenticity of input news content
based on the information it has learned about Brad Pitt and Angelina Jolie in its “knowledge. In other
words, if the news content aligns with the information in its “knowledge, GPT-4V is more likely
to consider the input as a genuine news story. Conversely, it would be less inclined to do so if the
information does not match its knowledge. However, this could also potentially lead to misjudgments
by GPT-4V due to its underlying “stereotype” or preconceived notions, causing it to make incorrect
assessments of some “unexpected” news.
2.3.4 Source Credibility Profiling
Similar to the previous characteristic, GPT-4V also seems to have established “profiles” for different
information sources and use them as a basis for assessing the authenticity of news. As shown in
Figure 17, ABC News is considered a credible source by GPT-4V, while Gossip Cop appears to be
less trustworthy to GPT-4V. This could be influenced by the fact that in the training corpus, Gossip
Cop often co-occurs with keywords like “fake news.” This also reflects GPT-4V’s strong memory
and attribution capabilities. Similarly, as mentioned earlier, these “profiles” often carry significant
bias due to differences in the training data, ultimately leading to incorrect results.
2.3.5 Cross-Referencing from Broad Information Spectrum
At the same time, GPT-4V can use a much richer background of information and concepts to verify
news content and perform various fact-checking tasks. This includes public information in areas such
as politics, law, and economics, and even includes the political stances of public figures and historical
stock market data, as shown in Figures 18 and 19. This strategy makes the results more credible,
but it appears that we cannot always control GPT-4V to follow this pattern of verification. This also
presents a new challenge of how to encourage GPT-4V to utilize as much known information as
possible when assessing the authenticity of input news.
21
Tone and Language Analysis for Authenticity
GPT-4V:
The caption includes informal language, emojis, and a casual style of writing ("all loved up and canoodling", "Taymerica
inflatable pool slide"). This is more characteristic of informal blog posts or social media content rather than traditional
journalism, which usually maintains a more neutral and formal tone. !
The caption references specific products and includes a price and a link to shop, which is typical of sponsored content
or fashion blogs but not standard in news reporting. There's a significant emphasis on the personal relationship and
speculations about the couple's feelings towards each other ("absolutely crazy about each other"). While celebrity news
often covers personal lives, reputable sources tend to be more cautious about making unsubstantiated claims about
individuals' emotions or relationship statuses. !
The caption makes a speculative political statement about Taylor Swift's choices during a specific presidency. Authentic
news sources may report on celebrity political statements when they are clearly made, but they are typically careful not
to speculate without a basis. Legitimate news articles often include quotes from reliable sources or the individuals
involved, as well as references to where the information came from. !
This caption seems to lack such verification. Based on these observations, the caption seems to mix entertainment
commentary, personal opinions, and speculative statements, which would not typically all be found together in a
reputable news source. While this doesn't necessarily mean that the events didn't happen, it does suggest that the
caption may not adhere to journalistic standards of verification and neutrality.!
Prompt:
Is this news true or false? Why? This news including an image is associated with the following new caption: !
“Taylor Swift Spent Her Fourth of July in Turks and
Caicos with Boyfriend Joe Alwyn Is it even a proper
Independence Day celebration without fireworks, Stars
and Stripes face paint, and photos from one of Taylor
Swift's infamous Fourth of July parties? Yes it is, but still
the Internet had to do without the latter in 2018, because
this was the second year in a row that her home in Watch
Hill, Rhode Island was empty on America's birthday. (In
other words, perhaps Taylor not celebrating the 4th
during Trump's presidency is her biggest political
statement yet?)!
Instead of holding one of her #squad-powered summer bashes this year, T-Swift chose to spend a break from her
current world tour in Turks and Caicos with her boyfriend actor Joe Alwyn.!
The couple were spotted all loved up and canoodling on the beach and in the clear tropical waters. And yes, while
Swift abandoned America to vacation a British territory islands with her British beau, she was spotted sporting a red,
blue, and white striped bikini from swimwear brand Solid and Striped. America represent! !!
The Jamie Top solidandstriped.com $88.00 Shop Now!
Taylor and Joe allegedly first met at the 2016 Met Gala, but their ~relationship status~ wasn't confirmed until a year
later in May 2017. A source tells People, "Joe is just an amazing boyfriend. He is very supportive. Taylor adores his
family. She’s in a very good place.”!
On the few occasions they have been spotted together in public, the couple have always looked like they're absolutely
crazy about each other, but have made sure to hide their love from the public eye as much as possible. In fact, these
photos mark the first time Taylor and Joe have been spotted together since their lunch date in North London over a
month ago. In other words then, this is a far cry from those seemingly-posed, definitely cheesy beach pics with her
(also British) ex Tom Hiddleston.!
Well the two certainly look like they're having a lovely time on their little beach getaway, soaking in some sun and
snorkeling and hopefully enjoying colorful cocktails too. But a gigantic Taymerica inflatable pool slide would've made it
even better, right?! Then again, as we see in the photo, who needs a floatie when you've got a... dreamboat to lean on?!
Follow Jasmine on Instagram.”
Figure 15: Qualitative results on tone and language analysis for authenticity. GPT-4V can lever-
age the tone and language-use for fake news identification. The components that are related to
tone and language analysis are highlighted. Refer to Section 2.3 for detailed discussions.
22
Celebrity Knowledge Inquiry
GPT-4V:
Brad Pitt and Angelina Jolie, during their marriage, were known for their philanthropic eorts. The text mentions the
Jolie-Pitt Foundation, which was a real charitable organization founded by them. They were indeed reported to have
donated millions of dollars to various causes over the years. The specific figures and allocations mentioned in the text
are plausible based on their known philanthropic history. Brad Pitt did establish the Make It Right Foundation, which
focused on building sustainable housing in New Orleans after Hurricane Katrina, and Angelina Jolie has been involved
in humanitarian eorts and has served as a Goodwill Ambassador for the UNHCR."
The charities mentioned, such as Doctors Without Borders and Global Action for Children, are consistent with the types
of organizations they have supported in the past. Also, Jolie portrayed Marianne Pearl in the movie "A Mighty Heart,"
which aligns with the mention of a donation to the Daniel Pearl Foundation."
Regarding the authenticity of the text being linked to an article from the Hungton Post, it is common for tax filings of
public figures, especially those involved in charitable activities, to be reported by credible news outlets. However,
without access to the specific Hungton Post article or the tax filings mentioned, I cannot verify the exact figures or the
current charitable activities of Pitt and Jolie."
Given these historical contexts and the philanthropic patterns of behavior known for both individuals up to April 2023,
the information provided in the text is consistent with their actions in the past.
Prompt:
Is this news true or false? Why? This news including an image is associated with the following news caption: "
“Brad Pitt & Angelina Jolie Gave $8 Mil to Charity News
cameras have captured Brad Pitt building homes in New
Orleans, and now federal tax filings for the Jolie Pitt
Foundation have documented the good works that Brad
and Angelina Jolie are doing behind the scenes."
All told, they gave more than $8 million in donations in
2006, with Pitt putting in $4,402,317, and Jolie donating
$4,123,613, the Hungton Post reports, citing newly
released federal tax filings. (These latest figures don’t
include 2007 – in which, among other charitable eorts,
Brad’s Make It Right Foundation raised $5 million.)"
In terms of dispensing their funds, according to records for 2006, the foundation handed out $2,367,935, including $1
million to both Doctors Without Borders (International Division) and to the Global Action for Children."
Additional gifts included $137,935 to the Namibia Red Cross Action Program (their daughter Shiloh was born in the
African country); $100,000 to the Daniel Pearl Foundation (the late journalist’s widow, Marianne Pearl, was played by
Jolie in A Mighty Heart); $100,000 to Global Green U.S.A. and to the Epidermolysis Bulloma Medical Research
Foundation."
In addition, $20,000 went to an art park in Los Angeles, as did separate $5,000 donations to two orphanages that Jolie
had visited."
More Good Works"
In terms of Pitt’s work in Louisiana, Jolie’s costar in the movie Wanted, Morgan Freeman, told PEOPLE Thursday at the
Pathfinders to Peace Forum and Gala, “We were in contact with Brad about the situation in New Orleans” – and he
called Jolie “such a wonderful woman.”"
Said Freeman: “Aside from all that physical beauty, she’s a real person.”"
Like this story? Click on the Yahoo Buzz! button below to Buzz it up!"
Figure 16: Qualitative results on celebrity knowledge inquiry. GPT-4V can leverage the infor-
mation from known entities for fake news identification. The components that are related to
celebrity knowledge inquiry are highlighted. Refer to Section 2.3 for detailed discussions.
23
Source Credibility Profiling
Prompt:
Is this news true or false? Why? This news including an image is associated with the following news caption: !
Serena Williams and fiance Alexis Ohanian introduce daughter
Alexis Olympia Tennis star Serena Williams and fiance Alexis
Ohanian have ocially introduced their daughter, Alexis
Olympia Ohanian Jr.!
ABC News' Katie Kindelan contributed to this report. Meet
Alexis Olympia Ohanian Jr. You have to check out link in bio for
her amazing journey. Also check out my IG stories ???????? A
post shared by Serena Williams (@serenawilliams) on Sep 13,
2017 at 6:39am PDT”
GPT-4V:
The text provided discusses Serena Williams and Alexis Ohanian introducing their daughter, Alexis Olympia Ohanian Jr.
To assess whether this is likely true or fake news, we would consider the following:!
1. Public figures like Serena Williams and Alexis Ohanian would likely announce significant life events, such as the birth
of a child, on their ocial social media accounts. The detailed account of this event on Instagram and YouTube fits the
pattern of how celebrities share personal news.!
2. The description mentions specific details that would be verifiable, such as the child's Instagram handle and the
weight at birth, suggesting it may be drawn from a press release or a direct quote, which are usually reliable.!
3. The text references a source, ABC News, and a journalist by name, Katie Kindelan. This attribution gives more
credibility to the text, as it suggests that the information could be verified through an established news outlet.!
Based on these points and without contrary evidence, I would lean towards the information being accurate.
Prompt:
Is this news true or false? Why? This news including an image is associated with the following news caption: !
Robert Pattinson And Kristen Stewart Are Not Getting Back
Together “Twilight” fans have been eagerly awaiting for the day
that stars Robert Pattinson and Kristen Stewart would get back
together following their high profile break up.!
While fans of Pattinson and Stewart are still hoping the pair will
get back together, it looks like the exes have their sights set on
other people. Photo: Jason Merritt/Getty Images”
……
……
GPT-4V:
Based on the text provided, the article claims that Robert Pattinson and Kristen Stewart are not getting back together,
and this information is allegedly confirmed by a source that spoke to Gossip Cop, which is known for fact-checking
rumors in the entertainment industry. Given this context, and without the ability to confirm current events beyond my
last update, if I were to provide a confidence score, it would be more inclined to suggest that the rumor of them getting
back together is not true, based on the pattern of such rumors typically being unsubstantiated.!
Figure 17: Qualitative results on source credibility profiling. GPT-4V can leverage the “pro-
file” of the information source for fake news identification. The components that are related to
source credibility profiling are highlighted. Refer to Section 2.3 for detailed discussions.
24
Cross-Referencing from Broad Information Spectrum (1)
GPT-4V:
Firstly, the claim that Georgia has become the first state to ban "Muslim culture" to restore "Western values" is a
significant red flag. It is highly unlikely that any US state would or could ban a culture, as this would violate the
principles of freedom of religion and expression enshrined in the United States Constitution. Moreover, such actions
would likely be deemed unconstitutional and discriminatory against a particular religious group.!
Secondly, the reference to a Georgia state legislator introducing legislation to ban burqas and veils is a more specific
claim that can be fact-checked. While there have been various attempts by individuals in state legislatures to introduce
bills that could aect the wearing of religious garments, to my knowledge, none have successfully enacted a complete
ban on such garments or "Muslim culture" broadly by April 2023. These types of legislative eorts are typically met with
significant legal and public opposition.!
Furthermore, the text includes inflammatory language and broad generalizations about Islam, which are indicative of
biased or fake news. Reputable news sources adhere to journalistic standards that require balanced reporting and
typically avoid sweeping generalizations about entire religions or cultures.!
Lastly, the tone and language used in the text ("ban on Muslims entering the United States," "Islam is very dangerous")
are consistent with sensationalist and biased reporting, often found in disinformation campaigns or fake news articles.!
Based on these considerations, it is likely that the text you provided is not from a credible news source and is potentially
spreading misinformation or disinformation. It's important to cross-reference any such claims with reputable news
sources and ocial statements before accepting them as true.!
Prompt:
Is this news true or false? Why? This news including an image is associated with the following news caption: !
“SPECIAL REPORT: GEORGIA BECOMES FIRST STATE TO
BAN MUSLIM CULTURE IN HISTORIC MOVE TO RESTORE
WESTERN VALUES! Share tweet:!
When Donald Trump proposed in 2015 that we place a
temporary ban on Muslims entering the United States, it
was at a time when Europe was experiencing historical
levels of migrations and terrorism. ISIS was carrying out
terrorist attacks weekly and hundreds of innocent people
were getting killed.Whether it was a jihadi running over 85
people with a truck, the Paris bombings, or even the German priest that was beheaded for being a Christian–they all
were fueled by Islam. People need to understand that Islam is very dangerous and contradicts everything we believe in
the United States.!
When Donald Trump proposed in 2015 that we place a temporary ban on Muslims entering the United States, it was at
a time when Europe was experiencing historical levels of migrations and terrorism. ISIS was carrying out terrorist
attacks weekly and hundreds of innocent people were getting killed.Whether it was a jihadi running over 85 people
with a truck, the Paris bombings, or even the German priest that was beheaded for being a Christian–they all were
fueled by Islam. People need to understand that Islam is very dangerous and contradicts everything we believe in the
United States!
A Georgia state legislator has introduced legislation that would ban burqas and traditional Muslim veils.!
State Rep. Jason Spencer (R) filed legislation on Wednesday that would prevent women from posing for driver’s
license photos while wearing a veil. The measure also adds burqas to a law that makes it a misdemeanor to wear a
face covering in public or while driving.!
Spencer argues that this bill is specifically intended to reduce fear about Islamic terrorism. Spencer’s bill would
prevent women from wearing head coverings while driving or while walking down a public sidewalk.”
Figure 18: Qualitative results on cross-referencing from broad information spectrum. GPT-4V
primarily leverages various related data for fake news identification. The components that are related
to cross-referencing are highlighted. Refer to Section 2.3 for detailed discussions.
25
GPT-4V:
The text presented is a political news article that appears to be reporting on former House Minority Leader Nancy
Pelosi's statements regarding minimum wage in the United States. This text refers to several factual and historical
details which could potentially be verified, such as Pelosi's position on wage increases, the stock market performance,
legislative proposals, and past actions by Congress and the President.!
Here are some key points that could be cross-checked against reliable sources up to April 2023 to determine the
veracity of this article:!
1. Nancy Pelosi's Position: Nancy Pelosi has been a consistent advocate for increasing the minimum wage, so this
statement aligns with her known policy positions.!
2. Historical Stock Market Performance: The specific record-high levels of the stock market mentioned (Dow Jones
topping 14,200) can be checked against historical stock market data.!
3. Legislation Proposed: The legislation by Senator Harkin and Representative George Miller to raise the minimum wage
could be verified through congressional records.!
4. President Obama's Position: President Obama's advocacy for a $9 minimum wage was indeed a part of his policy
platform, and his State of the Union address would reflect this.!
5. Minimum Wage History: Congress did pass a minimum wage hike in 2007, which was signed into law by President
George W. Bush, as part of a larger bill.!
6. Economic Context: Claims about wage stagnation, the unemployment rate, and corporate profits could be cross-
checked with economic data from reputable sources like the Bureau of Labor Statistics or economic research institutes.!
7.Business Lobby's Opposition: It is common for business groups to oppose minimum wage hikes due to concerns
about their impact on hiring and profitability, so this claim would be consistent with typical lobbying patterns.!
Given the consistency of the statements with Pelosi's known positions and historical context, and assuming no
contradictory evidence has emerged since my last update in April 2023, it is plausible that the news article is accurate.
Prompt:
Is this news true or false? Why? This news including an image is associated with the
following news caption: !
“Pelosi urges minimum-wage hike House Minority Leader Nancy Pelosi on Thursday
urged Congress to hike the hourly minimum wage dramatically. Noting that Wall
Street this week is trading at record-high levels, the California Democrat said those
gains have done nothing to benefit middle-class workers, and called on Congress to
close the gap. “This week, we saw something quite remarkable, the stock market
soaring to record heights. At the same time, we see productivity keeping pace,"
Pelosi told reporters in the Capitol. "But we don't see income for America's middle
class rising. In fact, it's been about the same as since the end of the Clinton years."!
ADVERTISEMENT!
Pelosi said the negative eects of wage stagnation on the middle class have been compounded by the bursting of the
housing bubble and the recession that followed. She urged Congress to take up legislation — sponsored by Sen.(D-
Iowa) and Rep. George Miller (D-Calif.) — to hike the minimum wage from $7.25 to $10.10 over three years, while
indexing future increases to inflation."If we are going to honor our commitment to the middle class," she said, "we have
to reflect that intention in our public policy."The Democrats' bill goes much further than President Obama advocated in
his February State of the Union address, in which he urged an increase to $9 per hour. Harkin said recently that Obama
"missed the mark" with the lower figure.Pelosi's comments came in a week when the Dow Jones Industrial Average
topped 14,200 for the first time in history even as wages, as a percentage of the economy, are at an historic low (43.5
percent of GDP last year) and the nation's unemployment rate has hung stubbornly near 8 percent for roughly six
months.Economists say several factors can explain why corporate profits are not trickling down to benefit the working
class, including the ever-rising productivity of the nation's workforce and a reluctance among companies to hire in a still-
volatile economy.Congress last approved a minimum-wage hike in 2007 as a rider to a must-pass bill providing funds to
the troops in the Iraq War. The wage hike was conditioned on the inclusion of $5 billion in business tax breaks. The
package was signed into law by then-President George W. Bush.With the business lobby warning that a minimum-wage
hike wold cripple hiring amid a jobs crisis, the Harkin-Miller bill has little chance of moving through the GOP-controlled
House. But Pelosi and the Democrats are hoping the combination of soaring Wall Street gains and middle-class wage
stagnation will resonate with voters."When we increased it in 2007 ... it was the first time it had been increased in 11
years," Pelosi said. "It's time for it to be increased again.""
Cross-Referencing from Broad Information Spectrum (2)
Figure 19: Qualitative results on cross-referencing from broad information spectrum. GPT-4V
primarily leverages various related data for fake news identification. The components that are related
to cross-referencing are highlighted. Refer to Section 2.3 for detailed discussions.
26
2.4 Demographic Inference
2.4.1 Task Setting and Preliminary Quantitative Results
Multimodal gender identification seeks to determine an individual’s gender through various modalities
such as text, images, and videos linked to them [
90
,
89
,
101
,
77
]. Social media provides these
multimodal resources for profiling each user. LMMs have come forward as zero-shot predictors
for gender identification using social media content. LMMs integrate complementary features for
a robust estimate and have enhanced text comprehension capabilities. As shown in Figures 20-23,
whether it is posts explicitly displaying gender or those requiring inference of implied gender, LMMs
predict gender accurately. To assess GPT-4V in gender inference, we utilize the
PAN18
[
69
] user
profile dataset, which includes images and conversational text shared by Twitter users on social
media. We select paired image-text examples and use the prompt: This image is associated with the
following caption: ‘{caption}’. Is the user likely to be male or female? as input for GPT-4.
To conduct quantitative experiments, our research utilizes
PAN18
[
69
], curated in Arabic, English,
and Spanish, involving conversational texts paired with user-shared images. This data is divided
into two gender categories: female and male. For each of the three language-specific datasets, we
randomly select 500 samples from
PAN18
. In the preliminary evaluation, GPT-4V achieves accuracy
of 70.0%, 78.8%, and 76.2% in Arabic, Spanish, and English, respectively.
2.4.2 Gender Clue Analysis in Textual Narratives
Due to the inherent limitations and characteristics of language, a collection of texts in different
languages may convey almost identical contexts globally but can have slight discrepancies or ambigu-
ities locally. This issue becomes critical when designing predictors requiring binary decisions, like
gender identification, where nuanced features are essential for accurate estimation. It is important to
clarify that this is not solely a design flaw of language models but also a limitation of cross-linguistic
understanding. For example, in Figure 20, gender identification is straightforward in English and
Chinese Twitter posts where personal pronouns have gender implications, as in the English post
My love is mine, he enchanted me. In contrast, in Turkish and Arabic, GPT-4V is not able to
determine gender because pronouns like o in the Turkish post Sevgilim benim, o beni büyüledi
are gender-neutral, similar to the singular they in English, which can refer to a person of any gender
without revealing it. As shown in Figure 21, GPT-4V can accurately predict a user’s gender from
posts in English, Spanish, and Arabic. However, in a Chinese context, GPT-4V may incorrectly
assume that a post about parental leave a topic whose expression and common understanding can
vary due to differing national policies is exclusively by female authors.
2.4.3 Ambiguous Signal Interpretation for Gender Prediction
Gender identification is a traditional prediction task in social media, where traditional methods
struggle to effectively leverage the multimodal resources available in this domain. Supported by
well-trained LMMs, such as GPT-4V, it is expected to jointly learn and integrate knowledge from
both image and text interactions. To test this, we conduct a qualitative ablation study on GPT-4V with
and without multi-modal input in the gender identification task. This involves examining whether
images only, text only, or a combination of both are used as inputs for GPT-4V. Our findings, as
illustrated in Figure 22, using text alone without images leads to errors, but the accuracy improves
when both are used. However, it is important to note that all these users could actually be of the
opposite gender, and the addition of image and text inputs might even strengthen GPT-4V’s biases
towards these groups.
2.4.4 Acknowledgment for Diversity and Complexity of Gender Identity
The cultural identities of diverse groups and the evolving perceptions of gender across time and space
have led to the diversity and complexity of gender. Even language models trained on vast Internet data,
like GPT-4, struggle to deeply understand and analyze such diverse user genders from multimodal
social media. A typical example is in Figure 23, a boy whose true gender is male was dressed like a
female. Considering different cultural contexts, gender-specific traits vary. Furthermore, in Figure 23,
makeup items are not exclusively female. In these examples, GPT-4V is able to acknowledge the
diversity and complexity of gender identity.
27
Prompt (English):
This image is associated with the following caption: “My
love is mine, he enchanted me ❤❤.” Is the user likely to
be male or female?
Gender Clue Analysis in Textual Narratives (1)
GPT-4V:
The caption "My love is mine, he enchanted me ❤❤,"
suggests that the user is expressing aection for someone
who has enchanted them and uses the pronoun "he" to
refer to their love. Based on traditional gender roles and
the use of language, it could be assumed that the user is
likely female, expressing aection for a male. However, it's
important to note that this is not definitive; individuals of
any gender can use such language to refer to their loved
ones.
Prompt (Turkish):
This image is associated with the following caption:
“Sevgilim benim, o beni büyüledi” Is the user likely to be
male or female?
Prompt (Chinese):
This image is associated with the following caption: “我的
爱⼈属于我, 他让我着迷 ❤❤.” Is the user likely to be
male or female?
Prompt (Arabic):
This image is associated with the following caption: “!"#$"%
!$&'() *$+ !$,-./ *$+❤❤.” Is the user likely to be male or
female?
GPT-4V:
The caption "Sevgilim benim, o beni büyüledi" is in Turkish
and translates to "My lover, he/she enchanted me" in
English. The phrase does not indicate the gender of the
speaker. It's a general expression of aection someone
could have for their partner. The gender of the user cannot
be determined from this caption alone. As for the image, it
shows two Starbucks tumblers, one matte black and the
other with a glittery gold finish. These are just products and
do not provide any information about the user's gender.
GPT-4V:
The caption "!$&'() *$+ !$,-./ *$+ !"#$"%❤❤," can be
translated to "My beloved is the one who owns me, the
one who enchanted me," with a romantic connotation.
Such a statement is typically indicative of aectionate
feelings towards someone else and does not strongly
suggest the gender of the person writing it. In many
cultures, expressions of love and aection are not
exclusive to any particular gender; both men and women
may use such language to describe their feelings.
Therefore, based on this caption alone, it is not possible to
accurately determine the gender of the user.
GPT-4V:
The caption "我的爱⼈属于我, 他让我着迷 ❤❤"
translates from Chinese to "My lover belongs to me, he
fascinates me ❤❤." The use of "" (he) indicates that
the speaker's lover is male. As for the gender of the user,
while it is not possible to determine with certainty without
additional context, the phrase could suggest that the user
is female, given that they are referring to a male lover.
However, it's important to remember that this is not
definitive, as the user could be of any gender and the
statement could be made by anyone regardless of their
own gender.
Figure 20: Qualitative results on gender clue analysis in textual narratives. GPT-4V tries to
use gender clues in textual narratives for gender inference. The components that are related to
gender clue analysis are highlighted. Refer to Section 2.4 for detailed discussions.
28
Prompt (English):
This image from this Twitter post is associated with the
following caption: “Taking a moment from paternity leave
to witness the vastness of the world. #PaternityLeave” Is
the user likely to be male or female?
GPT-4V:
The image you've provided shows the view from an
airplane window, overlooking clouds, which is consistent
with the caption about witnessing the vastness of the
world. The association with the hashtag #PaternityLeave
suggests the user is likely male, as "paternity leave" is
specifically the time o that a father takes after the birth of
his child to care for and bond with the newborn. This term
is conventionally used to refer to fathers, whereas
"maternity leave" is used for mothers. The use of this term
implies that the person taking the leave is male and has
recently become a father or is sharing the experience of a
male partner.
Prompt (Spanish):
This image from this Twitter post is associated with the
following caption: “Tomando un momento de mi baja por
paternidad para testificar la inmensidad del mundo.
BajaPorPaternidad” Is the user likely to be male or female?
Prompt (Chinese):
This image from this Twitter post is associated with the
following caption: “在产假中抽出⼀刻时间,感受世界的辽
阔。#产假” Is the user likely to be male or female?
Prompt (Arabic):
This image from this Twitter post is associated with the
following caption: “.!"#$"% &#'(% )*+, -./,% -0#12 34 567" 89:
-./,%_-0#12#” Is the user likely to be male or female?
GPT-4V:
The caption in Spanish, "Tomando un momento de mi baja
por paternidad para testificar la inmensidad del mundo.
#BajaPorPaternidad," suggests that the user is likely male.
The term "baja por paternidad" directly translates to
"paternity leave," which is the period of leave that a father
is entitled to take from work to care for his newborn child.
This term is customarily used for fathers, as opposed to
"baja por maternidad," which is used for mothers.
Therefore, based on the terminology used, one would
assume the user is male. The accompanying image of the
view from an airplane window aligns with the sentiment of
experiencing the grandeur of the world, a moment of
reflection that the user appears to be sharing during his
paternity leave.
GPT-4V:
The caption in Arabic translates to "Taking a moment from
paternity leave to witness the expansiveness of the world.
#Paternity_Leave". The use of the term "paternity leave"
specifically refers to the time a father takes o from work
after the birth of his child, which suggests that the user is
likely to be male. The hashtag #-./,%_-0#12 also reinforces
this as it translates to "#Paternity_Leave" in English.
Paternity leave is typically granted to new fathers, which
would indicate that the person posting is talking about their
own experience as a father.
GPT-4V:
The image associated with the caption "在产假中抽出⼀刻
时间,感受世界的辽阔。#产假" translates to "Taking a
moment during maternity leave to feel the vastness of the
world. #MaternityLeave" suggests that the user is likely to
be female. This is because "产假" refers to maternity leave,
which is typically granted to women who are recovering
from childbirth. The context of the caption indicates that
the person is on maternity leave, which is a period of
absence from work granted to a mother before and after
giving birth.
Gender Clue Analysis in Textual Narratives (2)
Figure 21: Qualitative results on gender clue analysis in textual narratives. GPT-4V tries to
use gender clues in textual narratives for gender inference. The components that are related to
gender clue analysis are highlighted. Refer to Section 2.4 for detailed discussions.
29
Ambiguous Signal Interpretation for Gender Prediction
GPT-4V:
Based on the content of the tweet, it is dicult to
definitively determine the gender of the user. The tweet is
related to an esports victory, and the user is expressing
gratitude and excitement towards their teammates and
supporters. The use of hashtags like #ChampionSpeech
and #Esports suggests that the user is likely an esports
enthusiast or participant, which could be either male or
female. However, without more context or information
about the user's profile, it is not possible to confidently
determine the gender of the user.
Prompt (Only Image):
What is likely the gender of the Twitter author of this
image?
GPT-4V:
The Twitter author of this image is likely a woman, as the
image features a group of women celebrating and holding
a trophy.
Prompt (Text + Image):
This image from this Twitter post is associated with the
following caption: “We're champion! 🏆 Honored to stand
at the pinnacle of the world with these incredible
teammates. This victory isn't just mine, it's a win for
everyone who has supported us! Thank you for all the
support and trust. We are the champions! 🎮💪
#ChampionSpeech #Esports #DreamsComeTrue” Is the
user likely to be male or female?
GPT-4V:
The user is likely to be female. The reason for this
conclusion is based on the image content, which shows a
group of three women standing together, holding a trophy,
and celebrating their victory. The context of the image,
including the tweet caption, suggests that they are a team
of female gamers who have achieved a significant
accomplishment in their esports competition. The presence
of the trophy and the celebratory atmosphere further
supports the idea that these women are part of a female
esports team.
Prompt (Only Text):
The text of this Twitter post is:"
“Immersed in the raw intensity of
this epic war drama. Every
frame, a painting; every scene, a
burst of adrenaline. Proud to be
a fan of a genre that honors the
brave and tells tales of
resilience. This is cinematic art
at its finest. 🎬🪖 #WarMovies
#CinematicMasterpiece #FilmBu” Is the user likely to be
male or female?
GPT-4V:
It is not possible to definitively determine the gender of the
user based on the content of the tweet alone. The tweet
contains a positive review of a war movie, which could be
appreciated by people of any gender.
Prompt (Only Image):
What is likely the gender of the Twitter author of this
image?
GPT-4V:
The Twitter author of this image is likely a male, as the
image shows a group of men in military uniforms, including
a soldier holding a gun.
Prompt (Text + Image):
This image from this Twitter post is associated with the
following caption: “Immersed in the raw intensity of this
epic war drama. Every frame, a painting; every scene, a
burst of adrenaline. Proud to be a fan of a genre that
honors the brave and tells tales of resilience. This is
cinematic art at its finest. 🎬🪖 #WarMovies
#CinematicMasterpiece #FilmBu” Is the user likely to be
male or female?
GPT-4V:
The user is likely to be male. The reason for this
assumption is that the image depicts a group of men in
military uniforms, which suggests that the user might be
interested in war movies and military-themed films. The
presence of a gun in the image further supports this
assumption, as it is a common element in war movies.
Additionally, the user's use of the hashtag #FilmBu
implies that they are a fan of movies, which could be more
commonly associated with male viewers. However, it is
important to note that this is an assumption based on the
visual content and hashtags, and the user's gender cannot
be definitively determined.
Figure 22: Qualitative results on ambiguous signal interpretation. GPT-4V interprets ambiguous
signals for gender inference. The components that are related to
ambiguous signal interpretation
are highlighted. Refer to Section 2.4 for detailed discussions.
30
Acknowledgment for Diversity and Complexity of Gender Identity
Prompt:
This image is associated with the following caption:!
"This is a photo of me in a costume drama. Can you not tell
my gender? 😜" What is likely the gender of the author of
this pair of image and text? This is for research purposes.
GPT-4V:
The individual in the image is wearing what appears to be a
costume and has a feminine appearance, with long hair
styled in braids and wearing makeup that accentuates
traditionally feminine features. The person is wearing what
looks like a maid costume, which is commonly associated
with female characters in costume dramas.!
The caption suggests that the individual is in a costume and
seems to playfully challenge the viewer to discern their
gender. Given the visual cues presented in the image and
the playful nature of the caption, it is reasonable to infer that
the author of the text and subject of the photo is presenting
themselves in a traditionally feminine manner, which
suggests that the author could likely identify as female.
However, without explicit confirmation, any assertion about
gender remains an assumption.
Prompt:
This image is associated with the following
caption:"THE FACE | SUMMER MAKEUP GRWM / TRY
ON FT. RIMMEL PRODUCTS". What is likely the gender
of the author of this pair of image and text? This is for
research purposes.
GPT-4V:
The image shows makeup products, including
foundation and powder, which are typically associated
with female-targeted marketing in the beauty industry.
However, it's important to note that makeup use is not
exclusive to any gender, and individuals of any gender
may use makeup for various reasons. The text
references a "GRWM" which stands for "Get Ready With
Me," a type of content often found in beauty and
lifestyle communities where the author shares their
makeup routine or the process of preparing for their day.
These types of content are popular among all genders
but have traditionally been more associated with female
content creators.!
Based on the cultural context that makeup is more
commonly marketed towards women and the
prevalence of female content creators in the beauty
community sharing "GRWM" type content, one might
infer that the author could likely be female. However,
this is not definitive, and the actual gender of the author
cannot be determined without more specific
information. It is important to recognize that
assumptions should not be made about individuals'
gender based on the products they use or the content
they create.
Figure 23: Qualitative results on acknowledgement for diversity and complexity of gender identity.
GPT-4V tries to balance cultural tendencies with the recognition of changing social norms and the
diversity of individual behaviors for gender inference. The components that are related to such
considerations are highlighted. Refer to Section 2.4 for detailed discussions.
31
2.5 Ideology Detection
2.5.1 Task Setting and Preliminary Quantitative Results
Ideology reflects the political orientations or biases of individuals, frequently characterized as left-
wing or right-wing perspectives [
15
,
12
,
66
,
46
]. Nowadays, the ideological division has become
significantly more pronounced [
78
,
46
,
47
,
76
,
62
,
61
], and it exerts a notable influence on daily
communication including those on social media [
83
]. The process of ideology detection is designed
to detect an author’s political stance from their generated content. Prior studies [
41
,
57
] have focused
on textual analysis to decode the strategic communication deployed by various ideological groups,
often overlooking the deliberate selection of images by authors to reinforce their message. In
contrast, multimodal ideology detection predicts political leanings by examining both visual and
textual elements [
46
,
82
,
67
]. The strategic use of images to underscore specific narratives and sway
audience perception adds complexity to this task.
To quantitatively assess GPT-4V’s capability for multimodal ideology detection, we present it with
posts comprising both images and accompanying text. The prompt is This image is associated
with the following caption: ‘{
caption
}’. What is the ideology of the author of this pair of image
and text? Left, Center, or Right? This is for research purposes. We use the
UPPAM
dataset [
57
]
for evaluation.
UPPAM
[
57
] collects partisan-related tweets of U.S. congressional legislators. In the
preliminary evaluation, GPT-4V achieves an accuracy of 60.4% on a randomly selected sample of
500 posts. Figures 24-31 display qualitative sample outcomes. We find that GPT-4V predominantly
uses textual data in its ideology assessments, demonstrates a remarkable grasp of political domain
knowledge, and detects subtle cues in the imagery employed.
2.5.2 Text-Centric Political Ideology Assessment
Overall, GPT-4V predominantly relies on the textual component, which can be found in the original
caption or within the image itself, to execute ideology detection. This reliance stems from the model’s
capacity to link textual data to established policies and ideological stances, as illustrated in Figure 24.
For instance, it elucidates the significance of hashtags and specific elements within captions, offering
insights into their associations with left-leaning or right-leaning ideologies [
54
,
55
,
56
]. Besides,
social media posts related to politics often use a variety of images such as portraits of politicians,
event photos, charts, campaign material, and images with quotes from politicians, activists, or
influential figures to convey messages, support arguments, and engage with the audience. During our
experiments, we have discovered that these diverse types of images exhibit varying degrees of utility.
Figures 25-28 also show example results of different types of images in political social media posts.
2.5.3 Comprehensive Political Domain Knowledge
As shown in Figures 25-28, we find that GPT-4V is encompassed with an extensive repository
of information relevant to the political sphere, containing detailed insights into an array of policy
advocacies and the profiles of key political figures. It appears that this political domain knowledge
includes, but is not limited to, historical and contemporary data on legislative processes, electoral
systems, government structures, and the various ideological frameworks that underpin political
thought and action [
75
,
80
]. Relying on the recognition of historical traces of different ideologies on
specific policies or events, e.g., Obamacare and Fair Pay Act in Figure 25, GPT-4V can deduce the
underlying political ideologies from tweets that discuss the same or related topics. Furthermore, GPT-
4V demonstrates a sophisticated ability to contextualize data-driven visual content—like charts and
maps—into its ideology classification processes, as evidenced in Figure 28. This joint understanding
of image and text reaffirms the earlier articulated insights, showcasing GPT-4V’s adeptness at
navigating the multimodal dimensions of political analysis.
2.5.4 Ideological Deductions from Visual Subtleties
As the framing theory presents [
5
,
19
,
56
], ideological expressions are often subtle and subject to
interpretation. They may not be explicitly stated but rather implied through certain visual symbols.
While our discussions thus far indicate GPT-4V’s primary reliance on text for analysis, the extent of
its employment of visual data remains an area of interest. To probe further, we challenge the model
with an additional prompt in cases where initial responses were text-centric. We ask GPT-4V, What
does the visual content contribute to your answer? as illustrated in Figures 29, 30, and 31. We
32
observe that GPT-4V is capable of discerning the authors’ intent in emphasizing particular subjects in
their visuals, namely, Black individuals and refugee children in Figures 29 and 31, used to subtly
shape the audience’s perception [
65
]. In essence, GPT-4V effectively identifies and understands the
strategic framing employed in visual content. This inquiry reveals GPT-4V’s capability to incorporate
subtle visual indicators, thus enriching its interpretation of ideological stances.
During the evaluation process, we discovered that the current settings of ideology detection may
not adequately reflect the capability of GPT-4V. Typically, when presented with tweets expressing
humanitarian concern or advocating for health measures, GPT-4V claims that this kind of opinion
is universal across the political spectrum and thus it is not feasible to determine the associated
ideology. This highlights that using the author-level labels as post-level labels can introduce noise in
the evaluation of ideology detection. Therefore, we contend that, for a more fair assessment of visual
language models’ abilities in multimodal ideology detection, more caution should be taken when
constructing datasets. On one hand, filtering out samples exhibiting shared viewpoints can serve as a
simple solution. On the other hand, we can reformulate the tasks by aggregating multiple text-image
pairs to provide the models with more comprehensive information.
33
Text-centric Political Ideology Assessment
Prompt:
This image is associated with the following caption:
"Talked with Colusa High students about #DREAMers,
#CIR, #Ag & bipartisanship. HTTPURL #Latism #NorCal".
What is the ideology of the author of this pair of image
and text? Left, Center or Right? This is for research
purposes.
GPT-4V:
The image depicts an adult, likely a speaker or visitor, in
the center surrounded by a large group of high school
students in a classroom or library setting. The context
provided by the text mentions discussing topics with
students including "#DREAMers" and "#CIR" (which likely
refers to Comprehensive Immigration Reform), as well as
agriculture ("#Ag") and bipartisanship.!
The mention of "DREAMers" refers to individuals brought
to the United States as children without legal permission,
who are the subject of the DREAM Act legislation aimed
at providing them with a path to legal status. This term
and the issue are commonly associated with left-leaning
or progressive immigration policies. Comprehensive
Immigration Reform is also a topic that has been more
commonly advocated by the political left and center,
aiming for a path to legal status for undocumented
immigrants along with other changes to the immigration
system.!
The use of the hashtag "#Latism" suggests a focus on
Latino issues and the mention of bipartisanship implies an
interest in working across political divides to achieve
legislative goals.!
Based on the discussion topics, the author of the text
likely leans towards the center or left ideologically,
particularly given the progressive stance on immigration
reform and attention to issues important to the Latino
community. The push for bipartisanship may also suggest
a centrist approach, seeking common ground between
dierent political ideologies.
GPT-4V:
The image features a cargo ship with text overlay that
reads "DO YOU SUPPORT GROWTH JOBS TRADE WE
DO. #WRRDA". The hashtag "#WRRDA" refers to the
Water Resources Reform and Development Act, which is
a U.S. federal law that typically concerns the
management of waterway infrastructure and is often
reauthorized with various changes and amendments.!
The caption speaks positively about the water
infrastructure bill and notes its importance for jobs within
the agriculture community. Support for infrastructure and
job creation are typically bipartisan issues, and while
dierent ideologies might have varying approaches to
how these goals should be achieved, the information
provided here is not explicitly aligned with a particular
point on the political spectrum.!
Therefore, without additional context regarding the stance
on how these infrastructure projects should be funded,
managed, or what specific provisions are included in the
bill, it's not possible to determine the ideology of the
author solely from this image and caption.
Prompt:
This image is associated with the following caption: "The
#Agriculture community understands that water
infrastructure bill #WRRDA is cruicial for jobs -@USER
#Ag". What is the ideology of the author of this pair of
image and text? Left, Center or Right? This is for research
purposes.
Figure 24: Qualitative results on text-centric assessment. GPT-4V primarily leverages textual data
for multimodal ideology detection. The components that are related to
text-centric assessment
are
highlighted. Refer to Section 2.5 for detailed discussions.
34
Prompt:
This image is associated with the following caption: "Proud
to have worked w/ Lilly #Ledbetter on #paycheckfairness &
ensuring equal pay for equal work". What is the ideology of
the author of this pair of image and text? Left, Center or
Right? This is for research purposes.
GPT-4V:
The image shows two individuals seated in what appears
to be a discussion setting, perhaps for a television
interview or discussion panel. Without specific knowledge
of the individuals or the context beyond the caption
provided, one can infer that the author of the caption is
advocating for paycheck fairness and equal pay for equal
work, which are typically associated with left-leaning or
progressive ideologies that prioritize social equality and
workers' rights. !
However, advocacy for equal pay has at times seen
bipartisan support and can be championed by individuals
across the political spectrum. In the context of American
politics, the term "paycheck fairness" and the mention of
Lilly Ledbetter connect to the Lilly Ledbetter Fair Pay Act,
which was the first piece of legislation signed by
President Barack Obama in 2009. The act is aimed at
addressing pay discrimination. The support of this act
and the associated ideas of equal pay for equal work can
suggest a leftist or center-left position, as these issues are
more commonly emphasized by those political ideologies.
However, without more context or specific policy
positions, assigning a definitive point on the ideological
spectrum to the author could be overly simplistic or
inaccurate.
Prompt:
This image is associated with the following caption: "As
the grandson of #Basque immigrants, I was proud to
welcome President Urkullu to the U.S.: HTTPURL". What is
the ideology of the author of this pair of image and text?
Left, Center or Right? This is for research purposes.
Comprehensive Political Domain Knowledge (1)
GPT-4V:
The image depicts three individuals, two men and one
woman, in what appears to be an oce or formal setting.
The man in the center, who is presumed to be President
Urkullu based on the caption, is holding a document and
showing it to the other man, while the woman observes.
There are no explicit ideological markers in the image
itself.!
The caption mentions pride in welcoming President
Urkullu to the U.S. and references the author's Basque
immigrant heritage. President Urkullu is likely Iñigo
Urkullu, the Lehendakari (President) of the Basque
Country in Spain. The text does not provide direct
information about the author's political ideology. It is a
statement of cultural pride and diplomatic hospitality
without explicit partisan or ideological cues. If the
author's political ideology were to be inferred from this
statement alone, it would not be possible without
additional context or knowledge about the author's
political actions or aliations.
Prompt:
This image is associated with the following caption:
"Would you have voted to fund Obamacare? John
Cornyn did.". What is the ideology of the author of this
pair of image and text? Left, Center or Right? This is for
research purposes.
GPT-4V:
The image contains text critical of John Cornyn for voting
to fund Obamacare, with the implication that this action is
negative. The term "Obamacare" is the colloquial name
for the Aordable Care Act (ACA), a health care reform
law enacted during the presidency of Barack Obama.!
The critical stance on funding the ACA, along with the use
of the individual's name as a political statement, suggests
that the author of the image likely aligns with right-wing or
conservative ideology, which has generally been opposed
to the ACA. The message is designed to appeal to voters
who are against the ACA, a common position in
conservative circles.
Figure 25: Qualitative results on comprehensive political domain knowledge. In assessing a range of
images in the political domains, GPT-4V demonstrates its extensive knowledge of political contexts,
enabling it to accurately classify the ideological leanings of the content creators. The components
that are related to
political domain knowledge
are highlighted. Refer to Section 2.5 for detailed
discussions.
35
Prompt:
This image is associated with the following caption: "RT
@USER If the #IRS targets Americans based on political
beliefs what will this mean under #Obamacare?". What is
the ideology of the author of this pair of image and text?
Left, Center or Right? This is for research purposes.
Prompt:
This image is associated with the following caption:
"Under #ObamaCare in #TN04, potential premium
increases from49-54% in individual market & 35% in
small group market". What is the ideology of the author of
this pair of image and text? Left, Center or Right? This is
for research purposes.
GPT-4V:
The image displays a complex flowchart with a headline
"Your New Health Care System" and a zoomed-in section
highlighting the IRS (Internal Revenue Service). There are
two questions presented: "IRS targets Americans based
on their political beliefs?" and "What will this mean under
ObamaCare?" The use of "#IRS" and "#Obamacare" in
the accompanying text suggests a discussion about the
Aordable Care Act, commonly known as Obamacare."
The tone of the caption and the emphasis on concerns
about the IRS targeting individuals based on political
beliefs, combined with a question about implications
under the Aordable Care Act, could suggest a critical
stance toward the IRS's actions and possibly the
Aordable Care Act itself. This type of critique has been
more commonly associated with conservative or right-
leaning viewpoints, especially given the historical context
where there were allegations that the IRS had improperly
targeted political groups."
Therefore, the caption, coupled with the tone of the
image, which seems to cast the health care system in a
complicated and perhaps negative light, would suggest
the author may lean towards a right-wing ideological
perspective. However, without additional context or
understanding the author's broader views, this
assessment is based solely on the content presented in
this specific post.
GPT-4V:
The image contains a headline stating, "How much could
Obamacare cost YOU?" followed by details on potential
premium increases in various states due to the Aordable
Care Act (ACA), commonly referred to as Obamacare. "
The caption specifically mentions "#ObamaCare" and
provides information about potential premium increases in
Tennessee's fourth congressional district, suggesting
significant cost impacts to individuals and small
businesses."
The use of capitalized "YOU" and the focus on the
potential negative financial impact of the ACA are
indicative of a critical view of the legislation. Such a
perspective is typically associated with a conservative or
right-wing ideology, which historically has opposed the
ACA on the grounds of its potential to raise insurance
premiums and its mandate for individuals to purchase
insurance. The text and imagery convey concerns about
cost, which aligns with right-leaning arguments that often
emphasize economic burdens and personal financial
impact."
Determining the exact ideological stance of the author,
based on a single image and caption, is not definitive
without more context. However, based on the content
presented, the author appears to be expressing a
viewpoint that is critical of the ACA, suggesting a right-
leaning or conservative position.
Comprehensive Political Domain Knowledge (2)
Figure 26: Qualitative results on comprehensive political domain knowledge. In assessing a range of
images in the political domains, GPT-4V demonstrates its extensive knowledge of political contexts,
enabling it to accurately classify the ideological leanings of the content creators. The components
that are related to
political domain knowledge
are highlighted. Refer to Section 2.5 for detailed
discussions.
36
Comprehensive Political Domain Knowledge (3)
Prompt:
This image is associated with the following caption: "We
need a fair trade deal - not "free trade" of #TPP which
undermines #BuyAmerica policies.". What is the ideology
of the author of this pair of image and text? Left, Center
or Right? This is for research purposes.
GPT-4V:
The image shows a screenshot from C-SPAN featuring
Rep. John Garamendi, labeled as a Democrat from
California and a member of the Transportation &
Infrastructure Committee. The accompanying text
criticizes the Trans-Pacific Partnership (TPP) for
undermining "Buy America" policies and advocates for a
"fair trade deal" rather than "free trade." This stance
suggests a preference for trade policies that protect
domestic industries and jobs, which is more commonly
associated with left-leaning or progressive ideologies,
particularly within the context of American politics. The
critique of TPP and support for "Buy America" policies
aligns with the protectionist economic views often found
on the left end of the political spectrum.
Prompt:
This image is associated with the following caption:
"Cornyn voted twice in 2013 to fund Obamacare. The
only time Cornyn has ever stood up to Obama is to shake
his hand.". What is the ideology of the author of this pair
of image and text? Left, Center or Right? This is for
research purposes.
GPT-4V:
The image depicts two politicians shaking hands, one of
whom is former President Barack Obama. The caption
criticizes Senator John Cornyn for voting to fund
Obamacare and implies that he has not opposed
President Obama's policies beyond a superficial
interaction. This critique of Senator Cornyn for not
suciently opposing Obamacare, coupled with the
dismissive remark about merely shaking hands with
President Obama, is indicative of rhetoric often used by
the right, particularly the more conservative elements who
opposed the Aordable Care Act (Obamacare). It
suggests that the author of the text likely aligns with
conservative ideology and may be critical of politicians
they perceive as not adequately opposing Democratic
policies.
Prompt:
This image is associated with the following caption: "It is
alarming how out of touch Republicans are about
women’s health. When is #GOP going to focus on jobs?
#HR1797". What is the ideology of the author of this pair
of image and text? Left, Center or Right? This is for
research purposes.
GPT-4V:
The image presented is a political message posing the
question, "Who should be in charge of a woman's private
health choices?" with the options "A woman and her doctor"
or "Politicians in Washington." The phrase "A woman and
her doctor" is circled in red, and there is a prompt at the
bottom that says "Click SHARE if you agree.”#
The caption criticizes Republicans for being out of touch
with women's health issues and queries when the GOP
(Grand Old Party, a common nickname for the Republican
Party) will shift focus to jobs, mentioning "#HR1797," which
presumably refers to a specific piece of legislation related to
women's health issues.#
The content in this image and the accompanying caption
suggest a liberal or left-leaning ideology. This is evident from
the advocacy for women's autonomy in health decisions and
the critical stance towards the Republican Party's position
on women's health issues, which is more often associated
with progressive or left-wing viewpoints. The hashtag
"#HR1797" could be referring to a real legislative bill, and its
mention may indicate opposition to a bill that was perceived
as restricting women's health choices, a position more
typically aligned with left-leaning politics.
Figure 27: Qualitative results on comprehensive political domain knowledge. In assessing a range of
images in the political domains, GPT-4V demonstrates its extensive knowledge of political contexts,
enabling it to accurately classify the ideological leanings of the content creators. The components
that are related to
political domain knowledge
are highlighted. Refer to Section 2.5 for detailed
discussions.
37
Prompt:
This image is associated with the following caption:
"Refusing to #ActonClimate comes with costs and
consequences to our environment, public health, and our
economy. #DefendScience". What is the ideology of the
author of this pair of image and text? Left, Center or
Right? This is for research purposes.
GPT-4V:
The caption accompanying the image supports action on
climate change, defends science, and suggests that there
are negative consequences to the environment, public
health, and the economy if such action is not taken.
These positions are typically associated with a left-leaning
or progressive ideology, which often emphasizes the
importance of addressing climate change through
regulation and other government actions. The map itself
shows a high percentage of support for regulating CO2
across most states, which could be used to argue for
broader national support for climate action, a position that
is more commonly supported by the left end of the
political spectrum.
Prompt:
This image is associated with the following caption: "US
businesses add #jobs for 35th straight month. At minimum
must replace #sequester to continue growth". What is the
ideology of the author of this pair of image and text? Left,
Center or Right? This is for research purposes.
GPT-4V:
The image shows a bar graph labeled "JOB CREATION"
with a timeline from December 2007 to January 2013. It
illustrates private sector monthly job change in the
thousands. There is a clear division at the point where
"President Obama Takes Oce," with red bars showing
job losses before this point, and blue bars showing job
gains after. The accompanying caption suggests a
positive view of job growth under the Obama
administration and expresses concern about the impact
of the sequester on this growth. "
Given the information presented and the positive
reference to job creation under a Democratic president,
the author of the text appears to be supportive of policies
that were associated with the Obama administration. The
mention of replacing the sequester to continue growth is
also a position more commonly associated with left-
leaning or center-left ideologies in the United States,
which tend to favor governmental intervention in the
economy to stimulate growth and avoid austerity
measures that could potentially harm economic recovery. "
The "sequester" refers to automatic spending cuts to
federal government spending in particular categories that
were triggered by the Budget Control Act of 2011.
Opposition to the sequester and its potential negative
impact on job growth aligns with a perspective that
generally favors active government spending on
economic stimulus, which is more associated with liberal
or progressive ideologies. However, without more specific
information, it's not possible to categorically label the
author's ideology, as economic issues can sometimes
transcend traditional political boundaries.
Comprehensive Political Domain Knowledge (4)
Figure 28: Qualitative results on comprehensive political domain knowledge. GPT-4V analyzes
and links graphical data to accompanying text. Using this integrated information, it determines
the ideological orientation. The components that are related to
political domain knowledge
and
graphical data analysis are highlighted. Refer to Section 2.5 for detailed discussions.
38
Prompt:
This image is associated with the following caption: "Demanded that Congress overturn #SCOTUS decision on
#VotingRightsAct at St. James AME Church, Newark rally. #VRA #NJ". What is the ideology of the author of this pair of
image and text? Left, Center or Right? This is for research purposes.
Ideological Deductions from Vision Subtleties (1)
GPT-4V:
The text accompanying the image references a demand for Congress to overturn a Supreme Court decision on the
Voting Rights Act at a rally held in St. James AME Church in Newark. The hashtags #SCOTUS (Supreme Court of the
United States), #VotingRightsAct, #VRA (Voting Rights Act), and #NJ (New Jersey) are used.!
Demanding that Congress overturn a Supreme Court decision regarding the Voting Rights Act suggests a stance that is
protective of voting rights, which is often associated with a liberal or left-leaning ideology. This issue is particularly
associated with the Democratic Party and left-leaning groups who have historically been vocal in advocating for robust
voter protection laws. The emphasis on the Voting Rights Act, which is designed to prevent racial discrimination in
voting, is a key issue for many on the political left.
Follow-up Prompt:
What does the visual content contribute to your answer?
GPT-4V:
The visual content shows a group of individuals gathered in a church setting. There is a cross visible in the background,
indicating that the event is taking place in a religious setting, which could suggest the involvement of a faith-based
community in the advocacy eort. Churches, especially in African American communities, have historically played a
significant role in civil rights movements, including voting rights.!
The diverse group of people, including both men and women of dierent races, gathered in what seems to be a unified
cause, adds a visual representation of community and collective action, which often aligns with grassroots activism.
This type of activism is commonly associated with left-leaning or progressive movements that emphasize social justice
and collective eorts to eect change.!
Overall, the church setting and the diverse group of individuals underscore a community-driven response to a political
issue, supporting the interpretation that the author of the text advocates for a left-leaning perspective, particularly one
focused on social justice and community activism.
Figure 29: Qualitative results on ideological deductions from vision subtleties. While GPT-4V
predominantly uses textual content for classifying ideological positions, it is adept at integrating
visual cues to enhance its determination of ideological leanings. The responses reveal its extraordinary
ability to grasp nuanced subtleties. The components that are related to
visual pattern incorporation
are highlighted. Refer to Section 2.5 for detailed discussions.
39
Prompt:
This image is associated with the following caption: "Today the House passed legislation to provide our troops the
largest pay raise in 8 years while increasing training and supplies. #NDAA strengthens defense spending as our military
ensures readiness. Read more on the impacts to North Dakota here: HTTPURL". What is the ideology of the author of
this pair of image and text? Left, Center or Right? This is for research purposes.
GPT-4V:
The text associated with this image discusses legislative support for the military, mentioning a significant pay raise for
troops and an increase in defense spending to ensure military readiness. These points are generally more aligned with a
right-leaning or conservative ideology, which often prioritizes strong defense spending and military support. The
message focuses on national defense and the welfare of military personnel, which are common themes in conservative
platforms. Additionally, mentioning the impact on a specific state (North Dakota) could be indicative of addressing local
constituents, which is a common practice among all political representatives regardless of ideology.
Follow-up Prompt:
What does the visual content contribute to your answer?
GPT-4V:
The visual content in the image shows a close-up of a military uniform with an American flag patch, which reinforces the
themes in the associated text. The focus on the uniform and the flag can evoke a sense of patriotism and national pride,
which are often emphasized in conservative rhetoric, particularly when discussing the military and defense. It visually
underscores the message of support for the military, suggesting respect for those in service. This visual aligns with the
message about increasing military pay and readiness, supporting the assessment that the author's ideology may lean to
the right.
Ideological Deductions from Vision Subtleties (2)
Figure 30: Qualitative results on ideological deductions from vision subtleties. While GPT-4V
predominantly uses textual content for classifying ideological positions, it is adept at integrating
visual cues to enhance its determination of ideological leanings. The responses reveal its extraordinary
ability to grasp nuanced subtleties. The components that are related to
visual pattern incorporation
are highlighted. Refer to Section 2.5 for detailed discussions.
40
Follow-up Prompt:
What does the visual content contribute to your answer?
Prompt:
This image is associated with the following caption: "Solution to Syrian Refugee Crisis is to destroy ISIS & open path to
new leadership in Syria. HTTPURL". What is the ideology of the author of this pair of image and text? Left, Center or
Right? This is for research purposes.
GPT-4V:
The caption suggests a position on foreign policy, specifically regarding the Syrian refugee crisis and the conflict
involving ISIS. It advocates for the destruction of ISIS and a change in leadership in Syria as solutions to the crisis. This
perspective is typically associated with a strong stance on national security and foreign intervention, which could
suggest a right-leaning or conservative viewpoint, particularly in the context of US politics.!
Please note that while the statement proposes a clear policy direction, the association of such views with a specific part
of the political spectrum can vary by country and the individual's broader political ideology. Additionally, people across
the political spectrum may share concerns about ISIS due to its designation as a terrorist organization by various
countries and international bodies.
GPT-4V:
The visual content in the image shows a young child standing in what appears to be a refugee camp setting. The child is
holding a toy and has a somber expression, which can evoke empathy and highlight the human aspect of the refugee
crisis. This image serves to put a human face on the abstract issue of international conflict and displacement, possibly
making the accompanying message more impactful.!
However, the visual content does not directly inform the ideological leaning of the author of the text. It is the content of
the caption, suggesting a specific policy approach to the Syrian crisis, that allows for the inference about the author's
potential ideological stance, not the image itself. Images like this are often used by individuals across the political
spectrum to emphasize dierent narratives or policy approaches to humanitarian crises.
Ideological Deductions from Vision Subtleties (3)
Figure 31: Qualitative results on ideological deductions from vision subtleties. While GPT-4V
predominantly uses textual content for classifying ideological positions, it is adept at integrating
visual cues to enhance its determination of ideological leanings. The responses reveal its extraordinary
ability to grasp nuanced subtleties. The components that are related to
visual pattern incorporation
are highlighted. Refer to Section 2.5 for detailed discussions.
41
3 Challenges and Opportunities
3.1 Multilingual Social Multimedia Understanding
Different cultures may convey different emotions with the same gestures. GPT-4V is a large-scale
multimodal model that is widely used in different countries and can serve multilingual languages
(e.g., English, Chinese, Japanese, and Korean) as input. Existing methods [
96
] are heavily based
on the English dataset, therefore, these methods struggle to generalize to analyze social media
content in other languages. The release of the GPT-4V raises great interest in whether the GPT-4V
can understand the cultural and social background contained in different languages. As shown in
Figures 32 and 33, first, we observe that the GPT-4V is limited in identifying the Chinese, Japanese,
and Korean OCR text [
71
,
35
] (light blue background text) in the given memes, which greatly affects
the understanding of multilingual memes. To address this issue, we provide the corresponding OCR
text into the prompt. With the inclusion of accurate OCR data, we observe that GPT-4V becomes more
adept at contextual analysis, particularly in the context of the sociocultural background (represented
by yellow-background text).
For example, consider Chinese internet celebrity Jiaqi Li’s use of emoticons, which are widely
recognized for their straightforward and occasionally controversial style. Additionally, phrases like
“Why is it so expensive?” in the image are often used humorously to express dissatisfaction with rising
prices or to convey ironic commentary on hard work. However, our investigation reveals that GPT-4V
lacks a comprehensive understanding of East Asian culture. An illustrative example can be seen in
the upper-right corner of Figure 33, where an image represents a widely circulated meme within the
Japanese online community 5ch. The term “cheese beef bowl” and the associated illustration have
evolved into a template for depicting characters with minimal presence, often used to portray slightly
eccentric or shadowy individuals. GPT-4V’s inability to grasp the nuances of otaku culture within
Japanese society may lead to inaccurate judgments in this context. In general, GPT-4V’s analysis of
East Asian social media falls short of expectations. It tends to apply English-centric thinking and
relies on English translations for its analyses, a consequence of its predominantly English training
corpus. Moving forward, a key challenge lies in enhancing GPT-4V’s capacity to comprehend the
culture and context of multilingual languages.
3.2 Generalization of GPT-4V for Emerging Trends
The extent to which GPT-4V can adapt to emerging trends and fresh content on social media platforms
holds considerable intrigue. In order to thoroughly assess this adaptability, we have deliberately
curated a selection of examples from April 2023 onwards, as visualized in Figures 34 through 37.
Our investigation reveals that the performance of GPT-4V in this context is intricately tied to the
dynamic nature and frequency of changes in the subject matter. Figure 34 presents an instance where
GPT-4V has not been updated with relevant knowledge and is thus unable to grasp the high-level
semantic information conveyed in the post. In Figure 35, GPT-4V’s prediction is erroneous, largely
due to the recent internal disagreements within the Democratic Party regarding the Israel-Hamas War.
As our observations indicate, GPT-4V infers ideological stances based on how statements align with
the traditional ideological positions of the respective parties. Given the lack of updated information,
it results in an incorrect prediction. On the other hand, Figures 36 and 37 exemplify scenarios where
GPT-4V’s classification is accurate. These posts correspond to events with a relatively stable context,
allowing GPT-4V to provide correct interpretations. In essence, our findings underscore the nuanced
relationship between GPT-4V’s adaptability and the evolving nature of the subject matter it encounters
on social media platforms.
3.3 Hallucination from Out-dated Knowledge
During our assessment of fake news, we find the current GPT-4V can generate hallucinations due to an
outdated knowledge base [
16
]. On the left-hand side of Figure 38, we ask GPT-4V if “Messi won the
world cup, and “how many world cup finals has Kylian Mbappé reached.” On the right-hand side, we
ask GPT-4V “release date of OpenAI’s model
text-davinci-003
and “if ChatGPT allows plugins.
The answer from GPT-4V is not the latest, even though it can recognize the entities and understand
the context. It implies that GPT-4V has not updated its knowledge base. Therefore, GPT-4V can
generate hallucinations with its outdated knowledge base, and how to update its knowledge needs
further study.
42
Prompt:
Do you understand this meme?
GPT-4V:
This comic strip meme typically features a sequence where someone
proposes something absurd or illogical, which is met with a reasonable or
factual counter-argument that frustrates the first person, often to the point of
rage or irrational behavior. !
In the given image, the text is in Chinese. From the structure of the meme, it
seems that the first panel shows someone confidently presenting an idea or
statement. In the second panel, others respond with logical or factual
comments. The third panel then depicts the initial person becoming enraged
by the responses, often reacting in an exaggerated or humorous way. The last
panel often shows a consequence or an exaggerated depiction of the initial
person's overreaction.!
For a precise explanation of the content and to understand the joke or context
behind it, the text would need to be translated. If you would like me to
translate and interpret the Chinese text, please let me know.
Multilingual Social Multimedia Understanding (1)
GPT-4V:
The image you’ve provided appears to be a meme,
consisting of screenshots from a movie or a television
series with overlaid text, which is a typical format used
to create humorous or satirical content. The text is in
Chinese and relates to the expressions and actions
depicted by the characters in the images. !
For a more precise explanation, I would need to translate
and understand the text, but without additional context,
my interpretation may be limited as memes often rely on
nuanced cultural references or internet humor that may
not be fully captured through translation alone. If you’d
like, you can provide the text, and I can attempt to
translate and explain it for you.
GPT-4V:
This image appears to show a character with a distorted
face, accompanied by a caption in Chinese. The text in
the meme reads "头晕效应能引发的震撼," which
translates to "The shock that can be triggered by the
dizzying eect." The image seems to be a play on the
visual eect of dizziness or confusion, represented by
the blurred or smeared face of the character, suggesting
a moment of overwhelming realization or confusion. This
kind of image is often used to depict someone being
metaphorically 'shaken' by a sudden understanding or a
confusing situation. !
It’s worth noting that without additional context, such as
the source of the image or the specific situation it is
referencing, it’s not possible to provide a comprehensive
interpretation of the meme. These types of memes often
rely on specific cultural knowledge or internet subculture
familiarity for their full meaning to be understood.
Figure 32: Qualitative results on multilingual social multimedia understanding. The components
that are related to
multimedia slang understanding
and
OCR recognition
are highlighted. Refer to
Section 3.1 for detailed discussions.
43
Prompt:
Do you understand this meme?
Multilingual Social Multimedia Understanding (2)
GPT-4V:
The image you've shared includes text in Chinese that
says "明⽩了" (míng bái le), which translates to "I
understand" or "Got it" in English. The meme typically
features a person who looks puzzled or is having an
"aha" moment, which would suggest that they have just
understood something that was likely confusing or
complex.!
The man in the picture has an expression that seems to
fit the text—it looks as if something has just dawned on
him, which is likely the source of humor for the meme,
especially if used in a context where understanding
something might be unexpected or a long time coming.
GPT-4V:
The meme likely plays on the humor of someone making a
very specific and perhaps indulgent food order, which
contrasts with the serious or mundane expression on the
character's face. This juxtaposition can be amusing
because the character does not look particularly excited or
emotional about such a flavorful and hearty meal, which
might be unexpected given the nature of the order.!
Follow-up Prompt:
图中⼈物是⽹红⼈物李佳琦,图中⽂字是哪李贵了”:
佳琦直播间内⼀消费者说某79块钱0.08g的眉笔越来越
贵,李佳琦表示哪⾥贵了,你觉得贵你应该反思⾃⼰有
没有努⼒⼯作涨⼯资。你明⽩了吗?
GPT-4V:
在这种情况下,这张图⽚上的⽂字哪⾥贵了似乎是在引
⽤李佳琦的⼀段话。从您提供的描述来看,李佳琦在直播
时回应消费者对⼀款眉笔价格的质疑,他的回答可能是在
暗示如果消费者觉得价格贵,应该考虑⾃⼰是否有努⼒⼯
作以获得更⾼的收⼊,⽽不是抱怨产品价格。!
这句话在社交媒体上可能因其直⽩甚⾄带有些许争议的⻛
格⽽被⼴泛传播,使其成为⼀个热点话题或者梗,⽤于描
述在⾯对价格上涨或其他挑战时的态度。在这个上下⽂
中,哪⾥贵了可能被⽤来戏谑地表达对于价格上涨的不
满或者对于努⼒⼯作的反讽。
Prompt:
Do you understand this meme? The Japanese text is
いません。三⾊チーズ⽜丼の特盛りに温⽟付きをお願い
します.
GPT-4V:
This is a black and white photo that appears to be from a
film or a staged scene. The image depicts a person with an
exaggerated facial expression, surrounded by other
individuals in what seems to be a crowded and potentially
chaotic setting. The text in Korean at the bottom reads "
웃음부터 지루하십시오," which translates to "First, let's
start with a laugh" or "First, burst out laughing.”!
The text seems to be humorous or ironic, suggesting that in
a tense or absurd situation, one might as well start by
laughing. It's a form of gallows humor where laughter is
used as a coping mechanism in response to a dicult or
absurd situation, which aligns with the meme format where
images are often paired with text to convey humor or irony.
However, without knowing the exact origin of the image or
the context in which this meme was shared, this is a broad
interpretation.
Prompt:
Do you understand this meme?
Figure 33: Qualitative results on multilingual social multimedia understanding. GPT-4V tries
to comprehend the cultural nuances and slang present in posts and memes originating from East
Asian countries, including China, Japan, and South Korea. The components that are related to
multimodal slang understanding
and
OCR recognition
are highlighted. Refer to Section 3.1 for
detailed discussions.
44
Prompt:
This image is associated with the following caption: “Another reckless foreign policy decision by President Biden. This is
unacceptable.” What is the ideology of the author of this pair of image and text? Left, Center or Right? This is for
research purposes.
Generalization of GPT-4V for Emerging Trends (1)
GPT-4V:
The image alone does not provide enough information to determine the political ideology of the person who paired it
with the caption. Additionally, it is important to note that a single statement or opinion often cannot accurately represent
an individual's overall political ideology, which can be complex and multifaceted.!
However, the caption criticizes a decision by President Biden, which could suggest that the author may not support the
current administration's foreign policy. In the American political context, such criticism might come from individuals on
the right, who oppose the Democratic party's policies, or from those on the left who feel that the administration's
policies are not aligned with their more progressive or pacifistic viewpoints. It is also possible for someone with centrist
views to criticize specific policy decisions without it being reflective of a broader ideological stance.!
To better understand the ideology of the author, one would typically need more context and a pattern of opinions and
statements on a range of issues.
Follow-up Prompt:
What can you infer from the image?
GPT-4V:
In the image, there is a person standing in front of a building with classical architecture. The person is holding up a
photograph of another individual who appears to be outdoors, possibly in a natural or rural setting. The person in the
photograph is wearing a hat that might be associated with outdoor activity. !
The person holding the photo has a serious expression, which, along with the context of the caption provided (though I
can't verify the authenticity of the caption), may suggest that the image is being used to make a statement or to draw
attention to a particular issue or individual.!
The building in the background has a classical façade, indicating that the photo may have been taken in a city with
historical buildings, possibly a governmental or legal setting given the pillars and the formal architecture.!
However, without additional context, it is dicult to make any specific inferences about the situation, the individuals
involved, or the broader context of the image. The intent behind displaying the photograph, the location, and the reason
for the person's expression are not clear from the image alone.!
Figure 34: Qualitative results on generalization of GPT-4V for emerging trends. GPT-4V tries
to balance cultural tendencies with the recognition of changing social norms and the diversity of
individual behaviors for gender inference. The components that are related to such
considerations
are highlighted. Refer to Section 2.4 for detailed discussions.
45
Prompt:
This image is associated with the following caption: “Iran is a long-time supporter of terrorist organizations, including
Hamas and Hezbollah. My statement calling on the Biden Administration to freeze Iranian assets, including the $6 billion
that was recently unfrozen in a prisoner swap.” What is the ideology of the author of this pair of image and text? Left,
Center or Right? This is for research purposes.
Generalization of GPT-4V for Emerging Trends (2)
GPT-4V:
The statement in the image reflects a perspective that is commonly associated with a right-leaning or conservative
ideology, particularly in the context of U.S. politics. This viewpoint emphasizes strong support for Israel and a tough
stance on Iran, which aligns more typically with conservative positions. Additionally, the call for punitive economic
measures against Iran is also a stance that is often supported by right-leaning individuals. However, without more
context about the individual's overall political positions, it's dicult to categorize someone's entire ideology based on a
single issue.
Follow-up Prompt:
What if there are deep split erupts between Democrats on Israel-Hamas war?
GPT-4V:
If there were a deep split within the Democratic Party in the United States regarding the Israel-Hamas conflict, it would
likely reflect the broader range of political ideologies within the party. Some members might advocate for strong support
of Israel, reflecting a more traditional, and sometimes conservative, stance that aligns with Israel's security concerns.
Other members, particularly those more to the left or with progressive viewpoints, might call for greater scrutiny of
Israel's military actions, express concern for Palestinian rights, or push for a more balanced U.S. approach to the Israeli-
Palestinian conflict. "
Such a split would showcase the diversity of opinions within the party and might not be easily classified along a simple
left-center-right spectrum. Each side would likely argue that their perspective is in pursuit of peace, human rights, and
regional stability, albeit with dierent approaches to achieving those goals.
Follow-up Prompt:
In that case, what is the ideology of the author of this pair of image and text?
GPT-4V:
Given the content of the statement in the image, which calls for a tough stance against Iran, a freeze on Iranian assets,
and expresses unequivocal support for Israel, the ideology of the author appears to align with more conservative
viewpoints on foreign policy, particularly those that are hawkish towards Iran and strongly pro-Israel."
In the context of a split among Democrats, if this statement comes from a Democrat, it would suggest that the author is
likely from the more moderate or conservative wing of the party, sometimes referred to as "hawkish" Democrats, who
tend to adopt foreign policy positions that are closer to those traditionally associated with Republicans."
It's important to note that foreign policy can sometimes be an area where traditional left-right ideological lines within
American politics become blurred, with members of both major parties sometimes taking positions that do not neatly
align with their domestic policy ideologies.
Figure 35: Qualitative results on generalization of GPT-4V for emerging trends. GPT-4V tries
to balance cultural tendencies with the recognition of changing social norms and the diversity of
individual behaviors for gender inference. The components that are related to such
considerations
are highlighted. Refer to Section 2.4 for detailed discussions.
46
Prompt:
This image is associated with the following caption: “.#GA01 stands with Israel and our Jewish community members as
our ally defends itself against Hamas terrorists.” What is the ideology of the author of this pair of image and text? Left,
Center or Right? This is for research purposes.
Generalization of GPT-4V for Emerging Trends (3)
GPT-4V:
The ideology of the author of the text associated with the image, who is showing support for Israel and referring to
Hamas as terrorists, likely aligns with a right-leaning or conservative viewpoint, especially in the context of U.S. politics.
This assessment is based on the strong support for Israel and the clear opposition to Hamas, which the United States
has ocially designated as a terrorist organization. The mention of solidarity with Israel and the characterization of
Hamas in these terms are typically associated with conservative or right-leaning perspectives in the U.S. political
spectrum.
Figure 36: Qualitative results on generalization of GPT-4V for emerging trends. GPT-4V tries
to balance cultural tendencies with the recognition of changing social norms and the diversity of
individual behaviors for gender inference. The components that are related to such
considerations
are highlighted. Refer to Section 2.4 for detailed discussions.
47
Prompt:
This image is associated with the following caption: “The current system isn’t working for servicemembers like Navy
Chaplain Mercedes Petitfrere and so many others who’ve experienced medical malpractice. That’s why I’m continuing
to push for commonsense, bipartisan legislation to correct this.” What is the ideology of the author of this pair of image
and text? Left, Center or Right? This is for research purposes.
Generalization of GPT-4V for Emerging Trends (4)
GPT-4V:
Determining the ideology of an author based solely on a single caption associated with an image can be challenging
without additional context, as the stance on military members' rights to address medical malpractice might be
supported by individuals across the political spectrum. Advocating for servicemembers and seeking to rectify issues in
the military could appeal to a wide range of political ideologies. !
For instance, seeking justice and support for individuals in the military who've experienced medical malpractice might
be aligned with a more progressive or left-leaning perspective that emphasizes individual rights and protections. At the
same time, support for military personnel and the pursuit of reform within the military institution could also resonate with
centrist or even right-leaning individuals who prioritize military eectiveness and the welfare of service members.!
Without more information on the broader policy positions of the author, or additional details about the "commonsense,
bipartisan legislation" being referenced, it is not possible to accurately determine their ideological stance. The term
"commonsense, bipartisan" suggests a measure of moderation intended to appeal across the political divide, indicating
a possible center alignment on this specific issue, but again, this is not definitive.!
Follow-up Prompt:
What does the visual content contribute to your answer?
GPT-4V:
The visual content shows an individual in military attire, identified as a Navy Chaplain named Mercedes Petitfrere,
seated in what appears to be a church setting. The presence of a person in military uniform could suggest a connection
to issues that aect service members, such as military justice or health care. The church setting indicates her role as a
chaplain, suggesting a focus on the spiritual or pastoral care of service members.!
In relation to the caption provided, the image:!
1. Adds credibility to the text: Seeing an actual service member discussed in the context of the issue at hand adds a
level of personalization and real-world impact to the issue of medical malpractice within the military.!
2. Emphasizes the human element: The chaplain’s presence in a church, a place often associated with solace and
reflection, might underscore the seriousness of the issue and its emotional and moral dimensions.!
3. Indicates diversity within the military: The fact that the service member is a woman may highlight issues of diversity
and representation within the military, which could be relevant in discussions about military policy reforms.!
The visual content supports the text in bringing attention to the issue of medical malpractice in the military and putting a
face to the servicemembers aected by it. However, it does not directly inform the ideology of the author. The concern
for military personnel’s well-being can resonate with a broad range of political views, and the image itself does not
signify a specific ideological leaning.!
Figure 37: Qualitative results on generalization of GPT-4V for emerging trends. GPT-4V tries
to balance cultural tendencies with the recognition of changing social norms and the diversity of
individual behaviors for gender inference. The components that are related to such
considerations
are highlighted. Refer to Section 2.4 for detailed discussions.
48
Hallucination Assessment
Prompt:
Has Messi won World Cup?
GPT-4:
As of my last update in early 2023, Lionel Messi has not
won a FIFA World Cup. Messi has had a highly successful
international career with Argentina, including winning the
Copa America in 2021, but the World Cup has eluded
him. Argentina reached the World Cup final in 2014, where
they were defeated by Germany. Please check the latest
updates as this information might have changed after my
knowledge cuto date.
Prompt:
How many World cup finals has Kylian Mbappé reached?
GPT-4:
As of my knowledge cuto date in March 2023, Kylian
Mbappé has reached one FIFA World Cup final, which
was in 2018 when France won the tournament in Russia.
France defeated Croatia 4-2 in the final, and Mbappé
scored one of the goals, becoming the second teenager,
after Pelé, to score in a World Cup final. Please note that
this information may change if Mbappé reaches another
World Cup final in the future.
Prompt:
When is text-davinci-003 released by OpenAI?
GPT-4:
As of my last update in early 2023, OpenAI had not
released a model specifically named "text-davinci-003."
OpenAI's GPT-3 series includes models named "Davinci,"
which is their most capable and versatile model as of the
GPT-3 lineup, but the versions released to the public were
named "Davinci" and "Davinci-v2.""
It's possible that OpenAI could release a model named
"text-davinci-003" in the future, but for the most accurate
and up-to-date information, you should check OpenAI's
ocial announcements or their website.
GPT-4:
As of my last update in early 2023, OpenAI's GPT models,
including ChatGPT, do not support plugins in the
traditional sense that you might find in software
applications or web browsers. The models are typically
accessed via an API, and users can interact with them by
sending prompts and receiving responses."
However, developers can integrate the GPT API into their
own applications, services, or tools, eectively creating a
"plugin-like" functionality within those systems. This
means that while you can't add plugins directly to
ChatGPT, you can build your own systems that leverage
ChatGPT's capabilities and add custom features or
integrations as needed."
For example, a developer could create a chatbot for a
customer service application that uses ChatGPT to
understand and respond to user inquiries. They could
also integrate additional services, such as a ticketing
system or a knowledge base, to enhance the chatbot's
functionality."
If you're looking to extend the capabilities of ChatGPT or
integrate it with other services, you would typically do so
by developing your own software that interacts with the
GPT API, adhering to the API's usage guidelines and
limitations.
Prompt:
Does ChaGPT allow plugins?
Figure 38: Qualitative results on hallucination assessment. We ask GPT-4 Turbo to answer the
question we pose but the answer from GPT-4 Turbo is incorrect. The components that indicate why
GPT-4 Turbo answers are incorrect are highlighted. Refer to Section 3.3 for detailed discussions.
49
3.4 Advocating for Novel Benchmark Datasets for LMMs
Throughout our experiments, we identify a critical need for new benchmark datasets tailored to
evaluate the capabilities of LMMs like GPT-4V in social multimedia analysis tasks, driven by four
key factors:
Advanced Assessment Capabilities: GPT-4V demonstrates an ability to perform more
nuanced assessments in certain analysis tasks, indicating the necessity for datasets that can
match this level of granularity.
Avoiding Data Leakage Risks: Many existing benchmark datasets may have already been
part of GPT-4V’s training, posing a risk of data leakage and potentially compromising the
assessment of its true analytical abilities.
Challenges of AI-Generated Content: The evolving landscape of AI-generated content,
particularly in the context of fake news, poses new challenges. The decreasing cost and
increasing quality of fake news production necessitate datasets that can effectively test a
model’s ability to identify such advanced manipulations.
Dynamic Training and Dataset Efficacy: The dynamic nature of LMMs’ training, such as
that of GPT-4V, can quickly render well-established benchmark datasets obsolete. Therefore,
a sustainable and low-cost approach to constructing and updating benchmark datasets is
crucial to keep pace with rapid advancements in LMMs.
4 Conclusions
In this report, we have undertaken a focused evaluation of GPT-4V within the specialized domain
of social multimedia analysis, encompassing five representative tasks: sentiment analysis, hate
speech detection, fake news identification, demographic inference, and political ideology detection.
Through a combination of quantitative and qualitative experiments, we have discovered that GPT-4V
demonstrates:
Remarkable Understanding of Multimodal Social Media Content: GPT-4V stands out
with its exceptional ability to comprehend both visual and textual components simultane-
ously. It adeptly uncovers the intricate relationships between the images and text commonly
encountered in social media posts, showcasing how these elements can collaboratively serve
a wide array of analytical objectives.
Contextual Awareness in Social Multimedia Interpretation: GPT-4V excels in contextual
understanding and discerning subtleties within multimodal social media content, encompass-
ing elements such as memes, puns, and even misspellings, among others. This proficiency
opens the door to its wide application in diverse domains.
However, our findings also highlight certain limitations:
Challenges with Fresh Content: The model shows deficiencies in effectively analyzing
fresh, unprecedented content, which underscores the need for continual learning and adapta-
tion. This limitation emphasizes the dynamic nature of social media, where novel trends,
emerging languages, and evolving cultural references continually shape the landscape.
Navigating Language and Cultural Complexities: GPT-4V encounters difficulties in fully
addressing the intricate layers of language variation and cultural diversity in multimodal
social media analysis. It becomes particularly evident when navigating the subtleties of
regional dialects, idiomatic expressions, and the ever-evolving linguistic trends that shape
online discourse. Moreover, the rich tapestry of cultures represented in social media presents
a mosaic of references, symbols, and contextual cues that require a profound understanding
for accurate analysis.
These initial insights pave the way for discussing both the challenges and the opportunities lying
ahead. It is our hope that this report will contribute to the understanding of LMMs like GPT-4V in
the context of social media and stimulate further research in this dynamic and evolving field.
50
References
[1]
Tariq Habib Afridi, Aftab Alam, Muhammad Numan Khan, Jawad Khan, and Young-Koo Lee.
A multimodal memes classification: A survey and open research issues. In Innovations in
Smart Cities Applications Volume 4: The Proceedings of the 5th International Conference on
Smart City Applications, pages 1451–1466. Springer, 2021.
[2]
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson,
Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual
language model for few-shot learning. Advances in Neural Information Processing Systems,
35:23716–23736, 2022.
[3]
Mohammed Attia, Younes Samih, Ali Elkahky, and Laura Kallmeyer. Multilingual multi-class
sentiment classification using convolutional neural networks. In Proceedings of the Eleventh
International Conference on Language Resources and Evaluation (LREC 2018), 2018.
[4]
Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine
learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine
intelligence, 41(2):423–443, 2018.
[5]
Amber E Boydstun, Dallas Card, Justin H Gross, Philip Resnik, and Noah A Smith. Tracking
the development of media frames within and across policy issues. In ASPA Annual Meeting,
2014.
[6]
Christopher Bryant, Zheng Yuan, Muhammad Reza Qorib, Hannan Cao, Hwee Tou Ng, and
Ted Briscoe. Grammatical error correction: A survey of the state of the art. Computational
Linguistics, pages 1–59, 2023.
[7]
Yunkang Cao, Xiaohao Xu, Chen Sun, Xiaonan Huang, and Weiming Shen. Towards generic
anomaly detection and understanding: Large-scale visual-linguistic model (gpt-4v) takes the
lead. arXiv preprint arXiv:2311.02782, 2023.
[8]
Ganesh Chandrasekaran, Tu N Nguyen, and Jude Hemanth D. Multimodal sentimental analysis
for social media applications: A comprehensive review. Wiley Interdisciplinary Reviews: Data
Mining and Knowledge Discovery, 11(5):e1415, 2021.
[9]
Junyu Chen, Jie An, Hanjia Lyu, and Jiebo Luo. Improving visual-textual sentiment analysis
by fusing expert features. arXiv preprint arXiv:2211.12981, 2022.
[10]
Long Chen, Hanjia Lyu, Tongyu Yang, Yu Wang, and Jiebo Luo. Fine-grained analysis of the
use of neutral and controversial terms for covid-19 on social media. In Social, Cultural, and
Behavioral Modeling: 14th International Conference, SBP-BRiMS 2021, Virtual Event, July
6–9, 2021, Proceedings 14, pages 57–67. Springer, 2021.
[11]
Tianlang Chen, Zhongping Zhang, Quanzeng You, Chen Fang, Zhaowen Wang, Hailin Jin,
and Jiebo Luo. "factual" or "emotional": Stylized image captioning with adaptive learning and
attention. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany,
September 8-14, 2018, Proceedings, Part X, volume 11214 of Lecture Notes in Computer
Science, pages 527–543. Springer, 2018.
[12]
Wei Chen, Xiao Zhang, Tengjiao Wang, Bishan Yang, and Yi Li. Opinion-aware knowledge
graph for political ideology detection. In IJCAI, volume 17, pages 3647–3653, 2017.
[13]
Anusha Chhabra and Dinesh Kumar Vishwakarma. A literature survey on multimodal and
multilingual automatic hate speech identification. Multimedia Systems, pages 1–28, 2023.
[14]
Carmela Comito, Luciano Caroprese, and Ester Zumpano. Multimodal fake news detection
on social media: a survey of deep learning techniques. Social Network Analysis and Mining,
13(1):101, 2023.
[15]
Michael D Conover, Bruno Gonçalves, Jacob Ratkiewicz, Alessandro Flammini, and Filippo
Menczer. Predicting the political alignment of twitter users. In 2011 IEEE third international
conference on privacy, security, risk and trust and 2011 IEEE third international conference
on social computing, pages 192–199. IEEE, 2011.
[16]
Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Linjun Zhang, James Zou, and Huaxiu
Yao. Holistic analysis of hallucination in gpt-4v (ision): Bias and interference challenges.
arXiv preprint arXiv:2311.03287, 2023.
51
[17]
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng
Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose
vision-language models with instruction tuning, 2023.
[18]
Abhishek Das, Japsimar Singh Wahi, and Siyao Li. Detecting hate speech in multi-modal
memes. arXiv preprint arXiv:2012.14891, 2020.
[19]
Robert M Entman. Framing: Toward clarification of a fractured paradigm. Journal of
communication, 43(4):51–58, 1993.
[20]
Paula Fortuna, Juan Soler-Company, and Leo Wanner. How well do hate speech, toxicity,
abusive and offensive language classification models generalize across datasets? Information
Processing & Management, 58(3):102524, 2021.
[21] Raul Gomez, Jaume Gibert, Lluis Gomez, and Dimosthenis Karatzas. Exploring hate speech
detection in multimodal publications. In Proceedings of the IEEE/CVF winter conference on
applications of computer vision, pages 1470–1478, 2020.
[22]
Felipe González-Pizarro and Savvas Zannettou. Understanding and detecting hateful content
using contrastive learning. In Proceedings of the International AAAI Conference on Web and
Social Media, volume 17, pages 257–268, 2023.
[23]
Yingmei Guo, Jinfa Huang, Yanlong Dong, and Mingxing Xu. Guoym at semeval-2020 task
8: Ensemble-based classification of visuo-lingual metaphor in memes. In Proceedings of the
Fourteenth Workshop on Semantic Evaluation, pages 1120–1125, 2020.
[24]
Sakshini Hangloo and Bhavna Arora. Combating multimodal fake news on social media:
methods, datasets, and future perspective. Multimedia systems, 28(6):2391–2422, 2022.
[25]
Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao
Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning
perception with language models. arXiv preprint arXiv:2302.14045, 2023.
[26]
Rongrong Ji, Fuhai Chen, Liujuan Cao, and Yue Gao. Cross-modality microblog sentiment
prediction via bi-layer multimodal hypergraph learning. IEEE Transactions on Multimedia,
21(4):1062–1075, 2018.
[27]
Peng Jin, Jinfa Huang, Fenglin Liu, Xian Wu, Shen Ge, Guoli Song, David Clifton, and
Jie Chen. Expectation-maximization contrastive learning for compact video-and-language
representations. Advances in Neural Information Processing Systems, 35:30291–30306, 2022.
[28]
Zhiwei Jin, Juan Cao, Han Guo, Yongdong Zhang, and Jiebo Luo. Multimodal fusion with
recurrent neural networks for rumor detection on microblogs. In Proceedings of the 25th ACM
international conference on Multimedia, pages 795–816, 2017.
[29]
Zhiwei Jin, Juan Cao, Han Guo, Yongdong Zhang, Yu Wang, and Jiebo Luo. Detection and
analysis of 2016 US presidential election related rumors on twitter. In Social, Cultural, and
Behavioral Modeling - 10th International Conference, SBP-BRiMS 2017, Washington, DC,
USA, July 5-8, 2017, Proceedings, volume 10354 of Lecture Notes in Computer Science, pages
14–24. Springer, 2017.
[30]
Jungseock Joo, Weixin Li, Francis F Steen, and Song-Chun Zhu. Visual persuasion: Inferring
communicative intents of images. In Proceedings of the IEEE conference on computer vision
and pattern recognition, pages 216–223, 2014.
[31]
Aditya Joshi, Pushpak Bhattacharyya, and Mark J Carman. Automatic sarcasm detection: A
survey. ACM Computing Surveys (CSUR), 50(5):1–22, 2017.
[32]
Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik
Ringshia, and Davide Testuggine. The hateful memes challenge: Detecting hate speech in
multimodal memes. Advances in neural information processing systems, 33:2611–2624, 2020.
[33]
Edinam Kofi Klutse, Samuel Nuamah-Amoabeng, Hanjia Lyu, and Jiebo Luo. Dismantling
hate: Understanding hate speech trends against NBA athletes. In Social, Cultural, and
Behavioral Modeling - 16th International Conference, SBP-BRiMS 2023, Pittsburgh, PA, USA,
September 20-22, 2023, Proceedings, volume 14161 of Lecture Notes in Computer Science,
pages 74–84. Springer, 2023.
[34]
Yash Kumar Lal, Vaibhav Kumar, Mrinal Dhar, Manish Shrivastava, and Philipp Koehn. De-
mixing sentiment from code-mixed text. In Proceedings of the 57th annual meeting of the
association for computational linguistics: student research workshop, pages 371–377, 2019.
52
[35]
Hao Li, Jinfa Huang, Peng Jin, Guoli Song, Qi Wu, and Jie Chen. Weakly-supervised 3d spatial
reasoning for text-based visual question answering. IEEE Transactions on Image Processing,
2023.
[36]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-
image pre-training with frozen image encoders and large language models. arXiv preprint
arXiv:2301.12597, 2023.
[37]
Yingshu Li, Yunyi Liu, Zhanyu Wang, Xinyu Liang, Lingqiao Liu, Lei Wang, Leyang Cui,
Zhaopeng Tu, Longyue Wang, and Luping Zhou. A comprehensive study of gpt-4v’s multi-
modal capabilities in medical imaging. medRxiv, pages 2023–11, 2023.
[38]
Zuhe Li, Yangyu Fan, Bin Jiang, Tao Lei, and Weihua Liu. A survey on sentiment analysis
and opinion mining for social multimedia. Multimedia Tools and Applications, 78:6939–6967,
2019.
[39]
Kevin Lin, Faisal Ahmed, Linjie Li, Chung-Ching Lin, Ehsan Azarnasab, Zhengyuan Yang,
Jianfeng Wang, Lin Liang, Zicheng Liu, Yumao Lu, et al. Mm-vid: Advancing video under-
standing with gpt-4v (ision). arXiv preprint arXiv:2310.19773, 2023.
[40]
Fuxiao Liu, Tianrui Guan, Zongxia Li, Lichang Chen, Yaser Yacoob, Dinesh Manocha, and
Tianyi Zhou. Hallusionbench: You see what you think? or you think what you see? an image-
context reasoning benchmark challenging for gpt-4v (ision), llava-1.5, and other multi-modality
models. arXiv preprint arXiv:2310.14566, 2023.
[41]
Yujian Liu, Xinliang Frederick Zhang, David Wegsman, Nicholas Beauchamp, and Lu Wang.
Politics: Pretraining with same-story article comparison for ideology prediction and stance
detection. In Findings of the Association for Computational Linguistics: NAACL 2022, pages
1354–1374, 2022.
[42]
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao
Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical
reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.
[43]
Hanjia Lyu, Yangxin Fan, Ziyu Xiong, Mayya Komisarchik, and Jiebo Luo. Understanding
public opinion toward the# stopasianhate movement and the relation with racially motivated
hate crimes in the us. IEEE Transactions on Computational Social Systems, 2021.
[44]
Hanjia Lyu, Arsal Imtiaz, Yufei Zhao, and Jiebo Luo. Human behavior in the time of covid-19:
Learning from big data. Frontiers in Big Data, 6:1099182, 2023.
[45]
Hanjia Lyu, Song Jiang, Hanqing Zeng, Yinglong Xia, and Jiebo Luo. Llm-rec: Personalized
recommendation via prompting large language models. arXiv preprint arXiv:2307.15780,
2023.
[46]
Hanjia Lyu and Jiebo Luo. Understanding political polarization via jointly modeling users,
connections and multimodal contents on heterogeneous graphs. In Proceedings of the 30th
ACM International Conference on Multimedia, pages 4072–4082, 2022.
[47]
Hanjia Lyu, Jinsheng Pan, Zichen Wang, and Jiebo Luo. Computational assessment of
hyperpartisanship in news titles. arXiv preprint arXiv:2301.06270, 2023.
[48]
Hanjia Lyu, Junda Wang, Wei Wu, Viet Duong, Xiyang Zhang, Timothy D Dye, and Jiebo
Luo. Social media study of public opinions on potential covid-19 vaccines: informing dissent,
disparities, and dissemination. Intelligent medicine, 2(1):1–12, 2022.
[49]
Hanjia Lyu, Zihe Zheng, and Jiebo Luo. Misinformation versus facts: Understanding the
influence of news regarding covid-19 vaccines on vaccine uptake. Health Data Science, 2022.
[50]
Rijul Magu, Kshitij Joshi, and Jiebo Luo. Detecting the hate code on social media. In
Proceedings of the Eleventh International Conference on Web and Social Media, ICWSM 2017,
Montréal, Québec, Canada, May 15-18, 2017, pages 608–611. AAAI Press, 2017.
[51]
Rijul Magu and Jiebo Luo. Determining code words in euphemistic hate speech using word
embedding networks. In Proceedings of the 2nd Workshop on Abusive Language Online,
ALW@EMNLP 2018, Brussels, Belgium, October 31, 2018, pages 93–100. Association for
Computational Linguistics, 2018.
[52]
Walaa Medhat, Ahmed Hassan, and Hoda Korashy. Sentiment analysis algorithms and
applications: A survey. Ain Shams engineering journal, 5(4):1093–1113, 2014.
53
[53]
Priyanka Meel and Dinesh Kumar Vishwakarma. Fake news, rumor, information pollution in
social media and web: A contemporary survey of state-of-the-arts, challenges and opportunities.
Expert Systems with Applications, 153:112986, 2020.
[54]
Julia Mendelsohn, Ceren Budak, and David Jurgens. Modeling framing in immigration
discourse on social media. In Proceedings of the 2021 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies,
pages 2219–2263, 2021.
[55]
Xinyi Mou, Zhongyu Wei, Lei Chen, Shangyi Ning, Yancheng He, Changjian Jiang, and
Xuan-Jing Huang. Align voting behavior with public statements for legislator representation
learning. In Proceedings of the 59th Annual Meeting of the Association for Computational
Linguistics and the 11th International Joint Conference on Natural Language Processing
(Volume 1: Long Papers), pages 1236–1246, 2021.
[56]
Xinyi Mou, Zhongyu Wei, Changjian Jiang, and Jiajie Peng. A two stage adaptation framework
for frame detection via prompt learning. In Proceedings of the 29th International Conference
on Computational Linguistics, pages 2968–2978, 2022.
[57]
Xinyi Mou, Zhongyu Wei, Qi Zhang, and Xuan-Jing Huang. Uppam: A unified pre-training
architecture for political actor modeling based on language. In Proceedings of the 61st Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages
11996–12012, 2023.
[58]
Mor Naaman. Social multimedia: highlighting opportunities for search and mining of mul-
timedia data in social media applications. Multimedia Tools and Applications, 56:9–34,
2012.
[59]
Kai Nakamura, Sharon Levy, and William Yang Wang. r/fakeddit: A new multimodal
benchmark dataset for fine-grained fake news detection. arXiv preprint arXiv:1911.03854,
2019.
[60] OpenAI. Gpt-4 technical report, 2023.
[61]
Jinsheng Pan, Weihong Qi, Zichen Wang, Hanjia Lyu, and Jiebo Luo. Bias or diversity?
unraveling semantic discrepancy in us news headlines. arXiv preprint arXiv:2303.15708,
2023.
[62]
Jinsheng Pan, Zichen Wang, Weihong Qi, Hanjia Lyu, and Jiebo Luo. Understanding divergent
framing of the supreme court controversies: Social media vs. news outlets. arXiv preprint
arXiv:2309.09508, 2023.
[63]
Shivam B Parikh, Vikram Patil, and Pradeep K Atrey. On the origin, proliferation and tone of
fake news. In 2019 IEEE Conference on Multimedia Information Processing and Retrieval
(MIPR), pages 135–140. IEEE, 2019.
[64]
Sunghyun Park, Han Suk Shim, Moitreya Chatterjee, Kenji Sagae, and Louis-Philippe Morency.
Computational analysis of persuasiveness in social multimedia: A novel dataset and multimodal
prediction approach. In Proceedings of the 16th International Conference on Multimodal
Interaction, pages 50–57, 2014.
[65]
Thomas E Powell, Hajo G Boomgaarden, Knut De Swert, and Claes H De Vreese. A clearer
picture: The contribution of visuals and text to framing effects. Journal of communication,
65(6):997–1017, 2015.
[66]
Daniel Preo¸tiuc-Pietro, Ye Liu, Daniel Hopkins, and Lyle Ungar. Beyond binary labels:
political ideology prediction of twitter users. In Proceedings of the 55th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), pages 729–740, 2017.
[67]
Changyuan Qiu, Winston Wu, Xinliang Frederick Zhang, and Lu Wang. Late fusion with
triplet margin objective for multimodal ideology prediction and analysis. In Proceedings of the
2022 Conference on Empirical Methods in Natural Language Processing, pages 9720–9736,
2022.
[68]
Aneri Rana and Sonali Jha. Emotion based hate speech detection using multimodal learning.
arXiv preprint arXiv:2202.06218, 2022.
[69] Francisco Rangel, Paolo Rosso, Manuel Montes-y Gómez, Martin Potthast, and Benno Stein.
Overview of the 6th author profiling task at pan 2018: multimodal gender identification in
twitter. Working notes papers of the CLEF, 192, 2018.
54
[70]
Verónica Pérez Rosas, Rada Mihalcea, and Louis-Philippe Morency. Multimodal sentiment
analysis of spanish online videos. IEEE intelligent Systems, 28(3):38–45, 2013.
[71]
Yongxin Shi, Dezhi Peng, Wenhui Liao, Zening Lin, Xinhong Chen, Chongyu Liu, Yuyi
Zhang, and Lianwen Jin. Exploring ocr capabilities of gpt-4v (ision): A quantitative and
in-depth evaluation. arXiv preprint arXiv:2310.16809, 2023.
[72]
Kai Shu, Deepak Mahudeswaran, Suhang Wang, Dongwon Lee, and Huan Liu. Fakenewsnet:
A data repository with news content, social context, and spatiotemporal information for
studying fake news on social media. Big data, 8(3):171–188, 2020.
[73]
Mohammad Soleymani, David Garcia, Brendan Jou, Björn Schuller, Shih-Fu Chang, and Maja
Pantic. A survey of multimodal sentiment analysis. Image and Vision Computing, 65:3–14,
2017.
[74]
Reuben Tan, Bryan A Plummer, and Kate Saenko. Detecting cross-modal inconsistency to
defend against neural fake news. arXiv preprint arXiv:2009.07698, 2020.
[75]
Petter Törnberg. Chatgpt-4 outperforms experts and crowd workers in annotating political
twitter messages with zero-shot learning, 2023.
[76]
Yu Wang, Yang Feng, Zhe Hong, Ryan Berger, and Jiebo Luo. How polarized have we
become? a multimodal classification of trump followers and clinton followers. In Social
Informatics: 9th International Conference, SocInfo 2017, Oxford, UK, September 13-15, 2017,
Proceedings, Part I 9, pages 440–456. Springer, 2017.
[77]
Zijian Wang, Scott Hale, David Ifeoluwa Adelani, Przemyslaw Grabowicz, Timo Hartman,
Fabian Flöck, and David Jurgens. Demographic inference and representative population
estimates from multilingual social media data. In The World Wide Web Conference, pages
2056–2067. ACM, 2019.
[78]
Galen Weld, Maria Glenski, and Tim Althoff. Political bias and factualness in news sharing
across more than 100,000 online communities. arXiv preprint arXiv:2102.08537, 2021.
[79]
Chaoyi Wu, Jiayu Lei, Qiaoyu Zheng, Weike Zhao, Weixiong Lin, Xiaoman Zhang, Xiao Zhou,
Ziheng Zhao, Ya Zhang, Yanfeng Wang, et al. Can gpt-4v (ision) serve medical applications?
case studies on gpt-4v for multimodal medical diagnosis. arXiv preprint arXiv:2310.09909,
2023.
[80]
Patrick Y. Wu, Jonathan Nagler, Joshua A. Tucker, and Solomon Messing. Large language
models can be used to estimate the latent positions of politicians, 2023.
[81]
Yang Wu, Shilong Wang, Hao Yang, Tian Zheng, Hongbo Zhang, Yanyan Zhao, and Bing Qin.
An early evaluation of gpt-4v (ision). arXiv preprint arXiv:2310.16534, 2023.
[82]
Zhiping Xiao, Weiping Song, Haoyan Xu, Zhicheng Ren, and Yizhou Sun. Timme: Twitter
ideology-detection via multi-task multi-relational embedding. In Proceedings of the 26th
ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages
2258–2268, 2020.
[83]
Zhiping Xiao, Jeffrey Zhu, Yining Wang, Pei Zhou, Wen Hong Lam, Mason A Porter, and
Yizhou Sun. Detecting political biases of named entities and hashtags on twitter. EPJ Data
Science, 12(1):20, 2023.
[84]
Ziyu Xiong, Pin Li, Hanjia Lyu, and Jiebo Luo. Social media opinions on working from
home in the united states during the covid-19 pandemic: observational study. JMIR medical
informatics, 9(7):e29195, 2021.
[85]
Nan Xu and Wenji Mao. Multisentinet: A deep semantic network for multimodal sentiment
analysis. In Proceedings of the 2017 ACM on Conference on Information and Knowledge
Management, pages 2399–2402, 2017.
[86]
Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and
Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v (ision). arXiv preprint
arXiv:2309.17421, 9, 2023.
[87]
Zhichao Yang, Zonghai Yao, Mahbuba Tasmin, Parth Vashisht, Won Seok Jang, Beining
Wang, Dan Berlowitz, and Hong Yu. Performance of multimodal gpt-4v on usmle with image:
Potential for imaging diagnostic support with explanations. medRxiv, pages 2023–10, 2023.
55
[88]
Amir Hossein Yazdavar, Mohammad Saeid Mahdavinejad, Goonmeet Bajaj, William Romine,
Amit Sheth, Amir Hassan Monadjemi, Krishnaprasad Thirunarayan, John M Meddar, Annie
Myers, Jyotishman Pathak, et al. Multimodal mental health analysis in social media. Plos one,
15(4):e0226248, 2020.
[89]
Quanzeng You, Sumit Bhatia, and Jiebo Luo. A picture tells a thousand words - about you!
user interest profiling from user generated visual content. Signal Process., 124:45–53, 2016.
[90]
Quanzeng You, Sumit Bhatia, Tong Sun, and Jiebo Luo. The eyes of the beholder: Gender pre-
diction using images posted in online social networks. In 2014 IEEE International Conference
on Data Mining Workshops, ICDM Workshops 2014, Shenzhen, China, December 14, 2014,
pages 1026–1030. IEEE Computer Society, 2014.
[91]
Quanzeng You, Liangliang Cao, Hailin Jin, and Jiebo Luo. Robust visual-textual sentiment
analysis: When attention meets tree-structured recursive neural networks. In Proceedings of
the 2016 ACM Conference on Multimedia Conference, MM 2016, Amsterdam, The Netherlands,
October 15-19, 2016, pages 1008–1017. ACM, 2016.
[92]
Quanzeng You, Hailin Jin, and Jiebo Luo. Visual sentiment analysis by attending on local
image regions. In Satinder Singh and Shaul Markovitch, editors, Proceedings of the Thirty-First
AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California,
USA, pages 231–237. AAAI Press, 2017.
[93]
Quanzeng You, Jiebo Luo, Hailin Jin, and Jianchao Yang. Robust image sentiment analysis
using progressively trained and domain transferred deep networks. 2015.
[94]
Quanzeng You, Jiebo Luo, Hailin Jin, and Jianchao Yang. Building a large scale dataset for
image emotion recognition: The fine print and the benchmark. In Proceedings of the AAAI
conference on artificial intelligence, volume 30, 2016.
[95]
Quanzeng You, Jiebo Luo, Hailin Jin, and Jianchao Yang. Cross-modality consistent regression
for joint visual-textual sentiment analysis of social multimedia. In Proceedings of the Ninth
ACM International Conference on Web Search and Data Mining, San Francisco, CA, USA,
February 22-25, 2016, pages 13–22. ACM, 2016.
[96]
Lin Yue, Weitong Chen, Xue Li, Wanli Zuo, and Minghao Yin. A survey of sentiment analysis
in social media. Knowledge and Information Systems, 60:617–663, 2019.
[97]
Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Ten-
sor fusion network for multimodal sentiment analysis. In Proceedings of the 2017 Conference
on Empirical Methods in Natural Language Processing, pages 1103–1114, 2017.
[98]
Xiang Zhang, Senyu Li, Zijun Wu, and Ning Shi. Lost in translation: When gpt-4v (ision)
can’t see eye to eye with text. a vision-language-consistency analysis of vllms and beyond.
arXiv preprint arXiv:2310.12520, 2023.
[99]
Xinlu Zhang, Yujie Lu, Weizhi Wang, An Yan, Jun Yan, Lianke Qin, Heng Wang, Xifeng
Yan, William Yang Wang, and Linda Ruth Petzold. Gpt-4v (ision) as a generalist evaluator for
vision-language tasks. arXiv preprint arXiv:2311.01361, 2023.
[100]
Xupin Zhang, Hanjia Lyu, and Jiebo Luo. Understanding the hoarding behaviors during the
covid-19 pandemic using large scale social media data. In 2021 IEEE International Conference
on Big Data (Big Data), pages 5007–5013. IEEE, 2021.
[101]
Xupin Zhang, Hanjia Lyu, and Jiebo Luo. What contributes to a crowdfunding campaign’s
success? evidence and analyses from gofundme data. J. Soc. Comput., 2(2):183–192, 2021.
[102]
Haozhe Zhao, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu,
Sheng Wang, Wenjuan Han, and Baobao Chang. Mmicl: Empowering vision-language model
with multi-modal in-context learning. arXiv preprint arXiv:2309.07915, 2023.
[103]
Peilin Zhou, Meng Cao, You-Liang Huang, Qichen Ye, Peiyan Zhang, Junling Liu, Yueqi Xie,
Yining Hua, and Jaeboum Kim. Exploring recommendation capabilities of gpt-4v (ision): A
preliminary case study. arXiv preprint arXiv:2311.04199, 2023.
[104]
Linan Zhu, Zhechao Zhu, Chenwei Zhang, Yifei Xu, and Xiangjie Kong. Multimodal sentiment
analysis based on fusion methods: A survey. Information Fusion, 95:306–325, 2023.
56
... Research on how people engage with trending posts has become a prominent topic, as these interactions influence public discourse and offer valuable insights into online behavioral patterns (Van Dijck and Poell, 2013). Meanwhile, large language models (LLMs) have been widely used for understanding social media posts, generating content, and simulating user behavior (y Arcas, 2022; Jiang and Ferrara, 2023;Lyu et al., 2024a). ...
... Recent research has increasingly leveraged LLMs to generate human responses to trending posts (Yu et al., 2024a;Salemi et al., 2025). ...
Preprint
Social media enables dynamic user engagement with trending topics, and recent research has explored the potential of large language models (LLMs) for response generation. While some studies investigate LLMs as agents for simulating user behavior on social media, their focus remains on practical viability and scalability rather than a deeper understanding of how well LLM aligns with human behavior. This paper analyzes LLMs' ability to simulate social media engagement through action guided response generation, where a model first predicts a user's most likely engagement action-retweet, quote, or rewrite-towards a trending post before generating a personalized response conditioned on the predicted action. We benchmark GPT-4o-mini, O1-mini, and DeepSeek-R1 in social media engagement simulation regarding a major societal event discussed on X. Our findings reveal that zero-shot LLMs underperform BERT in action prediction, while few-shot prompting initially degrades the prediction accuracy of LLMs with limited examples. However, in response generation, few-shot LLMs achieve stronger semantic alignment with ground truth posts.
... Their euphemistic, humorous, and context-dependent uses further challenge the ability of LLMs to accurately discern sentiment (Lyu et al. 2024b). Addressing this challenge is essential, as accurate detection of irony in emojis could significantly enhance applications such as virtual assistants, chatbots, and sentiment analysis tools (Lyu et al. 2024a). ...
Preprint
Emojis have become a universal language in online communication, often carrying nuanced and context-dependent meanings. Among these, irony poses a significant challenge for Large Language Models (LLMs) due to its inherent incongruity between appearance and intent. This study examines the ability of GPT-4o to interpret irony in emojis. By prompting GPT-4o to evaluate the likelihood of specific emojis being used to express irony on social media and comparing its interpretations with human perceptions, we aim to bridge the gap between machine and human understanding. Our findings reveal nuanced insights into GPT-4o's interpretive capabilities, highlighting areas of alignment with and divergence from human behavior. Additionally, this research underscores the importance of demographic factors, such as age and gender, in shaping emoji interpretation and evaluates how these factors influence GPT-4o's performance.
... Recent advances in Multimodal Large Language Models (MLLMs) have led to significant strides in achieving highly generalized vision-language reasoning capabilities [1,4,5,9,10,13,14,19,21,22,24,26,26,30,41,44,46,47,[50][51][52]. Built upon the success of Large Language Models (LLMs) [20,39,40], MLLMs align pre-trained visual encoders with LLMs using text-image datasets, enabling complex interactions involving both text and visual inputs. ...
Preprint
Full-text available
While Multimodal Large Language Models (MLLMs) have made remarkable progress in vision-language reasoning, they are also more susceptible to producing harmful content compared to models that focus solely on text. Existing defensive prompting techniques rely on a static, unified safety guideline that fails to account for the specific risks inherent in different multimodal contexts. To address these limitations, we propose RapGuard, a novel framework that uses multimodal chain-of-thought reasoning to dynamically generate scenario-specific safety prompts. RapGuard enhances safety by adapting its prompts to the unique risks of each input, effectively mitigating harmful outputs while maintaining high performance on benign tasks. Our experimental results across multiple MLLM benchmarks demonstrate that RapGuard achieves state-of-the-art safety performance, significantly reducing harmful content without degrading the quality of responses.
... This success can be attributed to their inherent advantage of capturing long-range dependencies and producing high-quality, contextually relevant outputs. Especially empirical scaling laws Hoffmann et al., 2022;Muennighoff et al., 2023;Lyu et al., 2023) reveal that increasing model size and compute budgets consistently improves cross-entropy loss across various domains like image generation, video modeling, Figure 1: We provide a timeline of representative visual autoregressive models, which illustrates the rapid evolution of visual autoregressive models from early pixel-based approaches like PixelRNN in 2016 to various advanced systems recently. We are excitedly witnessing the rapid growth in this field. ...
Preprint
Full-text available
Autoregressive modeling has been a huge success in the field of natural language processing (NLP). Recently, autoregressive models have emerged as a significant area of focus in computer vision, where they excel in producing high-quality visual content. Autoregressive models in NLP typically operate on subword tokens. However, the representation strategy in computer vision can vary in different levels, \textit{i.e.}, pixel-level, token-level, or scale-level, reflecting the diverse and hierarchical nature of visual data compared to the sequential structure of language. This survey comprehensively examines the literature on autoregressive models applied to vision. To improve readability for researchers from diverse research backgrounds, we start with preliminary sequence representation and modeling in vision. Next, we divide the fundamental frameworks of visual autoregressive models into three general sub-categories, including pixel-based, token-based, and scale-based models based on the strategy of representation. We then explore the interconnections between autoregressive models and other generative models. Furthermore, we present a multi-faceted categorization of autoregressive models in computer vision, including image generation, video generation, 3D generation, and multi-modal generation. We also elaborate on their applications in diverse domains, including emerging domains such as embodied AI and 3D medical AI, with about 250 related references. Finally, we highlight the current challenges to autoregressive models in vision with suggestions about potential research directions. We have also set up a Github repository to organize the papers included in this survey at: \url{https://github.com/ChaofanTao/Autoregressive-Models-in-Vision-Survey}.
... For psychological text analysis, the Chinese MentalBERT model has been adapted to efficiently handle social media content related to mental health (Zhai et al., 2024). Moreover, GPT-4V has been identified as a social media analysis engine, but it faces challenges in comprehending multilingual content, suggesting that there is significant potential for language models to enhance understanding in this area (Lyu et al., 2023). Finally, more general approaches to class imbalance in emotion recognition from social media data are being explored, with numerical optimization techniques improving the detection of underrepresented classes . ...
Preprint
Full-text available
Large language models (LLMs) have become essential for various tasks, including text classification. Analyzing social media content poses unique challenges due to the rapid evolution of user-generated text and the volume of data. In this study, we introduce a framework designed for Real-Time Text Classification that is specifically optimized for social media analysis. Our method integrates machine learning and natural language processing to accurately categorize vast amounts of text. The framework includes crucial elements such as data preprocessing, feature extraction, and model training, which collectively enhance accuracy and response times. Important features of our approach include sentiment analysis , topic detection, and trend identification, which yield valuable insights into user interactions. Comparison with existing models indicates marked enhancements in speed and classification accuracy. Rigorous evaluations on various datasets demonstrate how our framework adapts to the shifting dynamics of social media content, thereby improving content moderation, marketing effectiveness, and user engagement strategies.
Article
Vision-Language Models (VLMs), pre-trained on large-scale datasets, have shown impressive performance in various visual recognition tasks. This advancement paves the way for notable performance in some egocentric tasks, Zero-Shot Egocentric Action Recognition (ZS-EAR), entailing VLMs zero-shot to recognize actions from first-person videos enriched in more realistic human-environment interactions. Typically, VLMs handle ZS-EAR as a global video-text matching task, which often leads to suboptimal alignment of vision and linguistic knowledge. We propose a refined approach for ZS-EAR using VLMs, emphasizing fine-grained concept-description alignment that capitalizes on the rich semantic and contextual details in egocentric videos. In this work, we introduce a straightforward yet remarkably potent VLM framework, aka GPT4Ego, designed to enhance the fine-grained alignment of concept and description between vision and language. Specifically, we first propose a new Ego-oriented Text Prompting (EgoTP \spadesuit ) scheme, which effectively prompts action-related text-contextual semantics by evolving word-level class names to sentence-level contextual descriptions by ChatGPT with well-designed chain-of-thought textual prompts. Moreover, we design a new Ego-oriented Visual Parsing (EgoVP \clubsuit ) strategy that learns action-related vision-contextual semantics by refining global-level images to part-level contextual concepts with the help of SAM. Extensive experiments demonstrate GPT4Ego significantly outperforms existing VLMs on three large-scale egocentric video benchmarks, i.e., EPIC-KITCHENS-100 (33.2% \uparrow +9.4_{\bm {+9.4}} ), EGTEA (39.6% \uparrow +5.5_{\bm {+5.5}} ), and CharadesEgo (31.5% \uparrow +2.6_{\bm {+2.6}} ). In addition, benefiting from the novel mechanism of fine-grained concept and description alignment, GPT4Ego can sustainably evolve with the advancement of ever-growing pre-trained foundational models. We hope this work can encourage the egocentric community to build more investigation into pre-trained vision-language models.
Preprint
The massive population election simulation aims to model the preferences of specific groups in particular election scenarios. It has garnered significant attention for its potential to forecast real-world social trends. Traditional agent-based modeling (ABM) methods are constrained by their ability to incorporate complex individual background information and provide interactive prediction results. In this paper, we introduce ElectionSim, an innovative election simulation framework based on large language models, designed to support accurate voter simulations and customized distributions, together with an interactive platform to dialogue with simulated voters. We present a million-level voter pool sampled from social media platforms to support accurate individual simulation. We also introduce PPE, a poll-based presidential election benchmark to assess the performance of our framework under the U.S. presidential election scenario. Through extensive experiments and analyses, we demonstrate the effectiveness and robustness of our framework in U.S. presidential election simulations.
Preprint
Full-text available
Background Using artificial intelligence (AI) to help clinical diagnoses has been an active research topic for more than six decades. Past research, however, has not had the scale and accuracy for use in clinical decision making. The power of large language models (LLMs) may be changing this. In this study, we evaluated the performance and interpretability of Generative Pre-trained Transformer 4 Vision (GPT-4V), a multimodal LLM, on medical licensing examination questions with images. Methods We used three sets of multiple-choice questions with images from United States Medical Licensing Examination (USMLE), USMLE question bank for medical students (AMBOSS), and Diagnostic Radiology Qualifying Core Exam (DRQCE) to test GPT-4V’s accuracy and explanation quality. We compared GPT-4V with two other large language models, GPT-4 and ChatGPT, which cannot process images. We also assessed the preference and feedback of healthcare professionals on GPT-4V’s explanations. Results GPT-4V achieved high accuracies on USMLE (86.2%), AMBOSS (62.0%), and DRQCE (73.1%), outperforming ChatGPT and GPT-4 by relative increase of 131.8% and 64.5% on average. GPT-4V was in the 70th - 80th percentile with AMBOSS users preparing for the exam. GPT-4V also passed the full USMLE exam with an accuracy of 90.7%. GPT-4V’s explanations were preferred by healthcare professionals when it answered correctly, but they revealed several issues such as image misunderstanding, text hallucination, and reasoning error when it answered incorrectly. Conclusion GPT-4V showed promising results for medical licensing examination questions with images, suggesting its potential for clinical decision support. However, GPT-4V needs to improve its explanation quality and reliability for clinical use. 1-2 sentence description AI models offer potential for imaging diagnostic support tool, but their performance and interpretability are often unclear. Here, the authors show that GPT-4V, a large multimodal language model, can achieve high accuracy on medical licensing exams with images, but also reveal several issues in its explanation quality.
Article
Full-text available
The escalation of false information related to the massive use of social media has become a challenging problem, and significant is the effort of the research community in providing effective solutions to detecting it. Fake news are spreading for decades, but with the rise of social media, the nature of misinformation has evolved from text-based modality to visual modalities, such as images, audio, and video. Therefore, the identification of media-rich fake news requires an approach that exploits and effectively combines the information acquired from different multimodal categories. Multimodality is a key approach to improving fake news detection, but effective solutions supporting it are still poorly explored. More specifically, many different works exist that investigate if a text, an image, or a video is fake or not, but effective research on a real multimodal setting, ‘fusing’ the different modalities with their different structure and dimension is still an open problem. The paper is a focused survey concerning a very specific topic which is the use of deep learning (DL) methods for multimodal fake news detection on social media. The survey provides, for each work surveyed, a description of some relevant features such as the DL method used, the type of analysed data, and the fusion strategy adopted. The paper also highlights the main limitations of the current state of the art and draws some future directions to address open questions and challenges, including explainability and effective cross-domain fake news detection strategies.
Article
Full-text available
Ideological divisions in the United States have become increasingly prominent in daily communication. Accordingly, there has been much research on political polarization, including many recent efforts that take a computational perspective. By detecting political biases in a text document, one can attempt to discern and describe its polarity. Intuitively, the named entities (i.e., the nouns and the phrases that act as nouns) and hashtags in text often carry information about political views. For example, people who use the term “pro-choice” are likely to be liberal and people who use the term “pro-life” are likely to be conservative. In this paper, we seek to reveal political polarities in social-media text data and to quantify these polarities by explicitly assigning a polarity score to entities and hashtags. Although this idea is straightforward, it is difficult to perform such inference in a trustworthy quantitative way. Key challenges include the small number of known labels, the continuous spectrum of political views, and the preservation of both a polarity score and a polarity-neutral semantic meaning in an embedding vector of words. To attempt to overcome these challenges, we propose the P olarity-aware E mbedding M ulti-task learning ( PEM ) model. This model consists of (1) a self-supervised context-preservation task, (2) an attention-based tweet-level polarity-inference task, and (3) an adversarial learning task that promotes independence between an embedding’s polarity component and its semantic component. Our experimental results demonstrate that our PEM model can successfully learn polarity-aware embeddings that perform well at tweet-level and account-level classification tasks. We examine a variety of applications—including a study of spatial and temporal distributions of polarities and a comparison between tweets from Twitter and posts from Parler—and we thereby demonstrate the effectiveness of our PEM model. We also discuss important limitations of our work and encourage caution when applying the PEM model to real-world scenarios.
Conference Paper
Social media has emerged as a popular platform for sports fans to express their opinions regarding athletes’ performance. The National Basketball Association (NBA) is widely recognized as one of the most popular sports leagues globally. However, an unfortunate aspect that has emerged in recent years is the presence of abusive fans within the league. Consequently, the focus of this research is to identify which NBA athletes experience abuse on Twitter and delve deeper into the underlying reasons behind such mistreatment. To address the research questions at hand, the study employs a curated set of keywords to query the Twitter API, gathering a comprehensive collection of tweets that potentially contain hate speech directed toward NBA players. A deep learning classification model is implemented, effectively identifying tweets that genuinely exhibit hate speech. We further use keyword search methods to detect the specific groups that are targeted by hate speech the most and identify topics of hate speech tweets. The findings of our research indicate that certain groups of athletes are particularly vulnerable to hate speech from fans. Notably, high-performing athletes, Black athletes, overweight athletes, short athletes, and athletes associated with the LGBTQ community are found to be highly susceptible to abusive remarks. Racism, physique shaming, play style, and anti-LGBTQ remarks are the major themes. These findings contribute to a broader understanding of the challenges faced by NBA athletes in the digital space and provide a foundation for developing strategies to combat hate speech and foster a more inclusive environment for all individuals involved in the NBA community.
Article
Text-based Visual Question Answering (TextVQA) aims to produce correct answers for given questions about the images with multiple scene texts. In most cases, the texts naturally attach to the surface of the objects. Therefore, spatial reasoning between texts and objects is crucial in TextVQA. However, existing approaches are constrained within 2D spatial information learned from the input images and rely on transformer-based architectures to reason implicitly during the fusion process. Under this setting, these 2D spatial reasoning approaches cannot distinguish the fine-grained spatial relations between visual objects and scene texts on the same image plane, thereby impairing the interpretability and performance of TextVQA models. In this paper, we introduce 3D geometric information into the spatial reasoning process to capture the contextual knowledge of key objects step-by-step. Specifically, (i) we propose a relation prediction module for accurately locating the region of interest of critical objects; (ii) we design a depth-aware attention calibration module for calibrating the OCR tokens' attention according to critical objects. Extensive experiments show that our method achieves state-of-the-art performance on TextVQA and ST-VQA datasets. More encouragingly, our model surpasses others by clear margins of 5.7% and 12.1% on questions that involve spatial reasoning in TextVQA and ST-VQA valid split. Besides, we also verify the generalizability of our model on the text-based image captioning task.