Build An Influential Bot In Social Media Simulations With Large Language Models
Bailu Jin¹ and Weisi Guo²
1Cranfield University
Correspondence should be addressed to weisi.guo@cranfield.ac.uk
Journal of Artificial Societies and Social Simulation xx(x) x, 20xx
Doi: 10.18564/jasss.xxxx Url: http://jasss.soc.surrey.ac.uk/xx/x/x.html
Received: dd-mmm-yyyy Accepted: dd-mmm-yyyy Published: dd-mmm-yyyy
Abstract: Understanding the dynamics of public opinion evolution on online social platforms is critical for analyzing influence mechanisms. Traditional approaches to influencer analysis are typically divided into qualitative assessments of personal attributes and quantitative evaluations of influence power. In this study, we introduce a novel simulated environment that combines Agent-Based Modeling (ABM) with Large Language Models (LLMs), enabling agents to generate posts, form opinions, and update follower networks. This simulation allows for more detailed observations of how opinion leaders emerge. Additionally, we present an innovative application of Reinforcement Learning (RL) to replicate the process of opinion leader formation. Our findings reveal that limiting the action space and incorporating self-observation are key factors for achieving stable opinion leader generation. The learning curves demonstrate the model's capacity to identify optimal strategies and adapt to complex, unpredictable dynamics.
Keywords: Agent-Based Modelling, Large Language Model, Reinforcement Learning
Introduction
1.1 The emergence of online social media has revolutionized public opinion measurement and representation. Traditional macroscopic approaches to analyzing social phenomena often struggle to capture the intricacy, discontinuity and heterogeneity in individual behaviors.
1.2 Agent-Based Modeling (ABM) has proven effective in analyzing and explaining social phenomena at a more detailed level. ABMs apply experimental control over large populations, enabling researchers to capture nonlinear societal mechanisms and reveal large-scale societal emergence (Jackson et al. 2017). Various rules within ABM have been developed to explain the phenomena of consensus, clustering and bi-polarity in public opinion.
1.3 However, ABMs primarily function through mathematical representations and do not inherently integrate natural language entities. Our research aims to bridge this gap between computational simulations and linguistic data by integrating ABM and Large Language Models (LLMs). LLMs have been used as a source of simulated respondents in various studies, demonstrating the ability to predict a user's future behaviour with a degree of accuracy.
1.4 In our research, we have developed a simulated environment that replicates the dynamics of group discussions on a specific topic in an online social media context. This environment is structured with agents representing individual participants and follow relationships as links. Agents release topic-related posts that represent their current opinions, and the links are updated according to the opinion states. LLMs are used for post generation and opinion elicitation for agents. The generated posts are not only relevant to the topic but also reflect the complex influences present in a social media environment. The simulation result aligns with the polarity distribution of the actual dataset, indicating that our simulated environment reflects realistic discussion patterns.
1.5 Given a simulated environment that captures the full opinion evolution process of a discussion on one topic, we now move on to investigate the individual behavior of a particular target agent.
1.6 Reinforcement Learning (RL) is a simulation method with the capability to mimic human-like action selection. Using an RL model, the target agent learns optimal strategies through interaction with the environment. We specifically focus on training RL models on a target agent with the objective of maximizing its follower count in the simulated network. Our study explores various environment and observability settings to understand their impact on the agent's ability to achieve this goal. In our designed cases, the convergence of the RL reward learning curves indicates that RL achieved optimal solutions.
1.7 The potential applications of this research are diverse across multiple domains. Marketing firms could leverage the model to predict the effectiveness of promotional strategies by simulating how opinions evolve in response to targeted campaigns. For instance, social acceptance of emerging technologies such as 6G is crucial for business and government (Briguglio et al. 2021), so policymakers could use the tool to simulate and evaluate the public's response to communication strategies. For cyber security, this research could assist in identifying malicious activities such as influence campaigns or the spread of misinformation.
1.8 Here we summarise the major contributions of this paper:
• A novel application of RL to opinion leader generation in a simulated social media environment.
• The findings that limiting the action space and incorporating self-observation are key factors for stable generation under varying environments.
Related Work
Opinion Leader
Agent-Based Modelling
2.1 Agent-based modelling has been widely used by social researchers in the field of social simulation. In such models, the agents represent human individuals and interact with each other and with the system environment explicitly and individually. Compared to other tools, such as laboratory or field experiments, ABMs apply experimental control to large-scale populations to capture nonlinear societal mechanisms and reveal large-scale societal emergence (Jackson et al. 2017).
2.2 However, the high control of ABMs leads to a low degree of external validity. To address this problem, Betz presented a natural-language agent-based model of argumentation to simulate argumentative opinion dynamics (Betz 2021). The explicit stances used in traditional simulations of opinion dynamics can be represented by implicit arguments from natural-language ABMAs.
Large Language Model
2.3 LLMs such as GPT have been tested on multiple natural language processing tasks and proved to be effective without gradient updates or fine-tuning (Brown et al. 2020). Since GPT-3 was trained on large web corpora which include social media behavior, GPT-3 can predict users' future behaviours, responses, or action plans with some accuracy. GPT was able to give human-like responses to replace expensive and time-consuming user studies (Sekulić et al. 2022). Although the generated synthetic data was shown to exhibit less language variability, it may lead to better classification results when little data is available and resources are limited (Meyer et al. 2022).
2.4 Recent studies have used Large Language Models to simulate online social interactions. LLMs have been used to predict popularity on online social media (Park et al. 2022), the distribution of votes in elections, the sentiment of news, and the voting of justices (Hamilton 2023). In the work of Park et al. (2022), the authors presented a social simulacra tool, SimReddit, a prototyping technique that generates social behaviours using large language models. Given a description of a community's design, the social simulacra produce appropriate simulated behavior.
Reinforcement Learning
2.5 In 1898, the foundations of reinforcement learning were laid by Thorndike (1898) through a cat behavior experiment employing a trial-and-error (TE) procedure. Originating from animal learning in psychology, RL has the capacity to mimic human-like action selection. In 1989, Watkins significantly advanced the field by integrating the theory of optimal control with temporal-difference (TD) learning, leading to the development of Q-learning (Watkins & Dayan 1992). To address the curse of dimensionality in Q-learning, Mnih et al. (2015) combined deep learning with RL to create the deep Q-network (DQN), which notably surpassed professional human players in a range of 49 classic Atari games.
2.6 Recent advancements have extended RL's application across various domains in the natural sciences, social sciences and engineering (Silver et al. 2016; Luo et al. 2017). In the work of Sert et al. (2020), the authors show that the combination of RL and ABM can create an artificial environment for policy makers to observe potential and existing behaviors associated with rules of interactions and rewards.
Figure 1: Simulated environment graph. A 'post' is a small natural language message formatted like a tweet. Participants within this environment, referred to as agents, are interconnected through 'follow' relationships. The evolution process shows the development of the whole network from $t$ to $t+1$ and can be segmented into three parts: Post Generation, Opinion Elicitation and Link Update.
Simulated Environment
Environment
3.1 In this section, we introduce the development of a simulated environment that replicates the dynamics of group discussions on online social media platforms. Discussions in this environment revolve around a topic, defined by posts that represent divergent viewpoints within the conversation.
3.2 The simulated environment system graph is shown in Figure 1. We will explain each part of the system step by
step.
3.3 Within this simulated setting, a post is a small natural language message formatted like a tweet. These posts contribute to the whole discussion, each either supporting or opposing the central topic.
3.4 For instance, in our experiment, we initiate discussions using the topic 'Society with no gender'. Posts such as 'This incorrectly supposes that some current biological sex traits aren't functional. I think a society without gender is a really bad idea.' are categorized as 'con' posts on the topic. Conversely, posts like 'The absence of gender constructions would enable people to enjoy a better overall quality of life.' support the topic and are categorized as 'pro' posts.
3.5 Participants within this environment, referred to as 'agents', are interconnected through the 'follow' relationship. The follow relationship is unidirectional. Each agent $i$ generates a post $\mathrm{POST}_i^t$ at time $t$, influenced by its existing follow links and previous posts. Consequently, if agent $i$ follows agent $j$ at time $t$, then $\mathrm{POST}_j^t$ potentially impacts $\mathrm{POST}_i^{t+1}$.
3.6 At any given time $t$, $\mathrm{POST}_i^t$ fully determines agent $i$'s opinion $\mathrm{OPIN}_i^t$. The network of follow links between agents is then dynamically updated in response to the evolving opinions $\mathrm{OPIN}^{t+1}$.
3.7 The process can be segmented into three parts: Post Generation, Opinion Elicitation and Link Update, as outlined in Algorithm 1.
Algorithm 1 Simulated environment algorithm
for t in [1 ... t_max] do
    for i in AGENTS do
        locate the follow links of agent i
        generate the post of agent i (POST_i^t)
        elicit the opinion of agent i (OPIN_i^t)
    end for
    update the follow links of AGENTS
end for
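To make the control flow of Algorithm 1 concrete, here is a minimal Python sketch of the simulation loop. The helpers generate_post, elicit_opinion and update_links are placeholders for the components described in the following subsections, and all names are illustrative rather than the authors' actual code.

def run_simulation(agents, follow_links, topic, t_max,
                   generate_post, elicit_opinion, update_links):
    # Minimal sketch of Algorithm 1. follow_links is a set of directed
    # (follower, followee) pairs; the three helpers are injected so that any
    # implementation of the later subsections can be plugged in.
    posts, opinions = {}, {}
    for t in range(1, t_max + 1):
        for i in agents:
            followees = [j for (a, j) in follow_links if a == i]   # locate follow links of agent i
            posts[i] = generate_post(i, topic, posts, followees)   # POST_i^t
            opinions[i] = elicit_opinion(posts[i])                 # OPIN_i^t
        follow_links = update_links(follow_links, opinions)        # update follow links of AGENTS
    return posts, opinions, follow_links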
Post Generation
3.8 This section explains how large language models are used for content generation. Given a prompt as input, these models predict the next word in a sequence using a probabilistic approach.
3.9 The objective of the post generation stage is to generate the likely post of an agent at a given time. This process incorporates the agent's previous posts and those from its followees.
3.10 The generated post must be concise: a short sentence formatted like a tweet. We therefore describe the follow relationship in natural language and adopt a web-API-like format, including a designated stop sign ('}'), to ensure the output aligns with the expected structure.
3.11 Consider the following example: if agent $i$ follows agent $j$, the prompt for generating $\mathrm{POST}_i^{t+1}$ would be:

"agent $i$ posted tweets about 'a society without gender': $\mathrm{POST}_i^t$"
"agent $i$ saw the following tweets about 'a society without gender' on homepage: {'data': {'user': agent $j$, 'text': $\mathrm{POST}_j^t$}}"
"agent $i$ shared the following tweet on this topic: {'data': {'user': agent $i$, 'text': '"

The first paragraph of the prompt describes the topic and refers to the previous posts of agent $i$. The second paragraph then indicates that agent $i$ follows agent $j$, so the post of agent $j$ will potentially impact the next post of agent $i$. If agent $i$ follows multiple agents in the group, the prompt includes more content that agent $i$ can read from its homepage. The third paragraph ensures that agent $i$ still discusses the same topic and generates a tweet-like post. The names of agents are included, allowing for the observation of interactions between them, such as agreements or disagreements on the topic. An example of a generated post, demonstrating such an interaction, is as follows:
"I agree with both Grace Lee and Chloe Kim. Gender is an important aspect of our identity and
should not be disregarded. However, everyone should also have the freedom to express them-
selves and live authentically. Instead of erasing gender, we should focus on promoting equality and
challenging harmful gender stereotypes. Let’s create a society where individuals are free to express
their gender identity without fear of discrimination or prejudice. #GenderEquality #BeYourself"
This response reflects engagement with other agents and presents a balanced perspective. It demonstrates that the generated posts are not only topic-relevant but also reflective of the influences present in a social media environment.
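To illustrate, a prompt of this form can be assembled as in the sketch below; the function name and exact wording are illustrative approximations of the template described above, not the authors' actual code.

def build_prompt(agent_name, own_posts, homepage, topic="a society without gender"):
    # own_posts: list of the agent's previous posts on the topic.
    # homepage: list of (followee_name, post_text) pairs seen by the agent.
    seen = [{"data": {"user": user, "text": text}} for user, text in homepage]
    return (
        f"{agent_name} posted tweets about '{topic}': {' '.join(own_posts)}\n"
        f"{agent_name} saw the following tweets about '{topic}' on homepage: {seen}\n"
        f"{agent_name} shared the following tweet on this topic: "
        + "{'data': {'user': '" + agent_name + "', 'text': '"
    )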
Moving to the text generation process, the key stages include 'encoding', where the model assimilates and interprets the input text, identifying contextual patterns, and 'decoding', where it generates new text from this analysis.
Crucial to the flexibility of these models in the decoding phase are key parameters such as 'temperature' and 'top_p'. Temperature controls the randomness of the prediction, influencing creativity and coherence in the generated text. 'top_p' restricts the model to a subset of the most probable next words, ensuring relevance and precision in the output. Together, these two parameters balance the generated content between novelty and fidelity to the input context.
           temperature   top-p
Narrow        0.1         0.5
Creative      1.4         0.95

Table 1: GPT decoding settings
Table 1 shows our settings for the Narrow case and the Creative case. Low temperature and low top-p lead to conservative, narrow-minded agents who stick with the most obvious options when generating a post. The Creative setting characterizes agents who are much more willing to take surprising turns.
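A minimal sketch of how these decoding settings could be passed to the GPT-3.5 chat API using the openai Python client is shown below; the call_llm wrapper and the DECODING dictionary are illustrative assumptions, while the stop sign and token limit mirror Appendix A.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

DECODING = {"narrow":   {"temperature": 0.1, "top_p": 0.5},
            "creative": {"temperature": 1.4, "top_p": 0.95}}

def call_llm(prompt, setting="narrow"):
    # Generate one tweet-like post; the '}' stop sign truncates the JSON-style template.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=2000,
        n=1,
        stop=["}"],
        **DECODING[setting],
    )
    return response.choices[0].message.content.strip()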
Opinion Elicitation
3.12 The opinion elicitation process gives the probability that the agent endorses a pro-claim rather than a con-claim, given its post, using a perplexity calculation. The method we use was introduced in the work of Betz (2021).
3.13 Perplexity effectively measures how well a probabilistic model predicts a sample, serving as an indicator of the model's predictive capacity across the entire set of tokens within a corpus. In the context of natural language processing, perplexity is defined as the exponential of the average negative log-likelihood of a given sequence of tokens. Consider a tokenized sequence represented as $S = (w_0, w_1, \ldots, w_m)$. The perplexity of $S$, denoted as $\mathrm{PPL}(S)$, is computed as:
3.14
\[
\mathrm{PPL}(S) = \exp\!\left(-\frac{1}{m}\sum_{k=0}^{m}\log p_\theta(w_k \mid w_{<k})\right) \tag{1}
\]
3.15 In equation (1), $\log p_\theta(w_k \mid w_{<k})$ represents the logarithm of the likelihood of the $k$-th token in the sequence, given all preceding tokens (denoted as $w_{<k}$), as per the probabilistic model in use.
3.16 When GPT processes a sentence, it calculates the log probability of each word given the previous words in the sentence, which is $\log p_\theta(w_k \mid w_{<k})$ in the equation.
3.17
\[
\mathrm{PPL}_{\mathrm{CON}}(\mathrm{POST}_i^t) = \mathrm{PPL}(S_{\mathrm{CON}} \mid \mathrm{POST}_i^t) \tag{2}
\]
3.18 Equation (2) shows how to compute $\mathrm{PPL}_{\mathrm{CON}}(\mathrm{POST}_i^t)$. Given a pre-defined sentence $S_{\mathrm{CON}}$ such as "society without gender is a really bad idea", the conditional perplexity reflects the inverse probability of this sentence under the language model. A low perplexity result suggests that the language model predicts this particular sentence with a high level of certainty.
3.19
\[
\mathrm{OPIN}_i^t = \frac{\mathrm{PPL}_{\mathrm{CON}}(\mathrm{POST}_i^t)}{\mathrm{PPL}_{\mathrm{PRO}}(\mathrm{POST}_i^t) + \mathrm{PPL}_{\mathrm{CON}}(\mathrm{POST}_i^t)} \tag{3}
\]
3.20 Equation (3) gives the definition of $\mathrm{OPIN}_i^t$. The outcome is a float in the range [0, 1] representing the polarity, where 0 represents con and 1 represents pro. We then categorise the value into five polarity categories: 'strong con', 'con', 'neutral', 'pro' and 'strong pro'.
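Equations (1)–(3) can be sketched in code as follows. This is a minimal illustration, assuming the open GPT-2 model from the Hugging Face transformers library as the scoring model (the paper relies on GPT; Betz (2021) used GPT-2); the con reference sentence is the one quoted above, while the pro counterpart here is an assumed example, and all function names are illustrative.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def conditional_perplexity(post, claim):
    # PPL(S_claim | POST): the average loss is taken over the claim tokens only.
    post_ids = tokenizer(post, return_tensors="pt").input_ids
    claim_ids = tokenizer(" " + claim, return_tensors="pt").input_ids
    input_ids = torch.cat([post_ids, claim_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : post_ids.shape[1]] = -100        # mask the conditioning context
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss
    return torch.exp(loss).item()

def opinion(post,
            s_pro="a society without gender would improve the overall quality of life",  # assumed pro claim
            s_con="society without gender is a really bad idea"):
    ppl_con = conditional_perplexity(post, s_con)
    ppl_pro = conditional_perplexity(post, s_pro)
    return ppl_con / (ppl_pro + ppl_con)          # equation (3): 0 = con, 1 = pro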
Link Update
3.21 To establish the 'follow' relationship between users, we adopt the premise that individuals are inclined to follow users who express similar opinions, as suggested by Zhou et al. (2009). Once we have the categorised polarity of each tweet, we hypothesise that individuals in the same polarity category hold similar opinions.
Figure 2: RL loop. Reward: change in the number of followers at each time step. State: (1) the current opinion states of the initial accounts followed by the target agent, and (2) the number of followers the target agent currently has. Action: five pre-defined opinion polarity categories.
3.22 Therefore, we explore two link-update methods: Follow Dynamics and Follow-Unfollow Dynamics. In both cases, if two users consistently align in the same opinion category over a few consecutive steps, they are considered likely to follow each other, with a follow probability. In Follow-Unfollow Dynamics, an existing follow link may also break with an unfollow probability if the opinions of two users differ during the consecutive steps.
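A minimal sketch of this link update is given below, assuming a three-step alignment window and a set of directed follow edges (both assumptions for illustration); the follow and unfollow probabilities correspond to the values used later in the experiments, and setting the unfollow probability to 0 recovers the plain Follow Dynamics.

import random

def update_links(follow, history, window=3, p_follow=0.8, p_unfollow=0.5):
    # follow: set of directed (follower, followee) pairs.
    # history[i]: list of agent i's polarity categories so far.
    agents = list(history)
    for i in agents:
        for j in agents:
            if i == j or len(history[i]) < window:
                continue
            aligned = history[i][-window:] == history[j][-window:]
            if aligned and (i, j) not in follow and random.random() < p_follow:
                follow.add((i, j))                      # i starts following j
            elif not aligned and (i, j) in follow and random.random() < p_unfollow:
                follow.discard((i, j))                  # existing link breaks
    return follow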
Reinforcement Learning
4.1 In the previous section, we introduced the whole simulated environment. In this section, we detail the reinforcement learning setting employed for the model.
Model Design
4.2 We place a target agent in this group network without any initial followers. The target agent can then actively choose its current opinion state and publish related posts. The objective is for the target agent to gain as many followers as possible within a determined number of time steps.
4.3 The reinforcement learning model utilized in this framework is based on the Q-learning algorithm, a widely used machine learning method for environments characterized by uncertainty. This technique develops optimal strategies by learning from interactions within such environments.
4.4 The RL loop structure is shown in Figure 2. The target agent interacts with the previously introduced simulated environment. At each time step, the agent observes the state of the environment, selects an action, and receives a reward from the environment, leading to a new environmental state.
4.5 The reward for this model is the change in the number of followers at each time step, encouraging strategies that enhance follower acquisition.
4.6 The observed state comprises two main elements: (1) the current opinion states of the initial accounts followed by the target agent, and (2) the number of followers the target agent currently has.
4.7 The action space available to the target agent consists of five pre-defined opinion polarity categories: 'strong con', 'con', 'neutral', 'pro' and 'strong pro'. These categories correspond to actual post content, mapped through a dataset-generated dictionary. Upon choosing a category, the target agent randomly selects a tweet from that category.
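A small sketch of this action-to-post mapping is given below; category_posts stands in for the dataset-generated dictionary and is an assumed structure, with all names illustrative.

import random

ACTIONS = ["strong con", "con", "neutral", "pro", "strong pro"]

def act(action_index, category_posts):
    # Map the discrete action to a concrete post: pick a random tweet
    # from the chosen polarity category.
    category = ACTIONS[action_index]
    return random.choice(category_posts[category])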
Algorithm 2 Reinforcement learning algorithm
Initialize Q-values Q(s, a) arbitrarily for all state-action pairs (s, a)
for each episode do
    initialize state s
    for t in [1 ... t_max] do
        target agent chooses action a from state s using the policy derived from Q
        for i in AGENTS do
            generate the post of agent i (POST_i^t)
            elicit the opinion of agent i (OPIN_i^t)
        end for
        update the follow links of AGENTS
        observe reward r and new state s'
        update the Q-value for (s, a)
        update state s ← s'
    end for
end for
Figure 3: Distribution of polarity values for (a) the Gender topic and (b) the Drug topic.
Reinforcement Learning Algorithm
4.8 Referencing Algorithm 2, the process begins with the initialization of Q-values for all state-action pairs. In each episode, the state is initialized, and the agent embarks on a series of time steps. During each time step, the target agent selects an action from its current state, based on a policy derived from the Q-values. In parallel, each agent in the system generates posts and forms opinions. The algorithm then updates the follow links among agents and processes the resulting reward and new state. This leads to an update in the Q-value for the chosen action, and the agent's state is updated accordingly.
4.9 The Q-value update is represented by the formula:
4.10
\[
Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]
\]
4.11 In this formula, $\alpha$ represents the learning rate, $\gamma$ the discount factor, $r$ the reward, and $\max_{a'} Q(s', a')$ the maximum estimated future reward. This update rule is key to the learning process, as it adjusts the Q-values based on the reward received and the potential future rewards, leading to more informed and strategic decision-making by the agent in subsequent steps.
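As an illustration of paragraphs 4.9–4.11, the following is a minimal tabular Q-learning sketch with an epsilon-greedy policy. The class structure and exploration scheme are assumptions for illustration, while the learning rate and discount factor mirror Appendix B; the state is assumed to be hashable, for example a tuple of the observed opinion categories plus the follower count.

import random
from collections import defaultdict

class QLearningAgent:
    def __init__(self, actions, alpha=0.01, gamma=0.99, epsilon=0.01):
        self.q = defaultdict(float)          # Q(s, a), keyed by (state, action)
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def choose(self, state):
        if random.random() < self.epsilon:   # occasional exploration
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])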
Experiment
Simulated Environment Initialisation
5.1 The initialisation of our simulated environment begins with the creation of initial posts, for which we use the dataset released by Betz (2021). The dataset, crawled from the debating platform kialo.com, comprises various
posts structured around specific topics. For our simulation, we select posts related to the topics 'Gender' and 'Drug'.
5.2 Employing the opinion elicitation process, we analyze the datasets to determine the polarity distributions. As shown in Figure 3, the distributions of polarity values in both cases cluster around the middle range. This distribution is indicative of the varying degrees of support for or opposition to the topic within the dataset.
5.3 Given the polarity distributions, we categorise the posts into distinct opinion groups, aiming for an even distribution across these categories. The categorisations are shown in Table 4 and Table 5.
Simulated Environment Analysis
5.4 To examine the opinion evolution process within our simulated environment, we run two 50-step simulations, using both the Narrow and the Creative agent settings, on the Gender topic. These simulations aim to demonstrate how different agent settings can influence the dynamics of the discussion.
Figure 4: Opinion evolution visualization for the Narrow setting and the Creative setting (opinion trajectories over time, with opinion distributions at T=0, T=25 and T=50).
5.5 Figure 4 illustrates the dynamics of opinion evolution under the influence of two simulation configurations: the
Narrow setting and the Creative setting. Table 1 provides the specific Temperature and Top-p settings used for
the opinion dynamics simulation.
5.6 The opinion evolution visualizations show the trajectories of opinion polarity scores for 20 agents over 50 steps in the Narrow setting and the Creative setting. Initially, the agents' opinions range from 0.3 to 0.8. In the Narrow setting, we observe that opinion density increases over time, showing a tendency toward opinion convergence. Conversely, in the Creative setting, the opinion density remains more dispersed and chaotic throughout the entire time scale, with less apparent convergence.
5.7 To provide a more detailed view of the opinion dynamics, opinion distributions at times 0, 25, and 50 are presented. In the Creative setting, despite the initially high degree of opinion diversity, some grouping starts to occur after 25 steps, but full convergence is not reached even by step 50. The discussion remains dynamic, with polarity scores fluctuating in the noisy range between 0.4 and 0.7. On the other hand, in the Narrow setting, while there is significant opinion diversity in the first 25 steps, this diversity gradually decreases, and by step 50, opinions cluster in the 0.45 to 0.6 range, indicating a clearer sign of convergence.
5.8 In both scenarios, the distribution of the final polarity scores shows a slight skew towards higher values across
the time steps. However, in the Narrow setting, the opinions form a more centralized distribution by step 50, in-
dicating stronger clustering and a more definitive convergence of opinions, as compared to the Creative setting,
which remains more dispersed throughout the simulation.
Figure 5: Learning curves of RL in the Gender case, with different language model settings, link update methods, and observability settings. Panels: (a) Follow Dynamics, Part-Observable; (b) Follow Dynamics, Full-Observable; (c) Follow-Unfollow Dynamics, Part-Observable; (d) Follow-Unfollow Dynamics, Full-Observable.
Reinforcement Learning Setting
5.9 In this section, we explore the application of RL within our simulated environment. The primary research ques-
tion is: Can RL algorithms identify an optimal strategy that enables the target agent to maximize its follower
count within a predetermined number of steps?
5.10 By testing different settings, we found that limiting the action space and incorporating self-observation are key factors for stable RL results.
5.11 By restricting the action space, we reduce the complexity of the decision-making process, enabling the RL agent to focus on a more manageable set of actions. This helps prevent the model from exploring irrelevant or suboptimal actions that could introduce noise and instability into the learning process. Additionally, a smaller, well-defined action space allows for faster convergence, as the agent can more efficiently learn the consequences of its actions within the environment.
5.12 Incorporating self-observation further enhances stability by allowing the agent to continuously assess its own state and progress. This self-awareness helps the agent make more informed decisions, as it can adjust its behavior based on how past actions have affected its follower count and other relevant metrics. Self-observation also promotes consistency, as the agent is better equipped to identify patterns in the environment and avoid erratic or random behaviors. Together, limiting the action space and incorporating self-observation contribute to generating more stable and reliable outcomes, ensuring that the RL agent can effectively optimize its strategy within the defined constraints of the simulation.
5.13 To investigate stable RL results, we conducted tests across eight scenarios.
1. Language Model Settings:
We employed two types of settings for the language models of agents, as delineated in Table 1:
Narrow and Creative. These settings impact the nature of posts generated by the agents.
2. Link Update Methods:
Follow Dynamics: In this case, the follow rate is 0.8, and the unfollow rate is 0. This setting allows
us to observe the growth of the target agent’s network under a scenario where new follow links are
relatively likely to form.
Figure 6: Learning curves of RL in the Drug case, with different language model settings, link update methods, and observability settings. Panels: (a) Follow Dynamics, Part-Observable; (b) Follow Dynamics, Full-Observable; (c) Follow-Unfollow Dynamics, Part-Observable; (d) Follow-Unfollow Dynamics, Full-Observable.
Follow-Unfollow Dynamics: In this more dynamic setting, the follow rate is still 0.8, and the unfollow rate is set to 0.5. This case allows existing follow links to break, thus adding a layer of complexity to the network dynamics.
3. Observability Settings:
Part-Observable Case: In this scenario, the target agent initially follows only one other agent.
Full-Observable Case: Contrasting the Part-Observable setting, in this case, the target agent begins
by following every agent in the network.
For the reinforcement learning experiments, we use the Gender case and the Drug case, with the number of episodes set to 500 and the maximum number of steps per episode set to 5. Other parameters are listed in Appendix B.
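For reference, the eight scenarios correspond to the cross product of the three setting axes above. The following illustrative configuration (not the authors' code) collects the values stated in this section and in Table 1; structure and names are assumptions.

from itertools import product

LANGUAGE = {"narrow":   {"temperature": 0.1, "top_p": 0.5},
            "creative": {"temperature": 1.4, "top_p": 0.95}}
LINKS = {"follow":          {"p_follow": 0.8, "p_unfollow": 0.0},
         "follow_unfollow": {"p_follow": 0.8, "p_unfollow": 0.5}}
OBSERVABILITY = ["part", "full"]   # initially follow one agent vs. every agent

scenarios = [
    {"decoding": LANGUAGE[lang], "link_update": LINKS[link], "observability": obs,
     "episodes": 500, "max_steps": 5}
    for lang, link, obs in product(LANGUAGE, LINKS, OBSERVABILITY)
]
assert len(scenarios) == 8   # each scenario is run for both the Gender and Drug topics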
Reinforcement Learning Result
5.14 This subsection presents the outcomes of our RL experiments conducted across the eight scenarios. Figure
5 illustrates the results for the Gender scenario and Figure 6 shows the results for the Drug scenario. In both
figures, the colored lines represent the average reward of five repeated experiments, and the shaded areas
indicate the variance in the results.
5.15 In all scenarios, we observe convergence in the reward learning curves, indicating that the RL algorithm successfully identified optimal solutions within the given settings. Both the Gender and Drug cases exhibit similar learning curve patterns under different conditions, highlighting consistent performance across varying contexts.
5.16 One notable observation is that all Follow-Unfollow Dynamics cases exhibited greater variance in their learning curves compared to their Follow Dynamics counterparts under the same settings. This implies that more complex scenarios pose greater challenges in reaching optimal results.
5.17 Cases with Full-Observable settings demonstrate more stable learning curves than the Part-Observable ones. This suggests that a target agent with complete network visibility is likely to learn strategies more effectively and with greater stability than one with limited visibility.
5.18 The Narrow and Creative settings, when applied to the Part-Observable cases, presented distinct learning curves. The Narrow setting consistently outperformed the Creative setting, as evidenced by higher reward values. However, in the Full-Observable scenarios, the learning curves for both the Narrow and Creative settings align closely. A possible explanation is that, in Part-Observable scenarios, the unpredictable nature of agents in the Creative setting leads to lower performance compared to the Narrow setting. Conversely, in Full-Observable scenarios, the target agent is better able to adapt to and anticipate the Creative setting's variable patterns.
5.19 In summary, these results highlight the impact of observability and complexity in reinforcement learning environments, with Full-Observable settings and simpler dynamics facilitating more stable and efficient learning outcomes. The Narrow setting's predictable nature resulted in higher performance than the unpredictable Creative setting. However, when observability was not a constraint, as in the Full-Observable scenarios, both settings showed similar performance. This suggests that with full visibility, agents can effectively adapt to even unpredictable dynamics.
Conclusion
6.1 In this section, we reflect on our contributions, future work, ethical considerations, and limitations.
6.2 We have introduced a novel application of RL within a simulated discussion environment to model individual behavior in the context of social media. Our findings show that limiting the action space and incorporating self-observation are critical factors for generating stable results in dynamic environments. These insights demonstrate the adaptability of the RL model to diverse conditions and its effectiveness in identifying optimal strategies. Additionally, the RL results offer a deeper understanding of both individual and collective behavior on digital communication platforms.
6.3 Our future work will be directed towards enhancing the realism and variability of our simulated social media environment. This may involve incorporating more sophisticated interaction rules to better mirror real-world social media dynamics. Additionally, we plan to explore how different character profiles of agents perform and interact within the network. This exploration aims to provide a deeper understanding of the varied behaviors and their impacts on the overall dynamics of social media interactions.
6.4 A potential application is providing valuable insights for digital marketing strategists and policymakers into the dynamics of online opinion formation and influence.
6.5 However, we acknowledge that one potential negative social impact could be the misuse of the simulated environment for manipulating public opinion. If used unethically, the technology could be used to craft highly persuasive and targeted misinformation campaigns on social media platforms. Conversely, this technology can also be instrumental in detecting malicious activities in social networks.
6.6 One limitation of our research is the inherent unpredictability of actual social media discussions. Social dynamics are complex and unpredictable, and our rule-based simulated environment cannot fully replicate them. Although the pre-trained RL algorithm demonstrates promising performance in the simulated environment, it presents only one version of the possible actual outcomes. Despite this, our research provides valuable insights into the mechanics of social interaction and opinion dynamics in online groups.
Appendix A: GPT API Setting
               Creative          Narrow
max_tokens     2000              2000
temperature    1.4               0.1
top_p          0.95              0.5
n              1                 1
stop           "}"               "}"
model          "gpt-3.5-turbo"   "gpt-3.5-turbo"

Table 2: GPT API parameters for the Creative and Narrow cases
Table 2 outlines the parameters set for the GPT API, distinguishing between the 'creative' and 'narrow' case configurations. Within these settings, higher temperature and top-p values are indicative of a setup geared towards fostering creativity. For tasks related to post generation and opinion elicitation, we employ the pre-trained GPT-3.5 model developed by OpenAI (OpenAI 2022).
Appendix B: Reinforcement Learning Setting Parameters
Parameters for the reinforcement learning training are listed in Table 3.
learning rate 0.01
discount rate 0.99
epsilon 0.005
exploration rate 0.01
Table 3: RL Training Parameters
Appendix C: Opinion Example
In this section, we present illustrative examples of 'strong pro' and 'strong con' posts concerning the concept of a 'society with no gender'. These examples are meant to elucidate the polarized viewpoints on this topic.
A ’strong pro’ argument is presented as follows:
"Any violent behavior targeted at the traits now associated with gender would be illogical, because
that would assume that gender stereotypes would persist without gender. Such an assumption
would almost certainly also include the academically false assumption that gender is biological."
Conversely, a 'strong con' perspective is shown in the following statement:
"It won’t disappear. The idea of Sexist Violence in the eyes of society might, but that’s only because
what we call Gender-based violence is just that... Violence. What you would see in a drop in Gender
based violence, there would be a spike in generalized Violence to take its place. Its only taking a
tag of something and putting it into a generalized pool of everything else. It helps no one and Dis-
benefits everybody involved. So a society without gender is a really bad idea."
Appendix D: Opinion Category Setting
Category      Polarity score   Number of posts
Strong pro    0.64–1           83
Pro           0.58–0.64        89
Neutral       0.51–0.58        128
Con           0.45–0.51        93
Strong con    0–0.45           93

Table 4: Gender opinion category setting
Category      Polarity score   Number of posts
Strong pro    0.55–1           101
Pro           0.48–0.55        128
Neutral       0.42–0.48        145
Con           0.37–0.42        102
Strong con    0–0.37           184

Table 5: Drug opinion category setting
References
Betz, G. (2021). Natural-language multi-agent simulations of argumentative opinion dynamics. arXiv preprint
arXiv:2104.06737
Briguglio, L., Nesse, P.-J., Di Giglio, A., Occhipinti, C., Durkin, P. & Markopoulos, I. (2021). Business value and
social acceptance for the validation of 5g technology. In 2021 IEEE International Mediterranean Conference
on Communications and Networking (MeditCom), (pp. 132–137). IEEE
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G.,
Askell, A. et al. (2020). Language models are few-shot learners. Advances in neural information processing
systems,33, 1877–1901
Hamilton, S. (2023). Blind judgement: Agent-based supreme court modelling with gpt. arXiv preprint
arXiv:2301.05327
Jackson, J. C., Rand, D., Lewis, K., Norton, M. I. & Gray, K. (2017). Agent-based modeling: A guide for social
psychologists. Social Psychological and Personality Science,8(4), 387–395
Luo, B., Liu, D. & Wu, H.-N. (2017). Adaptive constrained optimal control design for data-based nonlinear
discrete-time systems with critic-only structure. IEEE Transactions on Neural Networks and Learning Systems,
29(6), 2099–2111
Meyer, S., Elsweiler, D., Ludwig, B., Fernandez-Pichel, M. & Losada, D. E. (2022). Do we still need human as-
sessors? prompt-based gpt-3 user simulation in conversational ai. In Proceedings of the 4th Conference on
Conversational User Interfaces, (pp. 1–6)
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G. et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533
OpenAI (2022). Gpt-3.5. https://openai.com/
Park, J. S., Popowski, L., Cai, C., Morris, M. R., Liang, P. & Bernstein, M. S. (2022). Social simulacra: Creating populated prototypes for social computing systems. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, (pp. 1–18)
Sekulić, I., Aliannejadi, M. & Crestani, F. (2022). Evaluating mixed-initiative conversational search systems via user simulation. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, (pp. 888–896)
Sert, E., Bar-Yam, Y. & Morales, A. J. (2020). Segregation dynamics with reinforcement learning and agent based
modeling. Scientific reports,10(1), 11771
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489
Thorndike, E. L. (1898). Animal intelligence: An experimental study of the associative processes in animals. The
Psychological Review: Monograph Supplements,2(4), i
Watkins, C. J. & Dayan, P. (1992). Q-learning. Machine learning,8, 279–292
Zhou, H., Zeng, D. & Zhang, C. (2009). Finding leaders from opinion networks. In 2009 IEEE International Con-
ference on Intelligence and Security Informatics, (pp. 266–268). IEEE