Investigating Reddit to detect subreddit and author stereotypes and
to evaluate author assortativity
Francesco Cauteruccio1, Enrico Corradini2, Giorgio Terracina1, Domenico Ursino2∗, Luca
Virgili2
1DEMACS, University of Calabria,
2DII, Polytechnic University of Marche,
∗Corresponding author
{cauteruccio, terracina}@mat.unical.it; {e.corradini, l.virgili}@pm.univpm.it;
d.ursino@univpm.it
Abstract
In recent years, Reddit has attracted the interest of many researchers due to its popularity all
over the world. In this paper, we aim at providing a contribution to the knowledge of this social
network by investigating three of its aspects, interesting from the scientific viewpoint, and, at the
same time, by analyzing a large number of applications. In particular, we first propose a definition
and an analysis of several stereotypes of both subreddits and authors. This analysis is coupled
with the definition of three possible orthogonal taxonomies that help us to classify stereotypes in
an appropriate way. Then, we investigate the possible existence of author assortativity in this
social medium; specifically, we focus on co-posters, i.e., authors who submitted posts on the same
subreddit.
Keywords: Reddit; Author Stereotypes; Community Stereotypes; Assortativity; Social Network
Analysis; Subreddit Lifespan
1 Introduction
Reddit¹ is a heterogeneous crowd-sourced news aggregator and online social platform, originally
self-declared as “the front page of the Internet”. It was founded in 2005 and, in a few years, has become an
ecosystem of 430M+ average monthly active users². At the time of writing, it ranks 19th and 5th in
Alexa’s top 500 global and US websites, respectively³. Reddit is built on the concept of subreddit,
which is an interest-based community where users can post content and comment on it. A subreddit is
identified by a name and is referred to using the /r/ prefix within Reddit, such as /r/science and
/r/cats. Currently, there are more than 1.9M subreddits⁴. They are mainly topical, although more
general cases exist.
¹ https://www.reddit.com
² https://www.redditinc.com
³ https://www.alexa.com/topsites
In Reddit, users can submit content in the form of text, images and links to external resources.
Submitted items (also simply called posts) can be read by other users and discussed via comments.
Users can subscribe to multiple subreddits in order to receive the latest posted content on their
front pages. An important feature of Reddit is voting, which is the mechanism affecting the
visibility and the ranking of both posts and comments. In fact, users are allowed to upvote or downvote
posts of other users, so that each submission has a score. This is a metric based on the difference
between the number of upvotes and the number of downvotes, and it significantly affects the order
in which posts and comments are shown to users. However, the exact numbers of upvotes and
downvotes are not shown publicly.
Due to the great expansion of Reddit in recent years, many researchers all over the world
have been attracted by this social platform. An overview of the studies on Reddit can be found in
[35], whereas an interesting longitudinal analysis of the evolution of this social medium is presented
in [43]. Authors have analyzed, and are continuously analyzing, many aspects of Reddit, ranging
from community structures and interactions [46, 18, 22] to user behavior [12, 27], from the analysis
of the structure and content of subreddits, posts and comments [42] to the analysis of the structural
properties of Reddit when it is seen as a social network [22]. Other specific topics, such as text
classification [30], user migration [38], political and ideological aspects [24], have also been studied.
In this paper, we aim to contribute to the knowledge of Reddit by investigating
subreddit and author stereotypes and by evaluating author assortativity in this social platform.
The term “stereotype” comes from the combination of two Greek words, namely “stereos” (i.e.,
solid) and “typos” (i.e., impression). It is adopted to indicate a popular belief about specific groups
of individuals. This term first appeared in the field of printing at the end of the 18th century. Later, it was
introduced into modern psychology at the beginning of the 20th century by Walter Lippmann [31]. The
tendency to classify people into groups and to associate each group with a “general idea”, a “label”
(and, ultimately, a stereotype) is intrinsic to the human mind. As a result, many (both positive and
negative) stereotypes have been defined in the history of humanity, in the most disparate areas. Think,
for instance, of the stereotypes coined in sport, art, literature, and so on. With the capillary spread
of the Web, the practice of coining and using stereotypes has extended from real life to Cyberspace
[20, 28]. As the Web became increasingly interactive, with the transition to the Web 2.0 and, above all,
with the appearance of social networks, the adoption of stereotypes in Cyberspace became more
and more evident [49, 40, 17, 45, 41, 9]. For example, in Facebook, one can encounter stereotypes
like “Lime-Lighters”, “Emo’s”, “Philosophy Majors”, “Hopeless Romantics”, “Ghosts”, “Stalkers”,
“Addicts”, and so forth [2]. Similarly, Instagram also presents a wide range of stereotypes [1]. We
argue that stereotypes do not necessarily have a negative meaning, as often happens in real life.
On the contrary, they can be extremely useful in everyday communications and interactions in social
networks. In this paper, we want to go one step further; in fact, we claim that it is possible to define
“scientific” stereotypes that could be used in scientific applications. We also believe that Reddit is
well suited to our goal and that, in this context, besides defining stereotypes for the authors of Reddit, it is
also possible to introduce stereotypes for subreddits.
⁴ https://redditmetrics.com/history
The concept of “assortativity” or “assortative mixing” in a social network was introduced in a
famous paper by Newman [39]. It is strictly related to the concept of homophily [34] and indicates the
tendency of a network node to connect to other nodes that are somehow similar to it. Several possible
similarities could be considered in assortativity, but the most investigated one is node degree. Newman
focused on degree assortativity and defined a network as assortative if its nodes having many
connections tend to be connected to other nodes with many connections. He showed that social networks
are often assortatively mixed, whereas technological and biological networks tend to be disassortative.
After Newman, other authors investigated assortativity in several social networks, such as Facebook
[10], Twitter [8], Cyworld, Orkut and MySpace [3]. They found that: (i) Cyworld is disassortative
with respect to friendship and very assortative with respect to the “testimonial” relationship; (ii)
Orkut is assortative with respect to friendship; (iii) MySpace is neutral with respect to the same
relationship; (iv) Twitter is strongly assortative with respect to shared interests of users; (v) Facebook
is assortative with respect to the tendency of a bridge (i.e., a user who has joined multiple social networks) to
communicate with other bridges. In this paper, we extend the assortativity analysis to Reddit, which
was only marginally considered in past studies on this topic. We first consider degree
assortativity because it has been the most studied form in the past. Then, we also analyze eigenvector assortativity.
We show that Reddit is assortative with respect to both these centralities, which confirms that
this social platform also follows Newman’s hypothesis concerning the existence of assortative mixing
in social networks.
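To make the notion of degree assortativity concrete, the following minimal sketch computes it as the Pearson correlation between the degrees found at the two ends of each edge. It is written in plain Python for illustration only; the function name and the two toy graphs are hypothetical and are not taken from our experiments.

```python
def degree_assortativity(edges):
    """Pearson correlation between the degrees at the two ends of each edge."""
    # Node degrees in an undirected simple graph.
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Each undirected edge contributes both orderings, (u, v) and (v, u).
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]
    n = len(xs)
    mean = sum(xs) / n  # xs and ys share the same mean by symmetry
    covariance = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    variance = sum((x - mean) ** 2 for x in xs) / n
    # Undefined (division by zero) when all nodes have the same degree.
    return covariance / variance


# Toy graph 1: a star, where one hub is linked to five leaves.
star = [("hub", leaf) for leaf in "abcde"]
# Toy graph 2: two separate cliques of different sizes.
cliques = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (4, 5)]

print(degree_assortativity(star))     # negative: high degree meets low degree
print(degree_assortativity(cliques))  # positive: equal degrees meet each other
```

In the star, every edge joins the highest-degree node to a node of degree 1, which is the disassortative extreme; in the second graph, every edge joins two nodes of equal degree, which is the assortative extreme.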
The significance and value of this paper concern both the theoretical and the application
viewpoints. From the theoretical point of view, this is the first paper that studies the concept of stereotype
in Reddit; so far, approaches for the characterization and identification of specific traits of users have
been presented independently in different scientific works: users showing multi-community engagement
[46], anti-social behaviors [18], community opposers [29], “answer-persons” [12], and “explorers” [26]
are some examples. It is also the first paper that proposes a study of the concept of assortativity in
Reddit. In fact, this concept had been investigated for a wide variety of social platforms in the past,
such as Facebook [10], Twitter [8], Cyworld, MySpace and Orkut [3], but no author had focused
on analyzing it in Reddit. As far as the application point of view is concerned, we highlight
that the knowledge patterns on stereotypes and author assortativity extracted in this paper can be
employed in a large variety of contexts. Just to cite a few of them, we mention: (i) the definition of
some guidelines to follow in order to make a subreddit successful; (ii) the definition and implementation of
different categories of recommender systems for Reddit; (iii) the definition of an algorithm that finds
subreddits to merge or, at least, to integrate; (iv) the detection of possible targets for an advertising
campaign; (v) the definition of an algorithm that builds blacklists of users based on author stereotypes.
The outline of this paper is as follows. In Section 2, we describe related literature. In Section 3, we
present an overview of our investigation activity and describe the dataset adopted in our experiments.
In Section 4, we present several preliminary analyses concerning posts, comments and users in Reddit.
In Section 5, we illustrate the activities performed to detect subreddit stereotypes and to determine
their features. In Section 6, we describe the same tasks but performed to detect author stereotypes.
In Section 7, we analyze author assortativity in Reddit. In Section 8, we present a discussion of
the obtained results. In Section 9, we describe some possible applications of the knowledge we extracted
in the previous sections. Finally, in Section 10, we draw our conclusions and have a look at future
developments concerning our research.
2 Related work
The study of social networks has rapidly become a core research field, thanks to its interdisciplinary
aspects [32, 15, 19, 4, 7, 13]. Indeed, many researchers of different disciplines, such as computer
scientists, sociologists and anthropologists, have shown a huge interest in social network analysis [33, 11,
14]. In this context, Reddit is an invaluable source of information, insights and research possibilities.
Indeed, it is a prosperous environment, where users share content and interact with each other. The
heterogeneous nature of Reddit, together with the openness and the richness of its data, has encouraged
the scientific community to explore the twists and turns of this platform.
The swift increase of scientific literature related to Reddit has produced a fair number of papers
with several goals and methodologies. In [35], the authors present an overall survey on Reddit, which
illustrates several studies on this social network, spanning from 2005 to 2018. An interesting
longitudinal analysis of the evolution of Reddit is presented in [43].
As pointed out in the Introduction, one of the main theoretical contributions of this paper is the
study of the concept of author stereotype in Reddit, and the definition and characterization of several
stereotypes of interest. As a matter of fact, in past literature, approaches for the characterization and
identification of specific traits of users have been presented in different papers. Some of the considered
traits are: users presenting multi-community engagement [46], anti-social behaviors [18], community
opposers [29], “answer-persons” [12], and “explorers” [26]. The main contribution of our work with
respect to these proposals is a systematic study of several traits of users, which are summarized in a
wide spectrum of stereotypes and in a suitable classification of them.
In more detail, the “multi-community interaction” trait is studied in [46], where the authors analyze
the evolution of the communities in which users post during their Reddit “life”. They find that
Reddit users continually post in new communities; moreover, those who leave a community are destined
to do so from the very beginning of their history. Social and anti-social behaviors are analyzed in
[18], where the authors apply a definition that extends Brunton’s construct of spam in order to separate
norm-compliant behaviors from norm-violating ones. This approach also investigates inter-community
conflicts by associating social and anti-social homes with users. Conflicts between users are also studied in
[29], but from a different point of view. Here, the authors analyze inter-community interactions across
36,000 communities and focus on cases where users of one community, driven by a negative sentiment,
submit comments in another community. They highlight how such conflicts actually emerge from a
very small number of communities and discuss strategies for predicting conflicts and mitigating their
negative impacts. The presence of users showing the trait of “answer-person” in Reddit is explored in
[12], where the authors define an automated method based on user interactions for identifying this role,
while avoiding expensive content analysis. Finally, in [26], the authors present a study regarding highly
related communities; in this analysis, they define the characteristics of explorers and non-explorers by
adopting a specific taxonomy.
The studies and approaches outlined above have been developed considering several communities
and subreddits. In [27], a specific subreddit about online User Experience (/r/userexperience) is
studied. Here, members socialize and learn together. The authors of this study identify five distinct social
roles, namely the “knowledge broker” (i.e., a member that introduces knowledge to the community
by sharing links), the “translator” (i.e., a member that brings academic knowledge into the
community), the “conversation facilitator”, the “experienced practitioner”, and the “learner”. Even if the
contribution of [27] is particularly interesting because it considers several facets of users’
characterization (and, in this respect, it is similar to our work), these classes are specific to, and valid only for, the analyzed
community. On the contrary, the author stereotypes introduced in our approach cover a wide range
of possible facets of users’ behavior, with no limitation on the kind and number of subreddits the users
interact with.
As a final remark about stereotyping in the literature, it is worth observing that our proposal
introduces both author and subreddit stereotypes. To the best of our knowledge, the definition of
subreddit stereotypes received no attention in the literature and, consequently, it represents a step
forward in the research on Reddit.
As far as this last aspect is concerned, we pointed out in the Introduction that one of the main
potential applications of subreddit stereotyping is the definition of guidelines in order to make a
subreddit successful. With respect to this topic, some papers studied how to predict the success of a
subreddit or, more generally, of a community from different perspectives. In particular, the authors
of [16] investigate the success and group dynamics of online communities, focusing on Reddit ones.
In detail, they identify four success measures desirable for most communities, spanning from the
growth of the numbers of members to the volume of activities within the community, and capturing
different kinds of success. They also investigate the prediction of the final success of a new community.
Furthermore, the authors of [48] present a broad exploration of posts, with a particular interest in
comments. Here, they aim at fulfilling three different tasks. The first is analyzing a comment thread
by looking at its topical structure and evolution; the second consists of exploiting comment threads to
enhance web search; the third aims at distilling useful features to predict the final score of a comment.
Finally, in [42], the authors investigate both the behavioral context of user posting and the polarization
of user responses.
The main difference between the above-mentioned approaches and the stereotyping activity
proposed in this paper is that the former observe the evolution of communities and, possibly, predict their
success, whereas the latter could be used to provide guidelines for promoting specific actions to
obtain the desired success. From a data analytics point of view, the former focus on descriptive and
predictive analytics, whereas the latter also performs diagnostic and prescriptive analytics.
As pointed out in the Introduction, another contribution of this paper is the study of assortativity
in Reddit. While this topic has been analyzed with reference to other social platforms [10, 3, 8], only
a few works have marginally analyzed it on Reddit. In particular, in [25], the authors focus on studying loyal
communities, finding that they tend to be less assortative as their interaction level increases.
In this case, assortativity is studied on monthly interaction networks, where users are considered
connected if they submit a comment in the same comment chain with a gap of at most two comments.
The authors also carry out a comparison with a null model and find that the difference between
loyal communities and their random counterparts disappears. This result implies that users in loyal
communities tend to interact with dissimilar users as a consequence of the community’s activity.
Actually, in [25], assortativity is used as a tool for characterizing loyal communities, studying single
chains of comments. On the contrary, we study assortativity from a more general point of view, in
order to provide an overall characterization of Reddit users across several subreddits and comments.
Furthermore, we study both degree assortativity and eigenvector assortativity.
Another work marginally related to our study on assortativity in Reddit is presented in [22]. Here,
the authors discuss the rise of new trends in complex networks by looking at vertices that “shine” (i.e.,
high-degree vertices), also called network stars. They study the evolution of some complex networks,
with Reddit among them. They analyze the temporal dynamics of the networks by looking at how
different features, such as density and average clustering coefficient, change over time. Clearly, [22]
and our paper are quite different. Indeed, unlike the approach of [22], our assortativity
definition does not allow the analysis of temporal dynamics, which is the main goal of [22]. On the
other hand, it helps to characterize the tendency of users to associate with each other.
Other works, marginally related to our proposal, focus on the study of specific aspects of subreddits
or user behaviors. For instance, in [30], the authors use text classification and computational critical
discourse analysis to distinguish and interpret ideological differences between subreddits. In [50], the
authors present a study regarding a quantitative, language-based typology of communities’ identity,
revealing how several social phenomena manifest across communities. The introduced taxonomy is
based on two aspects of community identity, i.e., distinctiveness and dynamicity. User migration is
studied in [38]. Here, Reddit is examined during a period of community unrest in order to identify
the motivations for this kind of behavior. Political and ideological aspects emerging in Reddit are
discussed in [24, 5, 23, 44]. Finally, in [21], the authors present a mixed-method study of 100,000
subreddits and their rules in order to define effective mechanisms for community governance.
3 Overview of our investigation activity
After having defined the motivations and objectives of the analyses described in this paper, in this
section we present an overview of our investigation activity. We start by depicting the overall structure
of Reddit in Figure 1. In the left part of this figure, each rounded box represents a subreddit. The
central part shows a list of posts in the example subreddit /r/subreddit, where each color identifies a
different type of posts (text, image or link to an external resource). Finally, the right part illustrates
the structure of a post, including its title and its comments, which are presented as a tree having the
post as root.
Figure 1: A graphical overview of Reddit structure
The workflow of the tasks that make up our investigation activity is shown in
Figure 2. For layout reasons, in this figure, the dataset appears as an input to the Descriptive
Analysis module only. Actually, it is to be considered as an input to all the modules of the workflow.
Similarly, descriptive knowledge patterns are also to be considered as an input to the Assortativity
Analysis module. Finally, both descriptive knowledge patterns and stereotype knowledge patterns
(analogously to what is explicitly shown in Figure 2 for the assortativity knowledge patterns) are also to be
considered as outputs of our investigation activity.
Figure 2: The workflow representing the tasks of our investigation
As shown in Figure 2, the first phase of our investigation consists of a descriptive analysis of
all those features of Reddit that can affect the investigation of both stereotypes and assortativity.
We start with some preliminary investigations on Reddit data. They focus on three aspects, namely
posts submitted to subreddits, comments under these posts and, finally, users who created a subreddit,
posted or commented. The aim of this preliminary descriptive analysis was not to discover new specific
knowledge about Reddit. Instead, it allowed us to better understand the dataset, and to check if
some theoretical trends, which should have characterized these aspects on Reddit, were verified on
it. Furthermore, the results found, which were partially expected, represented the starting point of
the next knowledge detection activities. They were also useful to explain the knowledge patterns
extracted.
These knowledge patterns, together with the dataset, are given as input to the second module of
our workflow, which carries out the extraction and analysis of stereotypes. In order to extract and
analyze subreddit stereotypes, we first investigate the lifespan of a subreddit, depicting its typical
characteristics. Then, starting from this, we identify several subreddit stereotypes and, finally, we
define and apply three orthogonal taxonomies in order to characterize them. After the analysis of
subreddit stereotypes, we proceed similarly for author stereotypes. In particular, we extract several
author stereotypes and, then, we classify them according to some orthogonal taxonomies that we
define for this purpose.
The information returned by this module, especially that extracted from the analysis of author
stereotypes, is given as input to the third module of the workflow, which deals with assortativity
analysis. We performed this analysis on Reddit authors, for both degree and eigenvector
assortativity, to verify whether the authors who are very active in Reddit tend to form a backbone. This module
first builds an appropriate social network whose nodes represent the authors and whose arcs denote
the co-posting activity. Afterwards, it analyzes the assortativity of Reddit authors
with respect to degree and eigenvector centrality. The patterns derived during this phase represent the last
type of knowledge pattern returned as output by our approach.
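The third module can be illustrated with the following minimal sketch, which builds a co-posting network from (author, subreddit) pairs and measures assortativity with respect to degree and to eigenvector centrality. It is a plain-Python illustration under stated assumptions, not the implementation used in our experiments: the network is assumed to be undirected and unweighted, eigenvector centrality is approximated by a fixed number of power iterations, and all function names and sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def co_posting_edges(posts):
    """Link two authors when they posted on at least one common subreddit."""
    authors_by_subreddit = defaultdict(set)
    for author, subreddit in posts:
        authors_by_subreddit[subreddit].add(author)
    edges = set()
    for authors in authors_by_subreddit.values():
        # Sorting gives every unordered pair a single canonical form.
        edges.update(combinations(sorted(authors), 2))
    return edges


def neighbors_of(edges):
    """Adjacency sets of the undirected graph described by `edges`."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    return neighbors


def eigenvector_centrality(edges, iterations=200):
    """Power iteration on A + I (the shift avoids oscillation on bipartite graphs)."""
    neighbors = neighbors_of(edges)
    score = {node: 1.0 for node in neighbors}
    for _ in range(iterations):
        new = {n: score[n] + sum(score[m] for m in neighbors[n]) for n in neighbors}
        norm = sum(value * value for value in new.values()) ** 0.5
        score = {n: value / norm for n, value in new.items()}
    return score


def assortativity(edges, value):
    """Pearson correlation of `value` at the two ends of every edge."""
    pairs = [(value[u], value[v]) for u, v in edges]
    pairs += [(y, x) for x, y in pairs]  # each edge counts in both directions
    mean = sum(x for x, _ in pairs) / len(pairs)
    covariance = sum((x - mean) * (y - mean) for x, y in pairs)
    variance = sum((x - mean) ** 2 for x, _ in pairs)
    return covariance / variance


# Hypothetical (author, subreddit) pairs standing in for the real posts.
posts = [("ann", "cats"), ("bob", "cats"), ("cy", "cats"), ("ann", "science"),
         ("dee", "science"), ("eve", "news"), ("dee", "news")]
edges = co_posting_edges(posts)
degree = {node: len(adjacent) for node, adjacent in neighbors_of(edges).items()}
r_degree = assortativity(edges, degree)
r_eigen = assortativity(edges, eigenvector_centrality(edges))
print(sorted(edges))
print(r_degree, r_eigen)
```

Enumerating all pairs of co-posters is quadratic in the number of authors of a subreddit, so on a dataset of realistic size a sketch like this would need sampling or thresholding of very large subreddits.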
After having described the workflow representing our investigation activity, we now illustrate in
more detail the characteristics of our dataset. All the data required for our activity was downloaded
from the pushshift.io website, which is one of the best-known Reddit data sources. Our dataset
contains all the posts published on Reddit from January 1st, 2019 to September 1st, 2019. All the
posts written in a month were added to the dataset at the end of the next month. The number of
posts available for our investigation was 150,795,895. For each post, we considered the following set
of attributes: id, subreddit, title, author, created_utc, score, num_comments and over_18.
In order to carry out our experiments, we used a server equipped with 16 Intel Xeon E5520
CPUs and 96 GB of RAM with the Ubuntu 18.04.3 operating system. We adopted Python 3.6
as programming language, its library Pandas to perform ETL operations on data, and its library
NetworkX to perform operations on networks.
During the ETL phase, we observed that some of the available posts referred to authors who had
left Reddit. We decided to remove these posts from our dataset. At the end of this last activity, the
number of posts at our disposal was 122,568,630.
We computed the number of authors who submitted these posts; it was equal to 12,464,188. Then,
we found the number of the subreddits which they referred to; it was equal to 1,356,069.
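The ETL steps above can be sketched as follows. This is a minimal plain-Python illustration rather than the Pandas pipeline we actually used; it assumes that posts arrive as newline-delimited JSON records and that a departed author is marked with the placeholder "[deleted]". Both are assumptions of this sketch, as are the function name and the sample records.

```python
import json

# The eight attributes considered for each post.
FIELDS = ("id", "subreddit", "title", "author",
          "created_utc", "score", "num_comments", "over_18")


def load_posts(lines):
    """Parse JSON lines, keep the chosen attributes, drop departed authors."""
    posts = []
    for line in lines:
        record = json.loads(line)
        # Assumption: authors who left Reddit appear as "[deleted]".
        if record.get("author") == "[deleted]":
            continue
        posts.append({field: record.get(field) for field in FIELDS})
    return posts


# Hypothetical sample records, not taken from the real dataset.
raw = [
    '{"id": "a1", "subreddit": "cats", "title": "Hi", "author": "ann", '
    '"created_utc": 1546300800, "score": 1, "num_comments": 0, "over_18": false, "extra": 1}',
    '{"id": "a2", "subreddit": "cats", "title": "Yo", "author": "[deleted]", '
    '"created_utc": 1546300900, "score": 3, "num_comments": 2, "over_18": false}',
    '{"id": "a3", "subreddit": "science", "title": "Paper", "author": "bob", '
    '"created_utc": 1546301000, "score": 7, "num_comments": 4, "over_18": false}',
]
posts = load_posts(raw)
authors = {post["author"] for post in posts}
subreddits = {post["subreddit"] for post in posts}
print(len(posts), len(authors), len(subreddits))
```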
4 Preliminary investigations on Reddit data
In this section, we describe some preliminary investigations that we performed on Reddit. As pointed
out in Section 3, these are not the core of our paper, but they confirmed the suitability of our
dataset. Furthermore, some knowledge extracted here was extremely useful in the analyses described
in the next sections. We group the following analyses into three subsets, which regard posts, comments,
and authors, respectively. We describe each subset in a separate subsection.
4.1 Investigation on posts
We started this investigation by performing the following analyses on posts:
• distribution of subreddits against posts (Figure 3); it follows a power law with α = 1.651 and
δ = 0.014;
• distribution of authors against posts (Figure 4); it follows a power law with α = 1.431 and
δ = 0.016;
• distribution of posts against scores (Figure 5); it follows a power law with α = 1.600 and
δ = 0.005.
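As an illustration of how such distributions and exponents can be obtained, the sketch below counts how many subreddits (or authors) have exactly k posts and estimates a power law exponent α with the standard continuous maximum-likelihood formula α = 1 + n / Σ ln(x_i / x_min). This estimator is a swapped-in technique chosen for the example: it is not necessarily the procedure behind the values reported above, it is only an approximation for discrete counts, and it does not reproduce δ. Function names and sample data are hypothetical.

```python
from collections import Counter
from math import log


def distribution_against_posts(keys):
    """Map k to the number of entities (e.g., subreddits) with exactly k posts."""
    posts_per_entity = Counter(keys)
    return Counter(posts_per_entity.values())


def power_law_alpha(values, x_min=1.0):
    """Continuous maximum-likelihood estimate of the exponent for x >= x_min."""
    tail = [x for x in values if x >= x_min]
    return 1.0 + len(tail) / sum(log(x / x_min) for x in tail)


# One entry per post: the subreddit it was submitted to (hypothetical sample).
subreddit_of_each_post = ["cats", "cats", "science", "news"]
print(distribution_against_posts(subreddit_of_each_post))

# Deterministic synthetic sample drawn from a power law with a known exponent,
# obtained by inverse-transform sampling on an evenly spaced grid.
alpha_true, n = 1.651, 10000
samples = [(1 - (i + 0.5) / n) ** (-1 / (alpha_true - 1)) for i in range(n)]
alpha_hat = power_law_alpha(samples)
print(alpha_hat)  # close to alpha_true
```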
The maximum number of posts with the same score is 51,721,824. Interestingly, these posts have
a score equal to 1. Instead, the number of posts with a score equal to 0 or 2 is much smaller.
This trend can be explained considering that a post submitted on Reddit starts with a score of 1. As
a consequence, when no other author upvotes or downvotes it, the final score of the post is 1.
Figure 3: Distribution of subreddits against posts (log-log scale)
Figure 4: Distribution of authors against posts (log-log scale)
We also observe that no post has a negative score. This is because Reddit shows and
returns a score equal to 0 for a post whenever the number of downvotes is higher than the number of
upvotes, i.e., also when the real score of the post is negative. So, posts with a score equal to 0 are, to
all intents and purposes, to be regarded as “negative” posts.
At this point, we also computed:
• the distribution of authors against negative posts (Figure 6); it follows a power law with α =
2.274 and δ = 0.030;
• the distribution of authors against positive posts (Figure 7); it follows a power law with α = 2.074
and δ = 0.014.
Figure 5: Distribution of posts against scores (log-log scale)
Figure 6: Distribution of authors against negative posts (log-log scale)
Comparing these two distributions, we found that the number of positive posts is about 16 times the
number of negative ones.
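The split between negative and positive posts can be sketched as follows. The sketch assumes that a post is "negative" when its returned score is 0, as discussed above, and "positive" when its score is at least 1; the latter threshold is an assumption of this example. The function name and the sample posts are hypothetical.

```python
from collections import Counter


def split_by_score(posts):
    """Count, for each author, the negative (score 0) and positive (score >= 1) posts."""
    negative, positive = Counter(), Counter()
    for post in posts:
        if post["score"] == 0:
            negative[post["author"]] += 1
        else:
            # Assumption: every post with a score of at least 1 is "positive".
            positive[post["author"]] += 1
    return negative, positive


# Hypothetical sample, not taken from the real dataset.
posts = [
    {"author": "ann", "score": 0},
    {"author": "ann", "score": 12},
    {"author": "bob", "score": 1},
    {"author": "bob", "score": 3},
]
negative, positive = split_by_score(posts)
ratio = sum(positive.values()) / sum(negative.values())
print(negative, positive, ratio)
```

The two counters are the per-author counts from which the distributions of authors against negative and positive posts are derived.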
Figure 7: Distribution of authors against positive posts (log-log scale)
4.2 Investigation on comments
As for this investigation, we computed:
• The distribution of subreddits against comments (Figure 8); it follows a power law with α = 1.730
and δ = 0.015.
• The distribution of the average number of comments against the scores of the posts they refer
to (Figure 9). Interestingly, in this case, we have a roughly Gaussian distribution, whose mean
is at a score near 50,000. The distribution presents several outliers. For instance, for a score
equal to 79,470, we have a post with a number of comments equal to 71,225.
• The distribution of posts against comments (Figure 10); it follows a power law with α = 1.455
and δ = 0.011.
Finally, we considered the 150 posts with the highest number of comments and the subreddits
they were submitted to. We obtained only 31 subreddits. Then we computed the average number of
comments for all the posts submitted in each of these subreddits. The results obtained are reported
in Figure 11. From the analysis of this figure, we can observe that the distribution is very irregular.
It decreases quickly for the first three subreddits, very slowly for the next 13 subreddits, quickly for
the next 9 subreddits and, finally, it suddenly drops and becomes almost zero.
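The selection just described can be sketched in a few lines: take the N most commented posts, collect the groups (here, subreddits) they belong to, and average the number of comments over all the posts of each group. This is a plain-Python illustration with hypothetical names and data; the `key` parameter is introduced here only so that the same function can group by another attribute, such as the author.

```python
from collections import defaultdict
from heapq import nlargest


def average_comments_of_top_groups(posts, key, top_n=150):
    """Average comments per group, for the groups hosting the top_n most commented posts."""
    top_posts = nlargest(top_n, posts, key=lambda post: post["num_comments"])
    groups = {post[key] for post in top_posts}
    totals, counts = defaultdict(int), defaultdict(int)
    for post in posts:  # the average covers all posts of each selected group
        if post[key] in groups:
            totals[post[key]] += post["num_comments"]
            counts[post[key]] += 1
    averages = {group: totals[group] / counts[group] for group in groups}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample, with top_n reduced to 2 to keep the example small.
posts = [
    {"subreddit": "a", "num_comments": 100},
    {"subreddit": "a", "num_comments": 20},
    {"subreddit": "b", "num_comments": 50},
    {"subreddit": "c", "num_comments": 10},
]
result = average_comments_of_top_groups(posts, key="subreddit", top_n=2)
print(result)
```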
4.3 Investigation on authors
First, we determined the distribution of authors against subreddits (Figure 12). It follows a power
law with α= 1.702 and δ= 0.081.
Afterwards, we selected the 150 posts with the highest number of comments and the corresponding
authors. Interestingly, we had only 26 authors for all the 150 posts. These can be considered as the
most commented authors in Reddit and, maybe, they are influencers. Then, we computed the average
number of comments for all the posts each author submitted. The results obtained are reported in
Figure 13. From the analysis of this figure, we can observe that the decrease of the distribution is
roughly stepwise.
Figure 8: Distribution of subreddits against comments (log-log scale)
Figure 9: Distribution of the average number of comments against the scores of the posts they refer
to
Figure 10: Distribution of posts against comments (log-log scale)
Figure 11: Distribution of the average number of comments submitted to the subreddits receiving the
150 most commented posts
Figure 12: Distribution of authors against subreddits (log-log scale)
Figure 13: Distribution of the average number of comments received against the authors submitting
the 150 most commented posts
5 Stereotyping subreddits
In order to determine some possible stereotypes of subreddits, we start by investigating the subreddit
lifespan. As a first step, we considered the subreddits created in January 2019 and then verified the
month when they performed their last activity (and, therefore, presumably died). The results obtained
are reported in Figure 14. Here, an activity level of 1 implies that the subreddit died in the same
month it was born, an activity level of 2 suggests that it died one month after it was born, and so
on. An activity level of 8 indicates that it is still alive (we recall that our dataset comprises data from
January 1st, 2019 to September 1st, 2019). We proceeded in the same way for the subreddits created
in February, March, and so forth. For instance, in Figure 15, we report the trends of the subreddits
created in February 2019 and in March 2019.
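The activity level defined above can be computed as in the following minimal sketch. It assumes that the birth of a subreddit is approximated by the month of its first post in the dataset and its death by the month of its last post; this is a simplification introduced for the example and may differ from how creation dates were actually determined. Function names and sample data are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def month_index(created_utc):
    """Number of months since year 0 for a Unix timestamp (UTC)."""
    moment = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    return moment.year * 12 + moment.month


def activity_levels(posts, birth_year, birth_month):
    """Distribution of activity levels for the subreddits born in the given month."""
    first, last = {}, {}
    for post in posts:
        month = month_index(post["created_utc"])
        name = post["subreddit"]
        first[name] = min(first.get(name, month), month)
        last[name] = max(last.get(name, month), month)
    cohort = birth_year * 12 + birth_month
    # Level 1 means that the last activity falls in the birth month itself.
    return Counter(last[name] - first[name] + 1
                   for name in first if first[name] == cohort)


def utc(year, month, day):
    """Helper building a Unix timestamp for the sample data."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


# Hypothetical sample, not taken from the real dataset.
posts = [
    {"subreddit": "s1", "created_utc": utc(2019, 1, 5)},
    {"subreddit": "s2", "created_utc": utc(2019, 1, 10)},
    {"subreddit": "s2", "created_utc": utc(2019, 3, 2)},
    {"subreddit": "s3", "created_utc": utc(2019, 2, 1)},
]
levels = activity_levels(posts, 2019, 1)
print(levels)
```

Replacing the month index with a day index gives the day-level lifespan used to isolate the subreddits that died on the same day they were born.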
After this, we focused on those subreddits died in the same month they were born. We analyzed
their corresponding lifespan and we observed that almost all of them died in the same day they were
born. For instance, in Figure 16, we report the trends of the subreddits born and died in February
2019 and in March 2019.
Then, we decided to deeply investigate those subreddits died in the same day they were born.
14
Figure 14: Lifespan of the subreddits created in January 2019
Figure 15: Lifespan of the subreddits created in February 2019 (at left) and March 2019 (at right)
Figure 16: Lifespan of the subreddits born and died in February 2019 (at left) and March 2019 (at
right)
15
We computed their distribution against the number of their posts. Figure 17 shows what happens
for January 2019; the same trend can be observed for the other months of 2019. This
distribution follows a power law, and most of the subreddits that died on the same day
they were born have only one post. We then computed the distribution of these subreddits
against the number of comments. Figure 18 shows it for the subreddits of January 2019; again, the
same trend can be observed for the other months of 2019. This distribution
also follows a power law, and most of these subreddits have no
comments.
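A rough way to inspect such a distribution is to count how many subreddits have each number of posts and to estimate the slope of the resulting points on a log-log scale. The sketch below uses synthetic data and hypothetical function names; a straight-line fit in log-log space is only an informal diagnostic, not a rigorous power-law test, and it is not necessarily the procedure followed in the paper.

```python
import math
from collections import Counter

def count_distribution(values):
    """Map each value (e.g., a number of posts) to the number of subreddits having it."""
    return Counter(values)

def loglog_slope(distribution):
    """Least-squares slope of log(frequency) against log(value).

    Zero values (e.g., subreddits with no comments) are skipped because log(0) is undefined.
    """
    pts = [(math.log(x), math.log(y)) for x, y in distribution.items() if x > 0 and y > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx

# Synthetic data: the number of subreddits with k posts is 1000 / k^2.
toy_posts = [k for k in (1, 2, 5, 10) for _ in range(1000 // k**2)]
slope = loglog_slope(count_distribution(toy_posts))
```

On this synthetic input the estimated slope is close to -2, i.e., the exponent used to generate the data.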
Figure 17: Distribution of the subreddits of January 2019 that died on the same day they were born against
the number of their posts
Figure 18: Distribution of the subreddits of January 2019 that died on the same day they were born against
the number of their comments
Next, we examined a second class of subreddits, similar to the previous one: we selected all
those subreddits that died one day after they were born. Again, we first computed their distribution
against the number of posts. Figure 19 shows what happens for the subreddits of January 2019;
again, the same trend was found for all the other months. This distribution follows a power law,
as expected. What we did not expect is that the minimum number of posts was 2 rather than 1.
Even more unexpectedly, this finding is also confirmed for the subreddits with the same features
born in the other months. After that, we computed the distribution of these subreddits against the
number of comments. Figure 20 shows it for the subreddits of January 2019; the same trend can
be observed for all the other months. This distribution
also follows a power law, and most of these subreddits have no comments.
Figure 19: Distribution of the subreddits of January 2019 that died one day after they were born against
the number of their posts
Note that the two classes of subreddits above have a specific characterization that differentiates
them from all the other classes of subreddits (for instance, the ones that survived for some months).
They also have a few features distinguishing them from each other. However, their
similarities far outnumber their differences. As a consequence, the two
classes can be grouped into a single “macro-category” of stereotypes that we call “dead in crib”. By
examining these subreddits more deeply, we determined the following stereotypes
characterizing the “dead in crib” subreddits (i.e., those subreddits that died at most one day after
they were born):
• User Profile: it is associated with a user profile.
• Unsuccessful Subreddit: it initially stimulated several interactions. However, after a few hours,
these interactions ceased and it quickly died.
• Comment Grabber: it had at least one post capable of stimulating a debate, even if minimal.
• Private Community: it requires an invitation to be accessed. It is often associated with a specific
event of interest for a specific community.
• Banned Subreddit: it was banned, probably because it was associated with a spammer.
• Bot: it can be recognized because its posts are always similar and consist of links and comments
with links.
Figure 20: Distribution of the subreddits of January 2019 that died one day after they were born against
the number of their comments
In order to characterize these stereotypes, and all the others that we will consider in the following,
we have defined three possible orthogonal taxonomies. These are based on:
• the number of posts; we considered two possible classes, i.e., few posts and many posts;
• the number of comments; we considered two possible classes, i.e., few comments and many
comments;
• the number of authors; we considered two possible classes, i.e., few authors and many authors.
Taking these three taxonomies into consideration, the previous stereotypes can be classified as
shown in Tables 1 and 2.
Observe that a stereotype can often belong to both classes of a taxonomy. This implies that
it cannot be “categorized” based on that taxonomy. For instance, Comment Grabber, in the presence of
many comments and many authors, can be found with both few posts and many posts. Hence,
this stereotype can be characterized only by the number of comments and the number of authors,
but not by the number of posts. Analogously, in the presence of many posts, Banned Subreddit cannot be
characterized by the number of comments or the number of authors. By contrast, in the presence of few
posts, Banned Subreddit is characterized by few comments and few authors.
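To make the three taxonomies concrete, the following sketch assigns a subreddit to one class per taxonomy. The text does not report the cut-off separating "few" from "many", so the threshold values used here are purely hypothetical, as are the class and function names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SubredditStats:
    posts: int
    comments: int
    authors: int

def taxonomy_classes(stats, post_cut=10, comment_cut=10, author_cut=10):
    """Return the (posts, comments, authors) classes of a subreddit.

    The three cut-offs are illustrative assumptions, not values taken from the paper.
    """
    return (
        "many posts" if stats.posts >= post_cut else "few posts",
        "many comments" if stats.comments >= comment_cut else "few comments",
        "many authors" if stats.authors >= author_cut else "few authors",
    )

# A subreddit with a single post, no comments and one author.
classes = taxonomy_classes(SubredditStats(posts=1, comments=0, authors=1))
```

Because the taxonomies are orthogonal, each subreddit falls into exactly one of the eight combinations of classes, which is what the classification tables enumerate.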
After having investigated the stereotypes of the “dead in crib” subreddits, we focused on the
opposite category of subreddits, i.e., those that survived for all the months covered by our dataset. We
collectively call them “survivors” in the following. We applied the same reasoning and tasks that we
carried out for the “dead in crib” subreddits and we obtained the following stereotypes:
              | Few Authors                                            | Many Authors
Few Comments  | User Profile; Unsuccessful Subreddit; Banned Subreddit | Unsuccessful Subreddit
Many Comments | Unsuccessful Subreddit; Comment Grabber; User Profile  | Private Community; Bot; Unsuccessful Subreddit; Comment Grabber
Table 1: Classification of stereotypes concerning the subreddits “dead in crib” - Few posts case
              | Few Authors                                            | Many Authors
Few Comments  | User Profile; Unsuccessful Subreddit; Banned Subreddit | Unsuccessful Subreddit; Bot; Banned Subreddit
Many Comments | User Profile; Banned Subreddit                         | Private Community; Banned Subreddit; Unsuccessful Subreddit; Comment Grabber
Table 2: Classification of stereotypes concerning the subreddits “dead in crib” - Many posts case
• User Profile, Bot: these are the same ones we have seen for the subreddits “dead in crib”.
• Cringe / NSFW Subreddit: it contains strange or strong-content posts, submitted by only one
user, or, alternatively, it is an NSFW subreddit.
• Niche Subreddit: its topics are niche ones, and it draws the attention of users interested in them.
• Successful Subreddit.
• Big Comment Grabber: almost all the posts submitted in it stimulate a debate.
• Utility Subreddit: it is conceived to support a specific activity (think, for instance, of a subreddit
where users ask for a translation).
Based on the three taxonomies defined above, the previous stereotypes can be classified as shown
in Tables 3 and 4.
              | Few Authors                                                 | Many Authors
Few Comments  | User Profile; Bot; Cringe / NSFW Subreddit; Niche Subreddit | Successful Subreddit; Niche Subreddit
Many Comments | Successful Subreddit; Niche Subreddit; Big Comment Grabber  | Big Comment Grabber; Successful Subreddit; Niche Subreddit
Table 3: Classification of stereotypes concerning the subreddits “survivors” - Few posts case
After these analyses on the stereotypes belonging to the two extreme categories “dead in crib”
and “survivors”, we decided to apply the same reasoning and tasks to investigate a third category of
Few Authors Many Authors
Few Comments Niche Subreddit Cringe / NSFW Subreddit
Niche Subreddit
Many Comments Big Comment Grabber Successful Subreddit
Utility Subreddit
Table 4: Classification of stereotypes concerning the subreddits “survivors” - Many posts case
stereotypes, intermediate between the two previous ones. Specifically, we focused on those subreddits
that lived five months after their creation and, then, died. We call this category “undelivered promises”
and we obtained the following stereotypes for it:
• User Profile, Niche Subreddit, Bot, Cringe / NSFW Subreddit, Private Community, Banned
Subreddit: these are the same ones we have seen for the previous categories.
• Unsuccessful Boomer: it was successful for a while, but died after a period of decline.
• Unsuccessful Zombie: it was born without attracting praise or blame, survived unremarkably
for a while and, finally, died.
Based on the three taxonomies that we defined above, the previous stereotypes can be classified
as shown in Tables 5 and 6.
Few Authors Many Authors
Few Comments User Profile Bot
Niche Subreddit Cringe / NSFW Subreddit
Bot Niche Subreddit
Unsuccessful Boomer
Many Comments User Profile Niche Subreddit
Private Community Private Community
Unsuccessful Boomer Unsuccessful Boomer
Niche Subreddit
Table 5: Classification of stereotypes concerning the subreddits “undelivered promises” - Few posts
case
Few Authors Many Authors
Few Comments User Profile Private Community
Cringe / NSFW Subreddit Banned Subreddit
Bot Niche Subreddit
Unsuccessful Zombie
Many Comments User Profile Cringe / NSFW Subreddit
Bot Banned Subreddit
Cringe / NSFW Subreddit Unsuccessful Boomer
Table 6: Classification of stereotypes concerning the subreddits “undelivered promises” - Many posts
case
6 Stereotyping authors
In order to determine the possible author stereotypes, we proceeded in a way analogous to what we
have done to define subreddit stereotypes. In fact, also for authors, we found three macro-categories of
stereotypes, namely “very positive”, “neutral” and “very negative” authors. To better understand the
reasoning underlying these categories, we recall that, in Section 4.1, we have found that the number of
positive posts is about 16 times the number of negative ones in Reddit. As a consequence, it is possible
to use this result as a baseline for a preliminary author classification. Specifically, we considered an
author as “very positive” if the number of positive posts submitted by her is at least 2 · 16 = 32
times the number of negative ones, i.e., at least twice the typical ratio of positive to negative posts
in Reddit. Instead, we considered an author as “neutral” if the number
of positive posts submitted by her is between 1 and 16 times the number of negative ones. Finally, we
considered an author as “very negative” if the number of negative posts submitted by her is at least 16
times the number of positive ones. Clearly, this classification is not exhaustive, and it is empirical
because it derives from our observation of the behavior of users in Reddit. However, we feel that it
provides a useful first definition of three macro-categories of author stereotypes, possibly interesting
for application scenarios.
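The three thresholds above can be written as a small decision function. This is a sketch under stated assumptions: the function name is hypothetical, the bounds are treated as inclusive, and authors with no negative posts but at least one positive post are counted as "very positive", a case the text does not address. Authors falling outside the three ranges are returned as "unclassified", consistent with the remark that the classification is not exhaustive.

```python
def classify_author(positive: int, negative: int, baseline: int = 16) -> str:
    """Assign an author to a macro-category from her counts of positive and negative posts."""
    if positive > 0 and positive >= 2 * baseline * negative:
        return "very positive"      # at least 32 positive posts per negative one
    if negative > 0 and negative >= baseline * positive:
        return "very negative"      # at least 16 negative posts per positive one
    if negative > 0 and negative <= positive <= baseline * negative:
        return "neutral"            # between 1 and 16 positive posts per negative one
    return "unclassified"           # e.g., a ratio strictly between 16 and 32

examples = {
    (64, 2): classify_author(64, 2),
    (16, 1): classify_author(16, 1),
    (20, 1): classify_author(20, 1),
    (1, 16): classify_author(1, 16),
}
```

The pair (20, 1) illustrates the gap between the "neutral" and "very positive" ranges: its ratio of 20 exceeds 16 but does not reach 32.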
Analogously to what we have done for subreddit stereotypes, we have defined two possible orthogonal
taxonomies, namely:
• the number of posts: the possible classes are few posts and many posts;
• the number of comments: the possible classes are few comments and many comments.
Afterwards, we determined the following stereotypes characterizing the “very positive” authors,
proceeding in a way analogous to the one we adopted for subreddit stereotypes:
• Unsuccessful Author: she submits posts but is never capable of stimulating interactions with
other authors.
• Fame Seeker: she submitted (and/or is still submitting) an impressive number of posts in order
to reach fame in Reddit.
• Cringe / NSFW Author: she often submits cringe / NSFW posts.
• FBG Publisher (Few But Good Publisher): she does not publish a very high number of posts;
however, her posts are generally appreciated by other users.
• Content Creator: she creates and submits content for people.
• Successful Author: she submits many posts that receive many positive comments and are
appreciated by other users.
• Reposter: she simply re-submits posts of other authors.
Few Posts Many Posts
Few Comments Unsuccessful Author Fame Seeker
Cringe / NSFW Author
Many Comments FBG Publisher Successful Author
Content Creator Reposter
Table 7: Classification of the stereotypes concerning “very positive” authors
Based on the two taxonomies that we defined above, the previous stereotypes can be classified as
shown in Table 7.
After the “very positive” authors, we focused on the opposite macro-category of author stereotypes,
i.e., the “very negative” ones. We obtained the following stereotypes, applying the same reasoning
and performing the same tasks that we carried out for “very positive” authors:
• Unsuccessful Author: this stereotype is the same as we have seen for “very positive” authors.
• Spammer: she is an author submitting a lot of spam posts evaluated negatively by other users.
• Hatred Sower: she is a user whose goal is attacking minority groups with hate posts or comments.
• Instigator: she is an author using every opportunity to make herself known. For her, it is not
important how she is judged, but the fact that one speaks of her.
Based on the two taxonomies defined above, the previous stereotypes can be classified as shown
in Table 8.
              | Few Posts           | Many Posts
Few Comments  | Unsuccessful Author | Spammer
Many Comments | Hatred Sower        | Instigator
Table 8: Classification of the stereotypes concerning “very negative” authors
After having analyzed the stereotypes belonging to the two extreme categories, i.e., “very positive”
and “very negative” authors, we decided to investigate “neutral” authors as representative of a third
macro-category, intermediate between the two previous ones. We obtained the following stereotypes,
applying the same reasoning and tasks that we carried out for the other two macro-categories:
• Unsuccessful Author and Fame Seeker: these stereotypes are the same ones we have seen for the
previous macro-categories.
• PP Author (Private Purpose Author): she often creates subreddits for private purposes, for
instance to talk about specific topics of interest for a particular community. Often, her subreddits
require an invitation to be accessed.
• Bot: it is a bot; it can be recognized because it always submits similar posts consisting of links
and comments with links.
• Moody Author: she creates subreddits and submits posts whose topics, expressed positions, and
evaluations swing without any apparent logic.
• Comment Grabber: she occasionally submits posts capable of stimulating a debate, even if
minimal.
• Big Comment Grabber: almost all the posts submitted by her stimulate a debate.
Based on the two taxonomies defined above for authors, the previous stereotypes can be classified
as shown in Table 9.
Few Posts Many Posts
Few Comments Unsuccessful Author Fame Seeker
Bot
Many Comments PP Author Moody Author
Comment Grabber Big Comment Grabber
Table 9: Classification of the stereotypes concerning “neutral” authors
7 Analyzing author assortativity
In the past, assortativity has been largely analyzed in several social media [10]. In this section, we
aim at checking if a form of assortativity exists in Reddit; in particular, we focus on co-posters, i.e.,
authors submitting posts on the same subreddit.
In order to perform our analyses, we define a support network P, which we call co-post network.
Formally speaking:
$P = \langle N, E \rangle$
Here, $N$ is the set of the nodes of $P$; there is a node $n_i \in N$ for each author $a_i$ who submitted at
least one post. There is an edge $(n_i, n_j, w_{ij}) \in E$ if the authors $a_i$ and $a_j$ (associated with the nodes
$n_i$ and $n_j$, respectively) submitted at least one post in the same subreddit. $w_{ij}$ indicates the number
of subreddits having at least one post of $a_i$ and, simultaneously, at least one post of $a_j$.
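The definition of the co-post network can be illustrated with a short sketch that derives nodes and weighted edges from a list of posts. The input format, the function name and the toy data are assumptions made here for illustration; enumerating all author pairs of a subreddit is quadratic in the number of its authors, so this direct approach is only meant for small examples, not for a network of the size reported below.

```python
from collections import defaultdict
from itertools import combinations

def build_copost_network(posts):
    """Build the co-post network from (author, subreddit) pairs, one pair per submitted post.

    Returns the set of nodes and a dict mapping each unordered author pair (a, b), a < b,
    to the number of subreddits in which both authors submitted at least one post.
    """
    authors_by_subreddit = defaultdict(set)
    nodes = set()
    for author, subreddit in posts:
        authors_by_subreddit[subreddit].add(author)
        nodes.add(author)
    weights = defaultdict(int)
    for authors in authors_by_subreddit.values():
        for a, b in combinations(sorted(authors), 2):
            weights[(a, b)] += 1
    return nodes, dict(weights)

# Hypothetical toy data.
toy_posts = [
    ("ann", "r/science"), ("bob", "r/science"), ("ann", "r/science"),
    ("ann", "r/askreddit"), ("bob", "r/askreddit"), ("cat", "r/askreddit"),
    ("dan", "r/pics"),
]
nodes, edges = build_copost_network(toy_posts)
```

In this toy example, ann and bob share two subreddits, so their edge has weight 2, while dan posted alone in a subreddit and is therefore an isolated node with degree 0.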
The number of nodes of $P$ is equal to the number of authors in our testbed, i.e., 12,464,188. The
number of arcs of $P$ is about 925 billion. The density of this network is 0.00596, whereas the average
clustering coefficient is 0.43753.
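As a quick plausibility check of these figures, the reported density can be compared with the ratio between the number of arcs and the number of ordered node pairs. This is only a sketch: the formula below is an assumption, since the text does not state which definition of density was used, and the arc count is rounded to 925 billion.

```latex
\[
\frac{|E|}{|N|\,(|N|-1)}
\approx \frac{9.25 \times 10^{11}}{\left(1.2464 \times 10^{7}\right)^{2}}
\approx \frac{9.25 \times 10^{11}}{1.5536 \times 10^{14}}
\approx 5.95 \times 10^{-3}
\]
```

This is close to the reported value of 0.00596, with the small gap compatible with the rounding of the arc count. Under the alternative convention $2|E| / (|N|(|N|-1))$, often used for undirected networks, the same inputs would give roughly twice that value.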
First of all, we computed the degree centrality of the nodes of $P$. In Figure 21, we report the
corresponding distribution. This figure shows that degree centrality follows a power law, albeit a
noisy one. This result is in line with the theory regarding this kind of centrality [47]. The maximum
value of degree centrality is 1,820,412, while the minimum value is 0.
Figure 21: Distribution of degree centrality for the nodes of $P$
We sorted the authors in descending order of degree centrality, to
verify the possible presence of degree assortativity in Reddit. Then, we divided the sorted list into
intervals of authors. In particular, we considered equi-width intervals $\{I_1, I_2, \ldots, I_{40}\}$, each consisting
of 312,500 authors⁵. As a consequence, the interval $I_k$, $1 \le k \le 39$, contained the authors of the sorted
list in positions $(312{,}500 \cdot (k-1),\ 312{,}500 \cdot k]$, an interval open on the left and closed on the right. The
interval $I_{40}$ contained the authors in positions $(12{,}187{,}500,\ 12{,}464{,}188]$.
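The partition into intervals can be sketched as follows. The function name and the input format (a list of author/degree pairs) are assumptions introduced for illustration; the interval width and the number of authors are the ones given in the text.

```python
import math

def partition_by_rank(authors_with_degree, width=312_500):
    """Sort authors by descending degree centrality and split them into consecutive intervals."""
    ranked = sorted(authors_with_degree, key=lambda pair: pair[1], reverse=True)
    return [
        [author for author, _ in ranked[start:start + width]]
        for start in range(0, len(ranked), width)
    ]

# Arithmetic for the figures reported in the text.
n_authors = 12_464_188
n_intervals = math.ceil(n_authors / 312_500)
last_interval_size = n_authors - (n_intervals - 1) * 312_500

# Hypothetical toy data with a width of 2 authors per interval.
toy = [("a", 5), ("b", 9), ("c", 1), ("d", 7), ("e", 3)]
toy_intervals = partition_by_rank(toy, width=2)
```

With 12,464,188 authors and a width of 312,500, this yields 40 intervals, the last of which contains 276,688 authors and is therefore slightly narrower than the others.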
First of all, we considered the first interval $I_1$ and, for each interval $I_k$, $1 \le k \le 40$, we determined
how many authors of $I_1$ are connected to at least one author of $I_k$. The results obtained are reported
in Figure 22(a). Then, we computed the percentage of authors of $I_k$ connected with at least one
author of $I_1$. The results obtained are reported in Figure 22(b). From the analysis of Figure 22, it
is clear that a strong correlation (i.e., a sort of backbone) exists among the authors with the highest
degree centrality.
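The two quantities plotted for a reference interval can be computed as in the sketch below. It assumes that the network is available as a dictionary mapping each author to the set of her neighbors and that the intervals are lists of authors; both representations, as well as the function names, are hypothetical.

```python
def connected_count(source, target, neighbors):
    """Number of authors in `source` connected to at least one author in `target`."""
    target_set = set(target)
    return sum(1 for author in source if neighbors.get(author, set()) & target_set)

def interval_profile(intervals, ref_index, neighbors):
    """Return, for every interval I_k, the two measures computed against a reference interval.

    counts[k]:      number of authors of the reference interval connected to at least one author of I_k
    percentages[k]: percentage of authors of I_k connected to at least one author of the reference interval
    """
    ref = intervals[ref_index]
    counts = [connected_count(ref, ik, neighbors) for ik in intervals]
    percentages = [100.0 * connected_count(ik, ref, neighbors) / len(ik) for ik in intervals]
    return counts, percentages

# Hypothetical toy network with three intervals.
toy_intervals = [["a", "b"], ["c", "d"], ["e"]]
toy_neighbors = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}, "d": {"e"}, "e": {"d"}}
counts, percentages = interval_profile(toy_intervals, 0, toy_neighbors)
```

Using a different reference index (for instance, the one of an intermediate or low interval) gives the analogous profiles for the other intervals.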
Figure 22: (a) Number of authors of $I_1$ connected to at least one author of $I_k$ - (b) Percentage of
authors of $I_k$ connected to at least one author of $I_1$
⁵ Actually, the last interval was slightly narrower than the other ones.
In order to verify the statistical significance of our results, we generated a null model to compare
our findings with the ones obtained in an unbiased random scenario. Specifically, we built our null
model by shuffling the arcs of $P$ (that, in our case, represent co-posting relationships) among the nodes
of this network. In this way, we left unchanged all the original features of $P$ with the exception of
the distribution of co-posting relationships, which became random in the null model. After that,
we repeated the previous analyses on the null model. The results obtained are reported in Figure 23.
Comparing this figure with Figure 22, we can see that the distributions represented therein are
similar, in that many of the intervals with the highest values in Figure 22 continue to reach the
highest values in Figure 23. However, in this last case, the values are much smaller. Therefore, we can
conclude that the behavior observed in Figure 22 (and the consequent possible degree assortativity
revealed by it) is not random but is intrinsic to Reddit.
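One possible reading of this shuffling step is sketched below: the set of nodes and the number of arcs are kept, while each arc is reassigned to a pair of distinct nodes chosen uniformly at random. This is an assumption, not necessarily the procedure used in the paper; other null models (for instance, rewiring that preserves each node's degree) are also common. The function name is hypothetical and arc weights are ignored for simplicity.

```python
import random

def shuffle_edges(nodes, edges, seed=0):
    """Return a random set of undirected edges with the same nodes and the same number of edges."""
    rng = random.Random(seed)
    node_list = sorted(nodes)
    max_edges = len(node_list) * (len(node_list) - 1) // 2
    if len(edges) > max_edges:
        raise ValueError("more edges than distinct node pairs")
    shuffled = set()
    while len(shuffled) < len(edges):
        a, b = rng.sample(node_list, 2)          # two distinct nodes, so no self-loops
        shuffled.add((a, b) if a < b else (b, a))
    return shuffled

# Hypothetical toy network.
toy_nodes = {"a", "b", "c", "d", "e"}
toy_edges = {("a", "b"), ("b", "c"), ("c", "d")}
null_edges = shuffle_edges(toy_nodes, toy_edges, seed=42)
```

The interval analysis can then be repeated on the shuffled edges and compared with the one obtained on the original network.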
Figure 23: (a) Number of authors of $I_1$ connected to at least one author of $I_k$ in the null model - (b)
Percentage of authors of $I_k$ connected to at least one author of $I_1$ in the null model
However, this is not sufficient to conclude that there is a degree assortativity for authors in Reddit.
In fact, we must check if this trend is also confirmed for the authors with an intermediate degree
centrality and for those with a low degree centrality.
Clearly, for an exhaustive analysis, we should repeat the tasks we have previously carried out for $I_1$
for all intervals. Due to space constraints, we limit our analysis to the interval $I_{20}$, representative of
intermediate degree centrality intervals, and $I_{39}$, representative of the low degree centrality intervals⁶.
Figure 24(a) reports the number of authors of $I_{20}$ connected to at least one author of $I_k$, whereas
Figure 24(b) shows the percentage of authors of $I_k$ connected with at least one author of $I_{20}$. From
the analysis of this figure, a strong correlation emerges between the authors with an intermediate
degree centrality.
Also in this case, we compared these findings with the ones obtained in the null model. The latter
are reported in Figure 25. Looking at these results and the ones represented in Figure 24, we can
conclude that, again, the behavior observed in Figure 24 is not random but is a property of
Reddit.
Finally, Figure 26(a) reports the number of authors of $I_{39}$ connected to at least one author of $I_k$,
whereas Figure 26(b) shows the percentage of authors of $I_k$ connected with at least one author of $I_{39}$.
⁶ We did not choose $I_{40}$ because it contains fewer authors than the other intervals.
Figure 24: (a) Number of authors of $I_{20}$ connected to at least one author of $I_k$ - (b) Percentage of
authors of $I_k$ connected to at least one author of $I_{20}$
Figure 25: (a) Number of authors of $I_{20}$ connected to at least one author of $I_k$ in the null model - (b)
Percentage of authors of $I_k$ connected to at least one author of $I_{20}$ in the null model
Again, there is a strong correlation between authors with a low degree centrality. Also for this last
case, we compared the results obtained with the ones returned using the null model. We report the
latter in Figure 27. The comparison of these figures confirms that the behavior observed in them
is a property intrinsic to Reddit.
Figure 26: (a) Number of authors of $I_{39}$ connected to at least one author of $I_k$ - (b) Percentage of
authors of $I_k$ connected to at least one author of $I_{39}$
Figure 27: (a) Number of authors of $I_{39}$ connected to at least one author of $I_k$ in the null model - (b)
Percentage of authors of $I_k$ connected to at least one author of $I_{39}$ in the null model
Having verified that there exists a sort of backbone among the authors with a high (resp., intermediate,
low) degree centrality, we can conclude that Reddit is actually assortative with respect to
degree centrality, as far as the co-posting relationship is concerned.
This important result can be explained considering the concept of karma and the posting rules in
Reddit. Indeed, in this platform, each user is associated with a karma, which is a score taking her past
“reputation” into account. Generally, users with high karma are very active and often submit many
appreciated posts. As a consequence, it is presumable that they have a high degree centrality. In other
words, a direct correlation between karma and degree centrality can be presumed for authors. Now,
the posting rules of Reddit state that each subreddit is associated with a minimum karma threshold
[36, 37, 6], so that only the authors with a karma higher than this threshold can submit a post on it.
This threshold is dynamic and changes over time. Clearly, when it is low, all the authors can submit
their posts on the subreddit. When it grows, the authors with a low karma (and, presumably, with
a low degree centrality) cannot submit posts on it. Finally, when it becomes high, only the authors
with a high karma (and, presumably, a high degree centrality) can submit posts on it. This mechanism
tends to segment users into groups having homogeneous degree centralities.
Having verified the assortativity of Reddit with respect to degree centrality, it is natural to wonder
whether this property depends on the type of centrality or is intrinsic to this social platform. As a
premise to this investigation, it is worth underlining that each form of assortativity is a story in
its own right. Therefore, it is impossible to define a general rule. Nevertheless, it is possible to verify if a
trend exists, and we have operated in this direction.
To this end, we have chosen a second form of centrality (i.e., the eigenvector centrality) and we
have repeated for it all the steps previously seen for degree centrality. The results obtained are shown
in Figures 28 - 30.
They confirm that there is an assortativity among the authors of Reddit also with respect to the
eigenvector centrality. As a consequence, we can conclude that the assortativity of Reddit authors is
not limited to degree centrality but represents a trend characterizing this social platform beyond the
form of centrality taken into consideration.
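For reference, eigenvector centrality assigns each node a score proportional to the sum of the scores of its neighbors, and it can be approximated by power iteration. The sketch below is a generic illustration on an unweighted, undirected toy graph; it is not the implementation used in the paper, and the function name and input format are assumptions. Each update adds the node's own score to the sum over its neighbors, a common shift that avoids oscillations on bipartite graphs without changing the resulting ranking.

```python
import math

def eigenvector_centrality(neighbors, iterations=100, tol=1e-9):
    """Approximate eigenvector centrality by power iteration.

    `neighbors` maps every node to the set of its neighbors (symmetric, every node is a key).
    """
    score = {node: 1.0 for node in neighbors}
    for _ in range(iterations):
        # Shifted update: own score plus the scores of the neighbors.
        new = {node: score[node] + sum(score[m] for m in neighbors[node]) for node in neighbors}
        norm = math.sqrt(sum(v * v for v in new.values()))
        new = {node: v / norm for node, v in new.items()}
        if max(abs(new[node] - score[node]) for node in neighbors) < tol:
            return new
        score = new
    return score

# Hypothetical toy graph: a triangle a-b-c with a pendant node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
centrality = eigenvector_centrality(toy)
```

In the toy graph, node c obtains the highest score and the pendant node d the lowest, while a and b, which play symmetric roles, obtain the same score. Once such scores are available, authors can be sorted by them and the interval analysis described for degree centrality can be repeated unchanged.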
Figure 28: (a) Number of authors of $I_1$ connected to at least one author of $I_k$ - (b) Percentage of
authors of $I_k$ connected to at least one author of $I_1$ - (c) Number of authors of $I_1$ connected to at
least one author of $I_k$ in the null model - (d) Percentage of authors of $I_k$ connected to at least one
author of $I_1$ in the null model
8 Discussion
In this section, we examine the results on subreddit stereotypes in order to identify their correlations
and build an overview of the knowledge on Reddit extracted in this paper.
First of all, we observe that, although in principle subreddit stereotypes and author stereotypes
are two orthogonal concepts, in practice there are strong correlations between them. In fact, certain
subreddit stereotypes are the ideal and perfectly tailored places for certain user stereotypes, and vice
versa.
Let us now examine these correlations more closely. In the rest of this section, for the sake of
clarity and conciseness, we use the Successful Subreddit notation to indicate the name of a
subreddit stereotype, whereas we adopt the Successful Author notation to denote an author stereotype.
User Profile is a fairly generic subreddit stereotype and can be related, at least partially, to various
author stereotypes. Surely, a Fame Seeker can create a User Profile subreddit to advertise her profile.
A similar argument probably applies to a Content Creator and a Successful Author.
Unsuccessful Subreddit could be at least partially related to Unsuccessful Author because if a
subreddit was not successful then its posts did not attract Reddit users. Clearly, the authors of those
posts, if this fact happens several times, would tend to become unsuccessful authors.
Clearly, there are very strong and direct correlations between Comment Grabber and the homonymous
author stereotype, between Big Comment Grabber and the homonymous author stereotype, between Private
Community and PP Author, between Bot and the homonymous author stereotype, and between
Cringe / NSFW Subreddit and Cringe / NSFW Author.
Figure 29: (a) Number of authors of $I_{20}$ connected to at least one author of $I_k$ - (b) Percentage of
authors of $I_k$ connected to at least one author of $I_{20}$ - (c) Number of authors of $I_{20}$ connected to at
least one author of $I_k$ in the null model - (d) Percentage of authors of $I_k$ connected to at least one
author of $I_{20}$ in the null model
There is at least a partial relationship between Banned Subreddit and Spammer and Hatred Sower,
because it is very likely that subreddits with many authors of those two categories are banned. Similarly,
there is a correlation between Successful Subreddit and Successful Author; in fact, it is likely
that if many successful authors write in a subreddit, then that subreddit will be successful.
A less obvious, but extremely interesting correlation exists between Niche Subreddit and FBG
Publisher.
Again, Unsuccessful Boomer may be related to Fame Seeker, Cringe / NSFW Author, Hatred
Sower or Instigator. In all these cases, the authors of these subreddits may have initially succeeded
in stimulating the attention of other Reddit users but, after a while, this attention was lost.
There is also a quite evident correlation between Unsuccessful Zombie and Unsuccessful Author,
in the sense that if an author activates subreddits that become Unsuccessful Zombie, in the long run
she risks becoming an Unsuccessful Author. Finally, Unsuccessful Zombie could have a slightly subtler
and hidden correlation with Moody Author because, if many posts of moody authors
are published in a subreddit, it is likely that this subreddit will not attract people and will eventually
become an Unsuccessful Zombie.
Figure 30: (a) Number of authors of $I_{39}$ connected to at least one author of $I_k$ - (b) Percentage of
authors of $I_k$ connected to at least one author of $I_{39}$ - (c) Number of authors of $I_{39}$ connected to at
least one author of $I_k$ in the null model - (d) Percentage of authors of $I_k$ connected to at least one
author of $I_{39}$ in the null model
After having examined the correlation between subreddit stereotypes and author stereotypes, we
continue our discussion by examining the correlations between the results obtained for author
stereotypes and those concerning assortativity. In Section 7, we found that there is a degree (resp.,
eigenvector) assortativity between Reddit authors. This implies that authors with similar degree (resp.,
eigenvector) centrality tend to form a backbone. Keeping in mind the definition and properties of
these two forms of centrality, it is possible to make some interesting deductions.
The first one is that Fame Seekers, who generally have a high degree centrality, tend to form a
backbone and, therefore, to support each other. An analogous reasoning can be imagined for Successful
Authors and Reposters, who are also characterized by a very high degree centrality. Continuing in
this direction, even many authors characterized by negative stereotypes tend to support each other;
in particular, this happens for Spammers, Hatred Sowers and Instigators. In these cases, a post
published by one of them tends to provoke the reaction of the others, giving rise to very long discussions
that often involve a huge number of people. A similar situation, even if with a neutral and not negative
connotation, can concern the Big Comment Grabbers. Even these authors tend to form communities
in which large discussions take place; however, unlike the previous cases, these discussions are not
necessarily harmful.
As far as eigenvector centrality is concerned, in addition to all the communities mentioned above,
the presence of backbones between FBG Publishers or Content Creators appears possible. In fact,
these authors, who tend to use Reddit as a utility tool, may be strongly attracted by subreddits
created by authors with the same intentions and, therefore, may tend to form communities. It is
interesting to highlight that these types of figures (a sort of “grey cardinals”) are the classical ones
having a high eigenvector centrality and, in our case, a high eigenvector assortativity.
A final discussion concerns the results on assortativity described in this paper and the ones on
assortativity in social networks described in the past literature. As previously pointed out, Newman’s
seminal work showed that social networks are generally assortative, unlike other types of networks,
such as technological and biological ones, which are disassortative [39].
Next, the authors of [3] demonstrated that: (i) Cyworld is slightly disassortative with respect
to degree centrality on a network built taking users and their friendships into account, while it is
strongly assortative with respect to degree centrality on a network built considering users and the
“testimonial” relationships (a kind of relationship specific to this social network) existing between
them; (ii) Orkut is assortative with respect to degree centrality on a network built starting from users
and their friendships; (iii) MySpace is neutral (that is, neither assortative nor disassortative) with
respect to degree centrality on a network that takes users and their friendships into account.
The authors of [8] showed that Twitter is strongly assortative with respect to degree centrality
on a network that takes the sharing of interest among users into account. Furthermore, the authors
of [10] studied assortativity in Facebook and showed that such a social network is assortative with
respect to the tendency of a bridge (i.e., a user joining more social networks) to communicate with
other bridges.
Finally, in [25], the authors considered Reddit and investigated the concept of assortativity but for
a very particular aspect, i.e., loyal communities. In particular, they showed that loyal communities
are not assortative with respect to the activity level of the users belonging to them, while assortativity
exists in the case of unloyal communities. The lack of assortativity in loyal communities implies that
users belonging to them are willing to communicate with all the other users of the same community,
regardless of their activity level. By contrast, the presence of assortativity in unloyal
communities implies that the corresponding users tend to partition themselves into subgroups based on
their activity level. Indeed, a user with a certain activity level tends to communicate only with users
having similar activity levels.
As said before, our paper aims to contribute to the study of assortativity in social
networks. First, besides degree centrality, it also considers eigenvector centrality. Furthermore, it
focuses on the study of assortativity in Reddit, a social platform that was not analyzed in the past as
far as this feature is concerned, except for the investigations described in [25]. However, in this last
paper, the main topic of the authors' investigation was not assortativity but loyalty, while assortativity
simply served as a feature to assess whether loyal and unloyal communities could be partitioned into
smaller groups. Therefore, compared to the general studies on assortativity presented in [3, 8, 10],
the analysis of [25] can be considered a niche one. As evidence of this, we can observe that, contrary to all
studies on assortativity proposed in the past, in [25] the presence of assortativity among the nodes of
a network is seen as a negative factor (leading highly active users to disregard less active and new
ones), rather than a positive feature.
Compared to [25], our paper aims at bringing the study of assortativity in Reddit into the general
mainstream of the study of assortativity in social networks, analyzing this feature by itself,
independently of other features, such as loyalty. As a matter of fact, the results we found are in line with, and
even strengthen, the trends on assortativity in social networks hypothesized by Newman and later
found by most of the other authors.
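To make the notion of degree assortativity among co-posters more concrete, the following minimal Python sketch may help. It is not the procedure adopted in this paper: it assumes a toy mapping from subreddits to their authors (the names `posts`, `co_poster_edges` and `degree_assortativity` are hypothetical), links two authors whenever they posted on the same subreddit, and computes the Pearson correlation between the degrees observed at the two ends of each edge. Values close to 1 suggest assortativity, values close to -1 suggest disassortativity, and values near 0 suggest neutrality.

```python
from itertools import combinations


def co_poster_edges(posts):
    """Build undirected co-poster edges.

    `posts` (hypothetical input format) maps a subreddit name to the
    authors who submitted at least one post to it.
    """
    edges = set()
    for authors in posts.values():
        # Two authors are co-posters if they posted on the same subreddit.
        for a, b in combinations(sorted(set(authors)), 2):
            edges.add((a, b))
    return edges


def degree_assortativity(edges):
    """Pearson correlation of the degrees at the two ends of each edge."""
    if not edges:
        return float("nan")
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    # Each undirected edge is counted in both orientations.
    xs, ys = [], []
    for a, b in edges:
        xs += [degree[a], degree[b]]
        ys += [degree[b], degree[a]]
    n = len(xs)
    mean = sum(xs) / n  # xs and ys have the same mean by symmetry
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    if var == 0:
        return float("nan")  # all nodes have the same degree: undefined
    return cov / var
```

On a star-shaped network, where a single hub is linked to authors having no other link, this sketch returns -1, i.e., the network is perfectly disassortative. Eigenvector assortativity could be sketched analogously by replacing the degree of each author with her eigenvector centrality.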
9 Possible applications of stereotypes
This section presents some possible applications of the stereotypes previously investigated. It consists
of two subsections. The first explains how subreddit stereotypes could be employed to make a subreddit
successful. The second highlights how particular types of author stereotypes could be used to improve
the content quality of subreddits.
9.1 Subreddit stereotypes
In Section 5, we defined several subreddit stereotypes belonging to three macro-categories, namely
“dead in crib”, “survivors” and “undelivered promises”. A first application of this research can be
the definition of some guidelines to follow in order to make a subreddit successful. Indeed, knowing
how a subreddit became successful (resp., unsuccessful) can lead to the characterization of “positive”
(resp., “negative”) actions that can influence the “lifespan” of a new subreddit. For instance, consider
the subreddit /r/meme. It started during 2008 and, at the time of writing, has about 806,000 users.
Certainly, it represents an example of a successful subreddit. Here, the authors post high-quality and
engaging contents. This kind of behavior could be registered as a “best practice” in the guidelines.
On the other hand, a subreddit containing only a few contents from a few authors is an example of an
unsuccessful subreddit. This failure could be caused by a lack of engaging contents posted in it.
Clearly, the above provides just an idea of what these guidelines could contain.
Another possible application of subreddit stereotypes could regard the definition and realization
of recommender systems for Reddit. These systems would aim at recommending to a user subreddits
with the same stereotype (or the same content) as the ones characterizing the subreddits accessed
by her in the past. In any case, the recommender system should avoid “dead in crib” subreddits
or, more generally, unsuccessful ones. On the other hand, the same system should suggest to a user
successful subreddits, subreddits currently expanding their community and/or subreddits characterized
by contents in line with her profile.
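A minimal Python sketch of such a stereotype-based recommender is given below, purely as an illustration. It assumes that each subreddit has already been labeled with one stereotype; the function `recommend`, its parameters and the subreddit names used to exercise it are hypothetical, and only the rule of skipping “dead in crib” subreddits comes from the discussion above.

```python
from collections import Counter

# Labels that should never be recommended (taken from the text above).
EXCLUDED = {"dead in crib"}


def recommend(visited, stereotype_of, k=3):
    """Suggest up to `k` unvisited subreddits sharing a stereotype
    with the subreddits the user accessed in the past.

    `visited` is an iterable of subreddit names; `stereotype_of` maps
    each known subreddit to its stereotype label (assumed to be given).
    """
    visited = set(visited)
    # How often each stereotype occurs among the visited subreddits.
    liked = Counter(stereotype_of[s] for s in visited if s in stereotype_of)
    candidates = [
        s for s, label in stereotype_of.items()
        if s not in visited and label not in EXCLUDED and liked[label] > 0
    ]
    # Most frequent stereotype first; ties broken alphabetically.
    candidates.sort(key=lambda s: (-liked[stereotype_of[s]], s))
    return candidates[:k]
```

A real system would also have to take content similarity and the growth of a community into account; here, only the stereotype label is used to keep the sketch short.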
A further example of possible usage of subreddit stereotypes could be the definition of an algorithm
that finds subreddits to merge or, at least, to integrate. For instance, consider two zombie subreddits
with related topics, where authors are posting contents that were not able to attract other users. These
two subreddits are surviving, but their interactions with users are so low that they can actually be
considered dead. If they were merged or integrated into a single subreddit, they could have more
chances of becoming successful. Joining together two, or even more, subreddits having the same (or
related) topics/characteristics brings more visibility and more contents to them. These contents would
otherwise be dispersed across different unsuccessful subreddits. Even if the new integrated subreddit is
made up of past zombies, it could become so successful as to attract authors and co-posters from other
communities.
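The idea of finding zombie subreddits to merge can be sketched as follows. This is only one possible realization, under the assumption that each subreddit is described by a stereotype label and a set of topic keywords; the Jaccard similarity, the threshold and all names (`merge_candidates`, `jaccard`, the `"zombie"` label string) are illustrative choices, not part of the approach presented in this paper.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two keyword collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_candidates(subreddits, threshold=0.5):
    """Return pairs of zombie subreddits with related topics.

    `subreddits` (hypothetical format) maps a subreddit name to a pair
    (stereotype label, set of topic keywords). A pair is proposed for
    merging when the keyword similarity reaches `threshold`.
    """
    zombies = sorted(
        name for name, (label, _) in subreddits.items() if label == "zombie"
    )
    pairs = []
    for x, y in combinations(zombies, 2):
        sim = jaccard(subreddits[x][1], subreddits[y][1])
        if sim >= threshold:
            pairs.append((x, y, sim))
    # Most similar pairs first.
    return sorted(pairs, key=lambda t: -t[2])
```

Only subreddits labeled as zombies are compared, so a successful subreddit on the same topic is never proposed for merging.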
9.2 Author stereotypes
In Section 6, we defined some possible author stereotypes. Some of them are strictly related to the
homonymous or corresponding subreddit stereotypes. Other ones, instead, are intrinsic to human
behavior and, in particular, to the concept of author. For example, consider “Fame Seekers” and
“Content Creators”. These users could be the target of a proposal for an advertising campaign
aimed at promoting them. Take, for instance, a painter or a digital artist who has been classified as a
“Fame Seeker”. An advertising company could easily persuade her to hire it to promote
her image.
Another possible usage of author stereotypes is the definition and implementation of different
categories of recommender systems. A first category could help bootstrap a subreddit. Consider, for
instance, a newborn subreddit where authors post comic strips created by them. Knowing successful
authors of comic strips and being able to convince them to become “Content Creators” in the new
subreddit could help the latter gain visibility. Complementary to this case, a second category of
recommender systems could be used for talent scouting. In this case, a “Fame Seeker”, who is also a
creator of comic strips, could be recommended to successful subreddits if her contents are high-quality
ones.
The last application we present in this overview is the definition of an algorithm that builds
blacklists of users based on author stereotypes. As an example, we can define a “dangerousness level”
of an author for one subreddit, a set of subreddits or all subreddits. For instance, in such a scenario,
“Hatred Sowers” can be automatically banned from subreddits attended by sensitive people. This way
of proceeding could help keep the discussion in these subreddits clean, thus protecting their
visitors from fake news and cyberbullying.
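As a rough illustration of the blacklist idea, the Python sketch below assigns a “dangerousness level” to each author from her stereotypes and blacklists those reaching a threshold. The numeric weights, the threshold and the names `dangerousness` and `build_blacklist` are assumptions introduced only for this example; how such levels should actually be defined is left open here.

```python
def dangerousness(author_stereotypes, weights):
    """Dangerousness level of an author: the highest weight among her
    stereotypes (stereotypes without a weight count as 0)."""
    return max((weights.get(s, 0.0) for s in author_stereotypes), default=0.0)


def build_blacklist(authors, weights, threshold):
    """Return the sorted list of authors to ban from a subreddit.

    `authors` maps an author name to the set of her stereotypes;
    `weights` maps a stereotype to a hypothetical danger weight;
    `threshold` can be lowered for subreddits attended by sensitive people.
    """
    return sorted(
        name for name, stereotypes in authors.items()
        if dangerousness(stereotypes, weights) >= threshold
    )
```

Using a per-subreddit threshold lets the same weights serve one subreddit, a set of subreddits or all subreddits, in line with the three scopes mentioned above.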
10 Conclusion
In this paper, we have presented an investigation on Reddit, whose aim was to analyze three aspects
of this social platform that are interesting for both theory and practice. First, we have
examined the related literature and we have described the dataset used for our investigation. Then,
we have illustrated some preliminary analyses that allowed us to gather some (partially expected)
information, useful to correctly carry out the following activities and interpret the corresponding
results.
The first piece of knowledge extracted in our investigation concerns subreddit stereotypes. We have explained the
procedure that we followed to determine them, we have defined three macro-categories and, for
each of them, a certain number of stereotypes. Finally, we have proposed three orthogonal taxonomies
and we have classified the detected stereotypes according to them. We have proceeded in the same
way when performing the second main task of our investigation, namely the definition and classification of
author stereotypes. Afterwards, we have focused on a more theoretical issue. In fact, analogously to
what has been carried out for other social platforms, we have verified whether Reddit is assortative, and in
which way. We have found that both degree and eigenvector assortativity exist in Reddit and that they
involve co-posters. Finally, we have presented several applications that could benefit from subreddit
and author stereotypes.
In the future, we plan to develop our research on Reddit along several directions. First of all,
we would like to carry out a deep investigation on NSFW subreddits. In fact, although they are very
numerous, few analyses of them have been performed in the past literature. Furthermore, in Section
9.1, we have seen that the merge, or at least the integration, of related subreddits could be extremely
beneficial. Therefore, we plan to define an approach that finds possible subreddits to merge or to
integrate and, then, suggests the tasks necessary to carry out this activity. Last but not least,
we would like to define an approach to find duplicate accounts, i.e., two or more Reddit accounts
belonging to the same person. We would like to understand the main motivations leading a user to
adopt multiple accounts and verify whether she behaves differently in different accounts.
Acknowledgments
This work was partially supported by: (i) the Italian Ministry for Economic Development (MISE)
under the project “Smarter Solutions in the Big Data World”, funded within the call “HORIZON2020”
PON I&C 2014-2020 (CUP B28I17000250008), and (ii) the Department of Information Engineering
at the Polytechnic University of Marche under the project “A network-based approach to uniformly
extract knowledge and support decision making in heterogeneous application contexts” (RSAB 2018).
References
[1] Six stereotypes you follow on Instagram. https://www.kaindefoecommunications.com/new-england-social-media-marketing/6-stereotypes-you-follow-on-instagram/, 2020.
[2] The Stereotypes of Facebook. https://www.ericsson.com/en/blog/2011/9/facebook-stereotypes-which-type-are-you, 2020.
[3] Y.Y. Ahn, S. Han, H. Kwak, S. Moon, and H. Jeong. Analysis of topological characteristics of huge online social
networking services. In Proc. of the International Conference on World Wide Web (WWW’07), pages 835–844,
Banff, Alberta, Canada, 2007. ACM.
[4] A. Al-Zoubi, J. Alqatawna, H. Faris, and M. A Hassonah. Spam profiles detection on social networks using
computational intelligence methods: The effect of the lingual context. Journal of Information Science, page
0165551519861599, 2019. SAGE.
[5] J. An, H. Kwak, O. Possega, and A. Jungherr. Political Discussions in Homogeneous and Cross-Cutting
Communication Spaces. In Proc. of the International Conference on Web and Social Media (ICWSM 2019), pages 68–79,
Munich, Germany, 2019. AAAI.
[6] K.E. Anderson. Ask me anything: what is Reddit? 2015. Emerald.
[7] F. Buccafurri, G. Lax, A. Nocera, and D. Ursino. A system for extracting structural information from Social Network
accounts. Software Practice & Experience, 45(9):1251–1275, 2015. John Wiley & Sons.
[8] F. Buccafurri, G. Lax, S. Nicolazzo, and A. Nocera. Interest Assortativity in Twitter. In Proc. of the 12th
International Conference on Web Information Systems and Technologies (WEBIST 2016), pages 239–246, Rome,
Italy, 2016. SCITEPRESS – Science and Technology Publications, Lda.
[9] F. Buccafurri, G. Lax, A. Nocera, and D. Ursino. Supporting Information Spread in a Social Internetworking
Scenario. Post-Proceedings of the International Workshop on New Frontiers in Mining Complex Knowledge Patterns
at ECML/PKDD 2012 (NFMCP 2012), pages 200–214. Lecture Notes in Artificial Intelligence, Springer.
[10] F. Buccafurri, G. Lax, A. Nocera, and D. Ursino. Internetworking assortativity in Facebook. In Proc. of the
International Conference on Social Computing and its Applications (SCA 2013), pages 335–341, Karlsruhe, Germany,
2013. IEEE Computer Society.
[11] F. Buccafurri, G. Lax, A. Nocera, and D. Ursino. Discovering Missing Me Edges across Social Networks. Information
Sciences, 319:18–37, 2015. Elsevier.
[12] C. Buntain and J. Golbeck. Identifying Social Roles in Reddit Using Network Structure. In Proc. of the International
Conference on World Wide Web (WWW 2014), pages 615–620, Seoul, Korea, 2014. ACM.
[13] N. Cassavia, E. Masciari, C. Pulice, and D. Saccà. Discovering User Behavioral Features to Enhance Information
Search on Big Data. ACM Transactions on Interactive Intelligent Systems, 7(2), 2017. ACM.
[14] X. Chen, Y. Yuan, and M. Ali Orgun. Using bayesian networks with hidden variables for identifying trustworthy
users in social networks. Journal of Information Science, page 0165551519857590, 2019. SAGE.
[15] E. Corradini, A. Nocera, D. Ursino, and L. Virgili. Defining and detecting k-bridges in a social network: the Yelp
case, and more. Knowledge-Based Systems, 187:104820, 2020. Elsevier.
[16] T. Cunha, D. Jurgens, C. Tan, and D. Romero. Are All Successful Communities Alike? Characterizing and
Predicting the Success of Online Communities. In Proc. of the World Wide Web Conference (WWW 2019), pages
318–328, San Francisco, CA, USA, 2019. ACM.
[17] K. Darwish, P. Stefanov, M.J. Aupetit, and P. Nakov. Unsupervised User Stance Detection on Twitter. In Proc. of
the International Conference on Web and Social Media (ICWSM 2020), pages 141–152, Atlanta, GA, USA, 2020.
AAAI Press.
[18] S. Datta and E. Adar. Extracting Inter-Community Conflicts in Reddit. In Proc. of the International Conference
on Web and Social Media (ICWSM 2019), pages 146–157, Munich, Germany, 2019. AAAI.
[19] C. Donato, P. Lo Giudice, R. Marretta, D. Ursino, and L. Virgili. A well-tailored centrality measure for evaluating
patents and their citations. Journal of Documentation, 75(4):750–772, 2019. Emerald.
[20] B. Ferwerda and M. Schedl. Personality-Based User Modeling for Music Recommender Systems. In Proc. of the
Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2016),
pages 254–257, Riva del Garda, Italy, 2016. Springer International Publishing.
[21] C. Fiesler, J. Jiang, J. McCann, K. Frye, and J. Brubaker. Reddit Rules! Characterizing an Ecosystem of
Governance. In Proc. of International Conference on Web and Social Media (ICWSM 2018), pages 72–81, Stanford, CA,
USA, 2018. AAAI.
[22] M. Fire and C. Guestrin. The rise and fall of network stars: Analyzing 2.5 million graphs to reveal how high-degree
vertices emerge over time. Information Processing & Management, 57(2):102041, 2020. Elsevier.
[23] T. Grover and G. Mark. Detecting Potential Warning Behaviors of Ideological Radicalization in an Alt-Right
Subreddit. In Proc. of the International Conference on Web and Social Media (ICWSM 2019), pages 193–204,
Munich, Germany, 2019. AAAI.
[24] A. Guimaraes, O. Balalau, E. Terolli, and G. Weikum. Analyzing the Traits and Anomalies of Political Discussions
on Reddit. In Proc. of the International Conference on Web and Social Media (ICWSM 2019), pages 205–213,
Munich, Germany, 2019. AAAI.
[25] W. Hamilton, J. Zhang, C. Danescu-Niculescu-Mizil, D. Jurafsky, and J. Leskovec. Loyalty in Online Communities.
In Proc. of the International Conference on Web and Social Media (ICWSM 2017), pages 540–543, Montreal,
Canada, 2017. AAAI.
[26] J. Hessel, C. Tan, and L. Lee. Science, AskScience, and BadScience: On the Coexistence of Highly Related
Communities. In Proc. of the International Conference on Web and Social Media (ICWSM 2016), pages 171–180,
Cologne, Germany, 2016. AAAI.
[27] Y. Kou, C.M. Gray, A.L. Toombs, and R.S. Adams. Understanding Social Roles in an Online Community of
Volatile Practice: A Study of User Experience Practitioners on Reddit. ACM Transactions on Social Computing,
1(4):17:1–17:22, 2018. ACM.
[28] S. Kumar, J. Cheng, and J. Leskovec. Antisocial Behavior on the Web: Characterization and Detection. In
Proc. of the International Conference on World Wide Web Companion, pages 947–950, Geneva, Switzerland, 2017.
International World Wide Web Conferences Steering Committee.
[29] S. Kumar, W.L. Hamilton, J. Leskovec, and D. Jurafsky. Community Interaction and Conflict on the Web. In Proc.
of the World Wide Web Conference (WWW 2018), pages 933–943, Lyon, France, 2018. ACM.
[30] J. LaViolette and B. Hogan. Using Platform Signals for Distinguishing Discourses: The Case of Men’s Rights and
Men’s Liberation on Reddit. In Proc. of the International Conference on Web and Social Media (ICWSM 2019),
pages 323–334, Munich, Germany, 2019. AAAI.
[31] W. Lippmann. Public Opinion. 1922. Macmillan.
[32] J. Ma and Y. Luo. The classification of rumour standpoints in online social network based on combinatorial
classifiers. Journal of Information Science, 46(2):191–204, 2020. SAGE.
[33] S. Mazhari, S.M. Fakhrahmad, and H. Sadeghbeygi. A user-profile-based friendship recommendation solution in
social networks. Journal of Information Science, 41(3):284–295, 2015. SAGE.
[34] M. McPherson, L. Smith-Lovin, and J.M. Cook. Birds of a feather: Homophily in social networks. Annual Review
of Sociology, 27:415–444, 2001. JSTOR.
[35] A.N. Medvedev, R. Lambiotte, and J.C. Delvenne. The Anatomy of Reddit: An Overview of Academic Research.
In Dynamics On and Of Complex Networks III, pages 183–204, Cham, 2019. Springer International Publishing.
[36] J. Meese. “It belongs to the Internet”: Animal images, attribution norms and the politics of amateur media
production. M/C Journal, 17(2):1–3, 2014. M/C.
[37] D. Morrison and C. Hayes. Here, have an upvote: Communication behaviour and karma on Reddit. Informatik,
pages 2258–2268, 2013. Gesellschaft für Informatik eV.
[38] E. Newell, D. Jurgens, H.M. Saleem, H. Vala, J. Sassine, C. Armstrong, and D. Ruths. User Migration in Online
Social Networks: A Case Study on Reddit During a Period of Community Unrest. In Proc. of the International
Conference on Web and Social Media (ICWSM 2016), pages 279–288, Cologne, Germany, 2016. AAAI.
[39] M.E.J. Newman. Clustering and preferential attachment in growing networks. Physical Review E, 64(2):025102,
2001. APS.
[40] M. Pennacchiotti and A. Popescu. Democrats, republicans and starbucks afficionados: user classification in Twitter.
In Proc. of the International Conference on Knowledge Discovery and Data Mining (KDD), pages 430–438, San
Diego, CA, USA, 2011. ACM.
[41] G. Ramponi, M. Brambilla, S. Ceri, F. Daniel, and M. Di Giovanni. Content-based characterization of online social
communities. Information Processing & Management, page 102133, 2019. Elsevier.
[42] Q. Shen and C. Rosé. The Discourse of Online Content Moderation: Investigating Polarized User Responses to
Changes in Reddit’s Quarantine Policy. In Proc. of the International Workshop on Abusive Language Online (ALW
2019), pages 58–69, Florence, Italy, 2019. Association for Computational Linguistics.
[43] P. Singer, F. Flöck, C. Meinhart, E. Zeitfogel, and M. Strohmaier. Evolution of Reddit: From the Front Page of the
Internet to a Self-Referential Community? In Proc. of the International Conference on World Wide Web (WWW
2014), pages 517–522, Seoul, Korea, 2014. ACM.
[44] A. Soliman, J. Hafer, and F. Lemmerich. A Characterization of Political Communities on Reddit. In Proc. of the
ACM Conference on Hypertext and Social Media (HT’19), pages 259–263, Hof, Germany, 2019. ACM.
[45] R.P. Subbanarasimha, S. Srinivasa, and S. Mandyam. Invisible Stories That Drive Online Social Cognition. IEEE
Transactions on Computational Social Systems, pages 1–14, 2020. IEEE.
[46] C. Tan and L. Lee. All Who Wander: On the Prevalence and Characteristics of Multi-Community Engagement. In
Proc. of the International Conference on World Wide Web (WWW 2015), pages 1056–1066, Florence, Italy, 2015.
ACM.
[47] M. Tsvetovat and A. Kouznetsov. Social Network Analysis for Startups: Finding connections on the social web.
Sebastopol, CA, USA, 2011. O’Reilly Media, Inc.
[48] T. Weninger. An exploration of submissions and discussions in social news: mining collective intelligence of Reddit.
Social Network Analysis and Mining, 4:173–192, 2014. Springer.
[49] D. Zhang, J. Yin, X. Zhu, and C. Zhang. User Profile Preserving Social Network Embedding. In Proc. of the
International Joint Conference on Artificial Intelligence (IJCAI’17), pages 3378–3384, Melbourne, Australia, 2017.
ijcai.org.
[50] J. Zhang, W. Hamilton, C. Danescu-Niculescu-Mizil, D. Jurafsky, and J. Leskovec. Community Identity and User
Engagement in a Multi-Community Landscape. In Proc. of the International Conference on Web and Social Media
(ICWSM 2017), pages 377–386, Montreal, Canada, 2017. AAAI.