Multimodal Social Media Analysis for Gang Violence Prevention
Philipp Blandfort,¹,²,* Desmond Patton,³ William R. Frey,³ Svebor Karaman,³ Surabhi Bhargava,³ Fei-Tzin Lee,³ Siddharth Varia,³ Chris Kedzie,³ Michael B. Gaskell,³ Rossano Schifanella,⁴ Kathleen McKeown,³ Shih-Fu Chang³
¹ DFKI, Kaiserslautern, Germany
² TU Kaiserslautern, Kaiserslautern, Germany
³ Columbia University, New York City, USA
⁴ University of Turin, Turin, Italy
philipp.blandfort@dfki.de, {dp2787,w.frey,svebor.karaman,sb4019,fl2301,sv2504}@columbia.edu, kedzie@cs.columbia.edu, mbg2174@columbia.edu, schifane@di.unito.it, kathy@cs.columbia.edu, sc250@columbia.edu
ABSTRACT
Gang violence is a severe issue in major cities across the U.S. and
recent studies [23] have found evidence of social media communications
that can be linked to such violence in communities with
high rates of exposure to gang activity. In this paper we partnered
computer scientists with social work researchers, who have domain
expertise in gang violence, to analyze how public tweets with im-
ages posted by youth who mention gang associations on Twitter
can be leveraged to automatically detect psychosocial factors and
conditions that could potentially assist social workers and violence
outreach workers in prevention and early intervention programs.
To this end, we developed a rigorous methodology for collecting
and annotating tweets. We gathered 1,851 tweets and accompa-
nying annotations related to visual concepts and the psychosocial
codes: aggression, loss, and substance use. These codes are relevant
to social work interventions, as they represent possible pathways
to violence on social media. We compare various methods for clas-
sifying tweets into these three classes, using only the text of the
tweet, only the image of the tweet, or both modalities as input to
the classier. In particular, we analyze the usefulness of mid-level
visual concepts and the role of dierent modalities for this tweet
classication task. Our experiments show that individually, text
information dominates classication performance of the loss class,
while image information dominates the aggression and substance
use classes. Our multimodal approach provides a very promising
improvement (18% relative in mean average precision) over the
best single modality approach. Finally, we also illustrate the com-
plexity of understanding social media data and elaborate on open
challenges.
1 INTRODUCTION
Gun violence is a critical issue for many major cities. In 2016, Chicago saw a 58% surge in gun homicides and over 4,000 shooting victims, more than any other city comparable in size [13]. Recent data suggest that gun violence victims and perpetrators tend to have gang associations [13]. Notably, there were fewer homicides originating from physical altercations in 2016 than in the previous year, but we have little empirical evidence explaining why. Burgeoning social science research indicates that gang violence may be exacerbated by escalation on social media and the "digital street" [16], where exposure to aggressive and threatening text
and images can lead to physical retaliation, a behavior known as "Internet banging" or "cyberbanging" [22].

Figure 1: We propose a multimodal system for detecting psychosocial codes of social media tweets¹ related to gang violence. [The figure shows two illustrative tweets with images ("Comin for ya"; "High af") being analyzed with text and image features plus local visual concepts to predict the codes aggression, loss, and substance use.]

* During some of this work Blandfort was staying at Columbia University.
Violence outreach workers present in these communities are
thus attempting [21] to prioritize their outreach around contextual
features in social media posts indicative of offline violence, and to
try to intervene and de-escalate the situation when such features
are observed. However, as most tweets do not explicitly contain
features correlated with pathways of violence, an automatic or semi-
automatic method that could ag a tweet as potentially relevant
would lower the burden of this task. The automatic interpretation
of tweets or other social media posts could therefore be very helpful
in intervention, but is quite challenging to implement for a number
of reasons, e.g. the informal language, the African American Ver-
nacular English, and the potential importance of context to the
meaning of the post. In specic communities (e.g. communities
with high rates of violence) it can be hard even for human outsiders
to understand what is actually going on.
To address this challenge, we have undertaken a first multimodal
step towards developing such a system that we illustrate in Figure 1.
Our major contributions lie in the innovative application of multimedia analysis of social media to a practical social work study, specifically
covering the following components:
- We have developed a rigorous framework to collect context-correlated tweets of gang-associated youth from Chicago containing images, together with high-quality annotations for these tweets.
- We have teamed up computer scientists and social work researchers to define a set of visual concepts of interest.
- We have analyzed how the psychosocial codes loss, aggression, and substance use are expressed in tweets with images and developed methods to automatically detect these codes, demonstrating a significant performance gain of 18% by multimodal fusion.
- We have trained and evaluated detectors for the concepts and psychosocial codes, and analyzed the usefulness of the local visual concepts, as well as the relevance of image vs. text for the prediction of each code.

¹ Note that the "tweets" in Figure 1 were created for illustrative purposes using Creative Commons images from Flickr and are NOT actual tweets from our corpus. Attributions of images in Figure 1, from left to right: "IMG_0032.JPG" by sashimikid, used under CC BY-NC-ND 2.0; "gun" by andrew_xjy, used under CC BY-NC-ND 2.0.
2 RELATED WORK
The City of Chicago is presently engaged in an attempt to use an
algorithm to predict who is most likely to be involved in a shooting
as either a victim or perpetrator [2]; however, this strategy has been widely criticized due to lack of transparency regarding the algorithm [30, 31] and the potential inclusion of variables that may be influenced by racial biases present in the criminal justice system (e.g. prior convictions) [1, 20].
In [9], Gerber uses statistical topic modeling on tweets that have geolocation to predict how likely 20 different types of crimes are to happen in individual cells of a grid that covers the city of Chicago. This work is a large-scale approach for predicting future crime locations, while we detect codes in individual tweets related to future violence. Another important difference is that [9] is meant to assist criminal justice decision makers, whereas our efforts are community based and have solid grounding in social work research.
Within text classication, researchers have attempted to extract
social events from web data including detecting police killings
[
14
], incidents of gun violence [
25
], and protests [
11
]. However,
these works primarily focus on extracting events from news articles rather than from social media, and they have focused exclusively on the text, ignoring associated images.
The detection of local concepts in images has made tremendous progress in recent years, with recent detection methods [5, 10, 18, 28, 29] leveraging deep learning and efficient architectures enabling high-quality and fast detections. These detection models are usually trained and evaluated on datasets such as the PascalVOC [8] dataset and more recently the MSCOCO [17] dataset. However, the classes defined in these datasets are intended for generic consumer applications and do not include the visual concepts specifically related to gang violence defined in Section 3.2. We therefore need to define a lexicon of gang-violence related concepts and train our own detectors for our local concepts.
The most relevant prior work is that of [4]. They predict aggression and loss in the tweets of Gakirah Barnes and her top communicators using an extensive set of linguistic features, including mappings of African American Vernacular English and emojis to entries in the Dictionary of Affective Language (DAL). The linguistic features are used in a linear SVM to make a 3-way classification between loss, aggression, and other. In this paper we additionally predict the presence of substance use, and model this problem as three binary classification problems since multiple codes may simultaneously apply. We also explore character and word level CNN classifiers, in addition to exploiting image features and their multimodal combinations.
3 DATASET
In this section we detail how we have gathered and annotated the
data used in this work.
3.1 Obtaining Tweets
Working with community social workers, we identied a list of 200
unique users residing in Chicago neighborhoods with high rates
of violence. These users all suggest on Twitter that they have a connection, affiliation, or engagement with a local Chicago gang or crew. All of our users were chosen based on their connections to a seed user, Gakirah Barnes, and her top 14 communicators in her Twitter network.² Gakirah was a self-identified gang member in Chicago before her death in April 2014. Additional users were collected using snowball sampling techniques [3]. Using the public
Twitter API, in February 2017 we scraped all obtainable tweets from
this list of 200 users. For each user we then removed all retweets,
quote tweets and tweets without any image, limiting the number
of remaining tweets per user to 20 to avoid the most active users being overrepresented. In total, the resulting dataset consists of 1,851
tweets from 173 users.
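To make this collection step concrete, the minimal sketch below mirrors the filtering just described (retweets, quote tweets and image-less tweets removed; at most 20 tweets kept per user). The field names and the tweet-dict structure are hypothetical placeholders, not those of the actual pipeline.

```python
from collections import defaultdict

MAX_TWEETS_PER_USER = 20  # cap used to avoid overrepresenting very active users

def filter_tweets(tweets):
    """Keep original tweets that contain an image, at most 20 per user.

    `tweets` is assumed to be a list of dicts with (hypothetical) keys
    'user_id', 'is_retweet', 'is_quote' and 'media' (list of image URLs).
    """
    per_user = defaultdict(list)
    for tw in tweets:
        if tw["is_retweet"] or tw["is_quote"]:
            continue                      # drop retweets and quote tweets
        if not tw.get("media"):
            continue                      # drop tweets without any image
        if len(per_user[tw["user_id"]]) < MAX_TWEETS_PER_USER:
            per_user[tw["user_id"]].append(tw)
    return [tw for user_tweets in per_user.values() for tw in user_tweets]
```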
3.2 Local Visual Concepts
To extract relevant information related to gang violence from tweet images, we develop a specific lexicon consisting of important and unique visual concepts often present in tweet images in this domain. This concept list was defined through an iterative process involving discussions between computer scientists and social work researchers. We first manually went through numerous tweets with images and discussed our observations to find which kinds of information would be valuable to detect, both for direct detection of "interesting" situations and for extracting background information, such as affiliation to a specific gang that can be visible from a tattoo. Based on these observations we formulated a preliminary list of visual concepts. We then collectively estimated utility (how useful is the extraction of the concept for gang violence prevention?), detectability (is the concept visible and discriminative enough for automatic detection?), and observability for reliable annotation (can we expect to obtain a sufficient number of annotations for the concept?), in order to refine this list of potential concepts and obtain the final lexicon.
Our interdisciplinary collaboration helped to minimize the risk of overlooking potentially important information or misinterpreting behaviors that are specific to this particular community. For example, in the images we frequently find people holding handguns with an extended clip, and in many of these cases the guns are held by the clip only. The computer scientists on our team did not pay much attention to the extended clips and were slightly confused by this way of holding the guns, but then came to learn that in this community an extended clip counts as a sort of status symbol; hence this way of holding is meant to showcase a common status symbol. Such cross-disciplinary discussions led to the inclusion of concepts such as tattoo and to the separation of firearms into handgun and long gun in our concept lexicon.

² Top communicators were statistically calculated by most mentions of and replies to Gakirah Barnes.
³ Attributions of Figure 2, from left to right: "GUNS" by djlindalovely, used under CC BY-NC-ND 2.0; "my sistah the art gangstah" by barbietron, used under CC BY-NC 2.0; "Money" by jollyuk, used under CC BY 2.0; "IMG_0032.JPG" by sashimikid, used under CC BY-NC-ND 2.0; "#codeine time" by amayzun, used under CC BY-NC-ND 2.0; "G Unit neck tattoo, gangs Trinidad" by bbcworldservice, used under CC BY-NC 2.0. Each image has been modified to show the bounding boxes of the local concepts of interest present in it.

Figure 2: Examples of our gang-violence related visual concepts annotated on Creative Commons³ images downloaded from Flickr. (a) handgun, long gun; (b) person, hand gesture; (c) money; (d) marijuana, joint; (e) lean; (f) person, tattoo.
From these discussions we have derived the following set of local (in-image) concepts of interest:
- General: person, money
- Firearms: handgun, long gun
- Drugs: lean, joint, marijuana
- Gang affiliation: hand gesture, tattoo
This list was designed in such a way that, after the training process described above, it could be further expanded (e.g. by specific hand gestures or actions with guns). We give examples of our local
concepts in Figure 2.
3.3 Psychosocial Codes
Prior studies [4, 23] have identified aggression, loss and substance use as emergent themes in initial qualitative analysis that were associated with Internet banging, an emerging phenomenon of gang affiliates using social media to trade insults or make violent threats. Aggression was defined as posts of communication that included an insult, a threat, mentions of physical violence, or plans for retaliation. Loss was defined as a response to grief or trauma, or a mention of sadness, death, or incarceration of a friend or loved one. Substance use consists of mentions of, and replies to, images that discuss or show any substance (e.g. marijuana or a liquid substance colloquially referred to as "lean"; see example in Figure 2), with the exception of cigarettes and alcohol.
The main goal of this work is to automatically detect whether a tweet can be associated with one or several of these three psychosocial codes (aggression, loss and substance use), exploiting both textual and visual content.
3.4 Annotation
A commonly used annotation process based on crowdsourcing platforms like Amazon Mechanical Turk is not suitable here, due to the special domain-specific context involved and the potentially serious privacy issues associated with the users and tweets.
Therefore, we adapted and modified the Digital Urban Violence Analysis Approach (DUVAA) [4, 24] for our project. DUVAA is a contextually-driven multi-step qualitative analysis and manual labeling process used for determining meaning in both text and images by interpreting both on- and offline contextual features. We adapted this process in two main ways. First, we include a step to uncover annotator bias through a baseline analysis of annotator perceptions of meaning. Second, the final labels by annotators undergo reconciliation and validation by domain experts living in Chicago neighborhoods with high rates of violence. Annotation is provided by trained social work student annotators and domain experts, community members who live in the neighborhoods from which the Twitter data derives. Social work students are rigorously trained in textual and discourse analysis methods using the adapted and modified DUVAA method described above. Our domain experts consist of Black and Latino men and women who affiliate with Chicago-based violence prevention programs. While our domain experts leverage their community expertise to annotate the Twitter data, our social work annotators undergo a five-stage training process to prepare them for eliciting context and nuance from the corpus.
We used the following tasks for annotation:
- In the bounding box annotation task, annotators are shown the text and image of the tweet. Annotators are asked to mark all local visual concepts of interest by drawing bounding boxes directly on the image. For each image we collected two annotations.
- To reconcile all conflicts between annotations, we implemented a bounding box reconciliation task where conflicting annotations are shown side by side and the better annotation can be chosen by a third annotator.
- For code annotation, tweets including the text, image and link to the original post are displayed, and for each of the three codes aggression, loss and substance use there is a checkbox the annotator is asked to check if the respective code applies to the tweet. We collected two student annotations and two domain expert annotations for each tweet. In addition, we created one extra code annotation to break ties for all tweets with any disagreement between the student annotations.
Our social work colleagues took several measures to ensure the
quality of the resulting dataset during the annotation process. An-
notators met weekly as a group with an expert annotator to address
any challenges and answer any questions that came up that week.
This process also involved iterative correction of recurring annotation mistakes and the infusion of new community insights provided
by domain experts. Before the meeting each week, the expert anno-
tator closely reviewed each annotator’s interpretations and labels
to check for inaccuracies.
Concepts/Codes Twitter Tumblr Total
handgun 164 41 205
long gun 15 105 116
joint 185 113 298
marijuana 56 154 210
person 1368 74 1442
tattoo 227 33 260
hand gesture 572 2 574
lean 43 116 159
money 107 138 245
aggression 457 (185) - 457 (185)
loss 397 (308) - 397 (308)
substance use 365 (268) - 365 (268)
Table 1: Numbers of instances for the different visual concepts and psychosocial codes in our dataset. For the different codes, the first number indicates for how many tweets at least one annotator assigned the corresponding code; numbers in parentheses are based on per-tweet majority votes.
During the annotation process, we monitored statistics of the annotated concepts. This made us realize that for some visual concepts of interest, the number of expected instances in the final dataset was comparatively small.⁴ Specifically, this affected the concepts handgun, long gun, money, marijuana, joint, and lean. For all of these concepts we crawled additional images from Tumblr, using the public Tumblr API with a keyword-based approach for the initial crawling. We then manually filtered the images we retrieved to obtain around 100 images for each of these specific concepts. Finally, we put these images into our annotation system and annotated them w.r.t. all local visual concepts listed in Section 3.2.
3.5 Statistics
The distribution of concepts in our dataset is shown in Table 1. Note that in order to ensure sufficient quality of the annotations, but also due to the nature of the data, we relied on a special annotation process and kept the total size of the dataset comparatively small.
Figure 3 displays the distributions of fractions of positive votes
for all 3 psychosocial codes. These statistics indicate that for the
code aggression, disagreement between annotators is substantially
higher than for the codes loss and substance use, which both display
a similar pattern of rather high annotator consensus.
3.6 Ethical considerations
The users in our dataset comprise youth of color from marginalized
communities in Chicago with high rates of gun violence. Releasing
the data has the potential to further marginalize and harm the users
who are already vulnerable to surveillance and criminalization
by law enforcement. Thus, we will not be releasing the dataset
used for this study. However, to support research reproducibility,
we will release only the extracted linguistic and image features
without revealing the raw content; this enables other researchers to
continue research on training psychosocial code detection models without compromising the privacy of our users. Our social work team members initially attempted to seek informed consent, but to no avail, as participants did not respond to requests. To protect users, we altered text during any presentation so that tweets are not searchable on the Internet, excluded all users that were initially private or changed their status to private during the analysis, and consulted Chicago-based domain experts on annotation decisions, labels and dissemination of research.

⁴ We were aiming for at least around 100-200 instances for training, plus additional instances for testing.

Figure 3: Annotator consensus for all psychosocial codes. For better visibility, we exclude tweets that were unanimously annotated as not belonging to the respective codes. Note that for each tweet there are 4 or 5 code annotations.
4 METHODS FOR MULTIMODAL ANALYSIS
In this section we describe the building blocks for analysis, the text
features and image features used as input for the psychosocial code
classication with an SVM, and the multimodal fusion methods we
explored. Details of implementation and analysis of results will be
presented in Sections 5 and 6.
4.1 Text features
As text features, we exploit both sparse linguistic features and dense vector representations extracted from a CNN classifier operating at either the word or character level.
Linguistic features. To obtain the linguistic features, we used the feature extraction code of [4], from which we obtained the following:
- Unigram and bigram features.
- Part-of-Speech (POS) tagged unigram and bigram features. The POS tagger used to extract these features was adapted to this domain and cohort of users.
- The minimum and maximum pleasantness, activation, and imagery scores of the words in the input text. These scores are computed by looking up each word's associated scores in the Dictionary of Affective Language (DAL). Vernacular words and emojis were mapped to the Standard American English entries of the DAL using a translation phrasebook derived from this domain and cohort of users (a minimal sketch of this scoring follows this list).
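The sketch below illustrates the DAL-based min/max scores from the last bullet. The miniature dictionary and the translation mapping are toy placeholders, not the actual DAL or the phrasebook used in the paper.

```python
# Hypothetical miniature DAL: word -> (pleasantness, activation, imagery).
DAL = {
    "free": (2.3, 1.9, 1.6),
    "miss": (1.4, 1.7, 1.5),
    "smoke": (1.8, 2.0, 2.9),
}

def dal_minmax_features(tokens, translation=None):
    """Min and max pleasantness, activation and imagery over the tokens.

    `translation` maps vernacular words or emojis to Standard American
    English DAL entries (the role of the paper's phrasebook).
    """
    translation = translation or {}
    scores = [DAL[translation.get(t, t)] for t in tokens
              if translation.get(t, t) in DAL]
    if not scores:
        return [0.0] * 6
    by_dim = list(zip(*scores))  # one tuple of values per DAL dimension
    return [f(dim) for dim in by_dim for f in (min, max)]
```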
CNN features. To extract the CNN features we train binary classifiers for each code. We use the same architecture for both the word and character level models, so we describe only the word level model below. Our CNN architecture is roughly the same as [15] but with an extra fully connected layer before the final softmax. That is, the text is represented as a sequence of embeddings, over which we run a series of varying-width one-dimensional convolutions with max-pooling and a pointwise nonlinearity; the resulting convolutional feature maps are concatenated and fed into a multi-layer perceptron (MLP) with one hidden layer and softmax output. After training the network, the softmax layer is discarded, and we take the hidden layer output of the MLP as the word or character feature vector used to train the psychosocial code SVM.
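The following PyTorch sketch illustrates this architecture under stated assumptions (the dimensions follow Section 5.1; the exact nonlinearities and padding of the original implementation are not specified in the paper, so ReLU and tanh are used here only for illustration).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNNFeatures(nn.Module):
    """Kim-style text CNN with one extra hidden layer before the softmax.

    After training as a binary code classifier, features() (the hidden
    layer activations) serves as the 100-dim text representation for the SVM.
    """

    def __init__(self, vocab_size, emb_dim=300, widths=(1, 2, 3, 4, 5),
                 maps_per_width=100, hidden_dim=100, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, maps_per_width, w) for w in widths])
        self.hidden = nn.Linear(maps_per_width * len(widths), hidden_dim)
        self.out = nn.Linear(hidden_dim, num_classes)  # discarded after training

    def features(self, token_ids):
        x = self.embedding(token_ids).transpose(1, 2)    # (batch, emb, seq)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return torch.tanh(self.hidden(torch.cat(pooled, dim=1)))

    def forward(self, token_ids):
        return self.out(self.features(token_ids))
```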
4.2 Image features
Here we describe how we extract, from the images, the visual features that will be fed to the psychosocial code classifier.
Local visual concepts. To detect the local concepts defined in Section 3.2, we adopt the Faster R-CNN model [29], a state-of-the-art method for object detection in images. The Faster R-CNN model introduced a Region Proposal Network (RPN) that produces region bounds and an objectness score at each location of a regular grid. The bounding boxes proposed by the RPN are fed to a Fast R-CNN [10] detection network. The two networks share their convolutional features, enabling the whole Faster R-CNN model to be trained end-to-end and to produce fast yet accurate detections. Faster R-CNN has been shown [12] to be one of the best models among modern convolutional object detectors in terms of accuracy. Details on the training of the model on our data are provided in Section 5.2.
We explore the usefulness of the local visual concepts in two ways (see the sketch after this list):
- For each local visual concept detected by the Faster R-CNN, we count how often the concept is detected in a given image. For this, we only consider predictions of the local concept detector with a confidence higher than a given threshold, which is varied in the experiments.
- In order to get a better idea of the potential usefulness of our proposed local visual concepts, we add one model to the experiments that uses ground truth local concepts as features. This corresponds to features from a perfect local visual concept detector. This method is considered out-of-competition and is not used in any fusion methods; it is used only to gain a deeper understanding of the relationship between the local visual concepts and the psychosocial codes.
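A minimal sketch of the count-based local concept features from the first bullet, assuming detections are given as (label, score) pairs; the helper itself is illustrative, not the actual implementation.

```python
import numpy as np

CONCEPTS = ["person", "money", "handgun", "long gun", "lean",
            "joint", "marijuana", "hand gesture", "tattoo"]

def concept_count_features(detections, threshold=0.1):
    """Turn detector output for one image into a per-concept count vector.

    Only detections with score >= threshold (0.1 or 0.5 in our experiments)
    are counted.
    """
    counts = np.zeros(len(CONCEPTS))
    for label, score in detections:
        if label in CONCEPTS and score >= threshold:
            counts[CONCEPTS.index(label)] += 1
    return counts
```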
Global features. As global image features, we process the given images using a deep convolutional model (Inception-v3 [32]) pretrained on ImageNet [6] and use the activations of the last layer before the classification layer as features. We decided not to update any weights of the network due to the limited size of our dataset and because such generic features have been shown to have strong discriminative power [27].
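A minimal sketch of this global feature extraction, using the torchvision implementation of Inception-v3 as a stand-in for the pretrained model used in the paper.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Pretrained Inception-v3; replacing the final classification layer with the
# identity yields the 2048-d activations used as global image features.
model = models.inception_v3(pretrained=True)
model.fc = torch.nn.Identity()
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(299), transforms.CenterCrop(299),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def global_features(image_path):
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return model(img).squeeze(0).numpy()   # 2048-d feature vector
```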
4.3 Fusion methods for code detection
In addition to the text-only and image-only models that can be obtained by individually using each feature described in Sections 4.1 and 4.2, we evaluate several tweet classification models that combine multiple kinds of features from either one or both modalities. These approaches always use the features of all non-fusion methods for the respective modalities outlined in Sections 4.1 and 4.2, and combine information in one of the following two ways (a minimal sketch follows this list):
- Early fusion: the different kinds of features are concatenated into a single feature vector, which is then fed into the SVM. For example, the text-only early fusion model first extracts linguistic features and deploys a character and a word level CNN to compute two 100-dimensional representations of the text, and then feeds the concatenation of these three vectors into an SVM for classification.
- Late fusion corresponds to an ensemble approach. Here, we first train separate SVMs on the code classification task for each feature as input, and then train another, final SVM to detect the psychosocial codes from the probability outputs of the previous SVMs.
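The sketch below illustrates the two fusion strategies with scikit-learn SVMs; it is a simplification (in particular, a careful implementation would obtain the base models' probabilities for the late-fusion SVM from held-out predictions rather than from the training data itself).

```python
import numpy as np
from sklearn.svm import SVC

def early_fusion_fit(feature_blocks, labels):
    """Concatenate all feature blocks and train a single RBF-kernel SVM."""
    X = np.hstack(feature_blocks)
    return SVC(kernel="rbf", class_weight="balanced", probability=True).fit(X, labels)

def late_fusion_fit(feature_blocks, labels):
    """Train one SVM per feature block, then a final SVM on their probabilities."""
    base = [SVC(kernel="rbf", class_weight="balanced", probability=True).fit(X, labels)
            for X in feature_blocks]
    stacked = np.hstack([clf.predict_proba(X)[:, 1:]        # P(positive class)
                         for clf, X in zip(base, feature_blocks)])
    meta = SVC(kernel="rbf", class_weight="balanced", probability=True).fit(stacked, labels)
    return base, meta
```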
5 EXPERIMENTS
Dividing by Twitter users,⁵ we randomly split our dataset into 5 parts with similar code distributions and total numbers of tweets. We use these splits for 5-fold cross validation, i.e. all feature representations that can be trained and the psychosocial code prediction models are trained on 4 folds and tested on the unseen 5th fold. All reported performances and sensitivities are averaged across these 5 data splits. Statements on statistical significance are based on 95% confidence intervals computed from the 5 values on the 5 splits.

⁵ We chose to do the split on a user basis so that tweets of the same user are not repeated in both training and test sets.
We rst detail how the text and image representations are trained
on our data. We then discuss the performance of dierent uni- and
multimodal psychosocial code classiers. The last two experiments
are designed to provide additional insights into the nature of the
code classication task and the usefulness of specic concepts.
5.1 Learning text representations
Linguistic features. We do not use all the linguistic features described in Section 4.1 as input for the SVM; instead, during training we apply feature selection using an ANOVA F-test that selects the top 1,300 most important features. Only the selected features are provided to the SVM for classification. We used the default SVM hyperparameter settings of [4].
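A minimal sketch of this selection step with scikit-learn; variable names are illustrative.

```python
from sklearn.feature_selection import SelectKBest, f_classif

# X_ling_train: matrix of linguistic features for the training fold,
# y_train: binary labels for one code. The ANOVA F-test keeps the 1,300
# most discriminative features; the fitted selector is then applied to the
# test fold before classification.
selector = SelectKBest(f_classif, k=1300)
# X_train_sel = selector.fit_transform(X_ling_train, y_train)
# X_test_sel = selector.transform(X_ling_test)
```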
CNN features. We initialize the word embeddings with pretrained 300-dimensional word2vec [19] embeddings.⁶ For the character level model, we used 100-dimensional character embeddings randomly initialized by sampling uniformly from (−0.25, 0.25). In both CNN models we used convolutional filter windows of sizes 1 to 5 with 100 feature maps each. The convolutional filters applied in this way can be thought of as word (or character) n-gram feature detectors, making our models sensitive to chunks of one to five words (or characters). We use a 100-dimensional hidden layer in the MLP. During cross-validation we train the CNNs using the Nesterov Adam [7] optimizer with a learning rate of 0.002, early stopping on 10% of the training fold, and dropout of 0.5 applied to the embeddings and convolutional feature maps.

⁶ https://code.google.com/p/word2vec/
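A compact sketch of this training setup, reusing the TextCNNFeatures module sketched in Section 4.1; the data loaders are placeholders, and dropout of 0.5 on the embeddings and convolutional feature maps is assumed to be wired inside the module.

```python
import copy
import torch

model = TextCNNFeatures(vocab_size=20000)                    # sketch from Section 4.1
optimizer = torch.optim.NAdam(model.parameters(), lr=0.002)  # Nesterov Adam variant
criterion = torch.nn.CrossEntropyLoss()

def train_with_early_stopping(train_loader, val_loader, patience=3):
    """Train until the loss on the 10% validation subset stops improving."""
    best_loss, best_state, bad_epochs = float("inf"), None, 0
    while bad_epochs < patience:
        model.train()
        for token_ids, labels in train_loader:
            optimizer.zero_grad()
            criterion(model(token_ids), labels).backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x), y).item() for x, y in val_loader)
        if val_loss < best_loss:
            best_loss, best_state = val_loss, copy.deepcopy(model.state_dict())
            bad_epochs = 0
        else:
            bad_epochs += 1
    model.load_state_dict(best_state)
```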
5.2 Learning to detect local concepts
Our local concept detector is trained using the image data from Twitter and Tumblr and the corresponding bounding box annotations. We use the Twitter data splits defined above and similarly define five splits for the Tumblr data with a similar distribution of concepts across the different parts. We train a Faster R-CNN⁷ model using 5-fold cross validation, in each run using 4 splits of the joined Twitter and Tumblr data as the training set. We evaluate our local concept detection model on the joined test set, as well as separately on the Twitter and Tumblr test sets, and will discuss its performance in Section 6.1.

⁷ We use the publicly available implementation from https://github.com/endernewton/tf-faster-rcnn
The detector follows the network architecture of VGG-16 and is trained using the 4-step alternating training approach detailed in [29]. The network is initialized with an ImageNet-pretrained model and trained for the task of local concept detection. We use an initial learning rate of 0.001, which is reduced by a factor of 0.9 every 30k iterations, and train the model for a total of 250k iterations. We use a momentum of 0.8 and a weight decay of 0.001. During training, we augment the data by flipping images horizontally. In order to deal with class imbalance during training, we weight the classification cross-entropy loss for each class by the logarithm of the inverse of its proportion in the training data. We discuss the performance of our detector in detail in Section 6.1.
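A minimal sketch of the loss weighting just described (logarithm of the inverse class proportion); the helper and the example counts are illustrative only.

```python
import numpy as np

def class_weights_log_inverse(class_counts):
    """Weight each class's cross-entropy term by log(1 / proportion).

    `class_counts` maps a concept label to its number of training instances;
    rarer classes get larger (but only logarithmically larger) weights.
    """
    total = sum(class_counts.values())
    return {label: np.log(total / count) for label, count in class_counts.items()}

# Example (illustration only, using two counts in the spirit of Table 1):
# class_weights_log_inverse({"person": 1442, "long gun": 116})
```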
5.3 Detecting psychosocial codes
We detect the three psychosocial codes separately, i.e. for each code we consider the binary classification task of deciding whether the code applies to a given tweet.
For our experiments we consider a tweet to belong to the positive class of a certain code if at least one annotator marked the tweet as displaying that code. For the negative class we use all tweets that were not marked by any annotator as belonging to the code (but which might or might not belong to either of the two other codes). We chose this way of converting multiple annotations into single binary labels because our final system is not meant to be used as a fully automatic detector but as a pre-filtering mechanism for tweets that are potentially useful for social workers. Given that the task of rating tweets with respect to such psychosocial codes inevitably depends to a certain extent on the perspective of the annotator, we think that even with a majority voting mechanism, important tweets might be missed.⁸

⁸ For future work we are planning to have a closer look at the differences between annotations of community experts and students and, based on that, treat these types of annotations differently. We report a preliminary analysis in that direction in Section 6.2.
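A minimal sketch of this label construction, with a hypothetical annotation format; the "majority" rule is included only for comparison with the parenthesized counts in Table 1.

```python
def binary_code_label(annotations, code, rule="any"):
    """Derive a binary label for one tweet and one code.

    `annotations` is a list of per-annotator dicts of booleans, e.g.
    [{"aggression": True, "loss": False, ...}, ...]. rule="any" marks the
    tweet positive if at least one annotator assigned the code.
    """
    votes = [a[code] for a in annotations]
    if rule == "any":
        return int(any(votes))
    return int(sum(votes) > len(votes) / 2)   # per-tweet majority vote
```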
In addition to the models trained using the features described in Section 4, we also evaluate two baselines that do not process the actual tweet data in any way. Our random baseline uses the training data to calculate the prior probability of a sample belonging to the positive class, and for each test sample predicts the positive class with this probability without using any information about the sample itself. The other baseline, the positive baseline, always outputs the positive class.
All features except the linguistic features were fed to an SVM with an RBF kernel for classifying the psychosocial codes. For the linguistic features, due to issues when training with an RBF kernel, we used a linear SVM with squared hinge loss, as in [4], and C = 0.01, 0.03 and 0.003 for detecting aggression, loss and substance use, respectively. The class weight was set to balanced, with all other parameters kept at their default values. We used the SVM implementation of the Python library scikit-learn [26]. This two-stage approach of feature extraction plus classifier was chosen to allow for a better understanding of the contributions of each feature. We preferred SVMs in the second stage over deep learning methods since SVMs can be trained on comparatively small datasets without the need to optimize many hyperparameters.
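A sketch of the per-code classifier setup with scikit-learn, under the settings stated above; the factory functions are illustrative, not the project's actual code.

```python
from sklearn.svm import SVC, LinearSVC

# One binary classifier per psychosocial code. Dense features (CNN text
# representations, global/local image features, fusion vectors) go into an
# RBF-kernel SVM with balanced class weights.
def make_code_classifier():
    return SVC(kernel="rbf", class_weight="balanced", probability=True)

# The sparse linguistic features instead use a linear SVM with squared
# hinge loss and a per-code C value.
LINGUISTIC_C = {"aggression": 0.01, "loss": 0.03, "substance_use": 0.003}

def make_linguistic_classifier(code):
    return LinearSVC(loss="squared_hinge", C=LINGUISTIC_C[code],
                     class_weight="balanced")
```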
For all models we report results with respect to the following metrics: precision, recall and F1-score (always on the positive class), and average precision (using detector scores to rank the output). The former three measures are useful for forming an intuitive understanding of the performance, but for drawing all major conclusions we rely on average precision, which approximates the area under the entire precision-recall curve, as opposed to a measurement at only one point.
The results of our experiments are shown in Table 2. Our results indicate that image and text features play different roles in detecting different psychosocial codes. Textual information clearly dominates the detection of the code loss. We hypothesize that loss is better conveyed textually, whereas substance use and aggression are easier to express visually. Qualitatively, the linguistic features with the highest-magnitude weights (averaged over all training splits) in a linear SVM bear this out, with the top five features for loss being i) free, ii) miss, iii) bro, iv) love, v) you; the top five features for substance use being i) smoke, ii) cup, iii) drank, iv) @mention, v) purple; and the top five features for aggression being i) Middle Finger Emoji, ii) Syringe Emoji, iii) opps, iv) pipe, v) 2017. The loss features are obviously related to the death or incarceration of a loved one (e.g. miss and free are often used in phrases wishing someone were freed from prison). The top features for aggression and substance use are either emojis, which are themselves pictographic representations, i.e. not a purely textual expression of the code, or words that reference physical objects (e.g. pipe, smoke, cup) which are relatively easy to picture.
Image information dominates the classification of both the aggression and substance use codes. Global image features tend to outperform local concept features, but combining local concept features with global image features achieves the best image-based code classification performance. Importantly, by fusing both image and text features, the combined detector performs consistently very well for all three codes, with the mAP over the three codes being 0.60, compared to 0.51 for the text-only detector and 0.49 for the image-only detector. This demonstrates a relative gain in mAP of around 20% of the multimodal approach over any single modality.
5.4 Sensitivity analysis
We performed additional experiments to get a better understanding
of the usefulness of our local visual concepts for the code pre-
diction task. For sensitivity analysis we trained linear SVMs on
psychosocial code classication, using as features either the local
visual concepts detected by Faster R-CNN or the ground truth vi-
sual concepts. All reported sensitivity scores are average values of
the corresponding coecients of the linear SVM, computed across
the 5 folds used for the code detection experiments. Results from
this experiment can be found in Table 3.
From classication using ground truth visual features we see
that for detecting aggression, the local visual concepts handgun
Multimodal Social Media Analysis for Gang Violence Prevention arXiv version, 2018, July
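A minimal sketch of how such sensitivities can be computed (per-fold linear SVM coefficients averaged over the 5 folds); the data structure is hypothetical.

```python
import numpy as np
from sklearn.svm import LinearSVC

def concept_sensitivities(fold_data, n_folds=5):
    """Average linear-SVM coefficients per concept over the folds.

    `fold_data[i]` is assumed to be (X_train, y_train), where the columns of
    X_train are the 9 local concept features (detected or ground truth).
    """
    coefs = []
    for X_train, y_train in fold_data[:n_folds]:
        svm = LinearSVC(class_weight="balanced").fit(X_train, y_train)
        coefs.append(svm.coef_.ravel())
    return np.mean(coefs, axis=0)   # one sensitivity score per concept
```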
Modality | Features | Fusion | Aggression: P R F1 AP | Loss: P R F1 AP | Substance use: P R F1 AP | mAP
- - (random baseline) - 0.25 0.26 0.26 0.26 0.17 0.17 0.17 0.20 0.18 0.18 0.18 0.20 0.23
- - (positive baseline) - 0.25 1.00 0.40 0.25 0.21 1.00 0.35 0.22 0.20 1.00 0.33 0.20 0.22
text linguistic features - 0.35 0.34 0.34 0.31 0.71 0.47 0.56 0.51 0.25 0.53 0.34 0.24 0.35
text CNN-char - 0.37 0.47 0.39 0.36 0.75 0.66 0.70 0.77 0.27 0.32 0.29 0.28 0.45
text CNN-word - 0.39 0.46 0.42 0.41 0.71 0.65 0.68 0.77 0.28 0.30 0.29 0.31 0.50
text all textual early 0.40 0.46 0.43 0.42 0.70 0.73 0.71 0.81 0.25 0.37 0.30 0.30 0.51
text all textual late 0.43 0.41 0.42 0.42 0.69 0.65 0.67 0.79 0.29 0.37 0.32 0.32 0.51
image inception global - 0.43 0.64 0.51 0.49 0.38 0.57 0.45 0.43 0.41 0.62 0.49 0.48 0.47
image Faster R-CNN local (0.1) - 0.43 0.64 0.52 0.47 0.28 0.56 0.37 0.31 0.44 0.30 0.35 0.37 0.38
image Faster R-CNN local (0.5) - 0.47 0.48 0.47 0.44 0.30 0.39 0.33 0.31 0.46 0.12 0.19 0.30 0.35
image all visual early 0.49 0.62 0.55 0.55* 0.38 0.57 0.45 0.44 0.41 0.59 0.48 0.48 0.49
image all visual late 0.48 0.51 0.49 0.52 0.40 0.51 0.44 0.43 0.47 0.52 0.50 0.51* 0.49
image+text all textual + visual early 0.48 0.51 0.49 0.53 0.72 0.73 0.73 0.82* 0.37 0.53 0.43 0.45 0.60
image+text all textual + visual late 0.48 0.44 0.46 0.53 0.71 0.67 0.69 0.80 0.44 0.43 0.43 0.48 0.60*
Table 2: Results for detecting the psychosocial codes aggression, loss and substance use. For each code we report precision (P), recall (R), F1-score (F1) and average precision (AP). Numbers shown are mean values of 5-fold cross validation performances. The highest performance (based on AP) for each code is marked with an asterisk. In bold and red we highlight all performances not significantly worse than the highest one (based on statistical testing with 95% confidence intervals).
Concept | Aggression: 0.1 0.5 GT | Loss: 0.1 0.5 GT | Substance use: 0.1 0.5 GT
handgun 0.73 0.93 1.05 0.06 0.10 0.06 0.06 0.09 0.11
long gun 0.26 0.91 1.30 -0.17 0.14 0.14 0.42 0.04 -0.47
joint 0.42 -0.08 0.05 -0.15 0.00 0.10 0.25 1.3 1.41
marijuana 0.17 0.18 0.12 -0.19 -0.45 -0.35 0.93 1.29 1.47
person 0.34 -0.01 -0.17 0.11 0.10 0.12 0.04 0.28 -0.01
tattoo -0.11 -0.09 0.01 -0.02 0.03 -0.03 0.04 0.06 -0.02
hand gesture 0.20 0.67 0.53 -0.01 0.12 0.05 0.01 0.06 -0.02
lean -0.07 0.03 -0.28 -0.20 -0.06 -0.14 0.68 0.59 1.46
money -0.06 0.06 -0.02 0.00 -0.01 -0.01 0.18 -0.04 -0.19
F1 0.51 0.46 0.65 0.37 0.33 0.38 0.34 0.17 0.76
AP 0.41 0.39 0.54 0.29 0.28 0.30 0.33 0.27 0.72
Table 3: Sensitivity of visual local concept based classifiers w.r.t. the different concepts. For each of the three psychosocial codes, we include two versions that use detected local concepts ("0.1" and "0.5", where the number indicates the detection score threshold) and one version that uses local concept annotations as input ("GT").
From classification using ground truth visual features we see that for detecting aggression, the local visual concepts handgun and long gun are important, while for detecting substance use, the concepts marijuana, lean and joint are most significant. For the code loss, marijuana is the most relevant visual concept and correlates negatively with loss, but overall, significance scores are much lower.
Interestingly, the model that uses the higher detection score threshold of 0.5 for the local visual concept detection behaves similarly to the model using ground truth annotations, even though the classification performance is better with the lower threshold. This could indicate that using a lower threshold makes the code classifier learn to exploit false alarms of the concept detector.
However, it needs to be mentioned that sensitivity analysis can only measure how much the respective classifier uses the different parts of the input, given the respective overall setting. This can give useful information about which parts are sufficient for obtaining comparable detection results, but there is no guarantee that the respective parts are also necessary for achieving the same classification performance.⁹
For this reason, we ran an ablation study to get quantitative
measurements of the necessity of local visual concepts for code classification.
5.5 Ablation study
In our ablation study we repeated the psychosocial code classification experiment using ground truth local visual concepts as features, excluding one concept at a time to check how this affects the overall performance of the model.
We found that for aggression, removing the concepts handgun or hand gesture leads to the biggest drops in performance, while for substance use, the concepts joint, marijuana and lean are most important. For loss, removing any single concept causes no significant change. See Table 4 for further details.
6 OPEN CHALLENGES
In this section, we provide a more in-depth analysis of what makes
our problem especially challenging and how we plan to address
those challenges in the future.
6.1 Local concepts analysis
We report in Table 5 the average precision results of our local concept detection approach on the "Complete" test set, i.e. joining data from both Twitter and Tumblr, and separately on the Twitter and Tumblr test sets. We compute the average precision on each test fold separately and report the average and standard deviation values over the 5 folds.

⁹ For example, imagine that two hypothetical concepts A and B correlate perfectly with a given class and a detector for this class is given both concepts as input. The detector could make its decision based on A alone, but A is not really necessary since the same result could be achieved by using B instead.
Removed concept | Aggression: F1 AP | Substance use: F1 AP
handgun -0.10 -0.15 -0.01 0.01
long gun -0.01 -0.01 -0.00 -0.00
joint 0.00 -0.00 -0.35 -0.28
marijuana 0.00 0.00 -0.09 -0.09
person -0.01 -0.01 -0.01 -0.00
tattoo 0.00 0.00 0.01 -0.00
hand gesture -0.13 -0.09 0.00 0.00
lean -0.00 0.00 -0.07 -0.07
money 0.00 0.00 0.00 0.00
Table 4: Dierences in psychosocial code detection perfor-
mance of detectors with specic local concepts removed as
compared to a detector that uses all local concept annota-
tions. (Numbers less than 0 indicate that removing the con-
cept reduces the corresponding score.) Bold font indicates
that the respective number is signicantly less than 0. For
the code loss none of the numbers was signicantly dier-
ent from 0, hence we decided to not list them in this table.
Concept | Complete: AP ±SD | Twitter: AP ±SD | Tumblr: AP ±SD
handgun 0.30 ±0.07 0.13 ±0.02 0.74 ±0.11
long gun 0.78 ±0.03 0.29 ±0.41 0.85 ±0.05
joint 0.30 ±0.07 0.01 ±0.01 0.57 ±0.04
marijuana 0.73 ±0.08 0.28 ±0.17 0.87 ±0.09
person 0.80 ±0.03 0.80 ±0.03 0.95 ±0.03
tattoo 0.26 ±0.06 0.08 ±0.02 0.84 ±0.06
hand gesture 0.27 ±0.05 0.28 ±0.04 0.83 ±0.29
lean 0.78 ±0.07 0.38 ±0.15 0.87 ±0.03
money 0.60 ±0.02 0.35 ±0.08 0.73 ±0.05
mAP 0.54 ±0.01 0.29 ±0.05 0.81 ±0.02
Table 5: Local concepts detection performance.
When looking at the results on the "Complete" test set, we see average precision values ranging from 0.26 on tattoo to 0.80 for person, and the mean average precision of 0.54 indicates a rather good performance. These results on the "Complete" test set hide two different stories, however, as the performance is much lower on the Twitter test set (mAP of 0.29) than on the Tumblr one (mAP of 0.81).
As detailed in Section 3.4, we crawled additional images, especially targeting the concepts with a low occurrence count in the Twitter data, as detailed in Table 1. However, crawling images from Tumblr by targeting keywords related to those concepts led us to gather images where the target concept is the main subject of the image, while in our Twitter images the concepts appear but are rarely the main element of the picture. By further manually analyzing the images crawled from Twitter and Tumblr, we have confirmed this "domain gap" between the two sources of data, which can explain the difference in performance. This highlights the challenges associated with detecting these concepts in our Twitter data. We believe the only solution is therefore to gather additional images from Twitter from similar users. This will be part of the future work of this research.
The local concepts are highly relevant for the detection of the codes aggression and substance use, as highlighted by the GT column in Table 3 and by the ablation study reported in Table 4. The aforementioned limitations of local concept detection on the Twitter data explain why the performance using the detected concepts is substantially lower than when using ground truth local concepts. We will therefore continue to work on local concept detection in the future, as it could provide significant help in detecting these two codes and would also help in providing a clear interpretability of our model.
6.2 Annotation analysis
In order to identify factors that led to divergent classifications between social work annotators and domain experts, we reviewed 10% of the disagreed-upon tweets with domain experts. In general, knowledge of local people, places, and behaviors accounted for the majority of disagreements. In particular, recognizing and having knowledge of someone in the image (including their reputation, gang affiliation, and whether or not they had been killed or incarcerated) was the most common reason for disagreement between our annotators and domain experts. Less commonly, identifying or recognizing physical items or locations related to the specific cultural context of the Chicago area (e.g., a home known to be used in the sale of drugs) also contributed to disagreement. The domain experts' nuanced understanding of hand signs also led to a more refined understanding of the images, which variably increased or decreased the perceived level of aggression. For example, knowledge that a certain hand sign is used to disrespect a specific gang often resulted in an increased perceived level of aggression. In contrast, certain hand gestures considered to be disrespectful by our social work student annotators (e.g., displaying upturned middle fingers) were perceived as neutral by domain experts and therefore not aggressive. Continuous exchange with the domain experts is therefore needed to ensure that the computer scientists remain aware of all these aspects when further developing their methods.
6.3 Ethical implications
Our team was approached by violence outreach workers in Chicago to begin creating a computational system that would enhance violence prevention and intervention. Accordingly, our automatic vision and textual detection tools were created to assist social workers in their efforts to understand and prevent community violence through social media, not to optimize any systems of surveillance. This shift away from identifying potentially violent users and toward understanding pathways to violent online content highlights systemic gaps in economic, educational, and health-related resources that are often root causes of violent behavior. Our efforts toward ethical and just treatment of the users who provide our data include encryption of all Twitter data, removal of identifying information during presentation of the work (e.g., altering text to eliminate searchability), and the inclusion of Chicago-based community members as domain experts in the analysis and validation of our findings. Our long-term efforts include using multimodal analysis to enhance current violence prevention efforts by providing insight into social media behaviors that may shape future physical altercations.
7 CONCLUSION
We have introduced the problem of multimodal social media analysis for gang violence prevention and presented a number of automatic detection experiments to gain insights into the expression of aggression, loss and substance use in tweets coming from this specific community, to measure the performance of state-of-the-art methods on detecting these codes in tweets that include images, and to analyze the role of the two modalities, text and image, in this multimodal tweet classification setting.
We proposed a list of general-purpose local visual concepts and showed that, despite the currently insufficient performance of local concept detection, these concepts can help the visual detection of aggression and substance use in tweets when combined with global visual features. In this context we also analyzed in depth the contribution of all individual concepts.
In general, we found that the relevance of the text and image modalities in tweet classification depends heavily on the specific code being detected, and we demonstrated that combining both modalities leads to a significant improvement of overall performance across all 3 psychosocial codes.
Findings from our experiments affirm prior social science research indicating that youth use social media to respond to, cope with, and discuss their exposure to violence. Human annotation, however, remains an important element of visual detection in order to understand the culture, context and nuance embedded in each image. Hence, despite promising detection results, we argue that psychosocial code classification is far from being solved by automatic methods. Here our interdisciplinary approach clearly helped us become aware of the full complexity of the task, and also to see the broader context of our work, including the important ethical implications discussed above.
ACKNOWLEDGMENTS
During his stay at CU, the first author was supported by a fellowship within the FITweltweit programme of the German Academic Exchange Service (DAAD). Furthermore, we thank all our annotators: Allison Aguilar, Rebecca Carlson, Natalie Hession, Chloe Martin, and Mirinda Morency.
REFERENCES
[1]
2017. Click, Pre Crime, Chicago’s crime-predicting software. (May 2017). http:
//www.bbc.co.uk/programmes/p052l7
[2]
2017. Strategic Subject List | City of Chicago | Data Portal. (2017). https:
//data.cityofchicago.org/Public-Safety/Strategic- Subject-List/4aki-r3np Online;
Accessed April 2018.
[3]
R Atkinson and J Flint. 2001. Accessing Hidden and Hard-to-reach Populations:
Snowball Research Strategies. Social Research Update 33 (2001). http://eprints.
gla.ac.uk/37493/
[4]
Terra Blevins, Robert Kwiatkowski, Jamie Macbeth, Kathleen McKeown,
Desmond Patton, and Owen Rambow. 2016. Automatically processing Tweets
from gang-involved youth: Towards detecting loss and aggression. In Proceedings
of COLING 2016, the 26th International Conference on Computational Linguistics:
Technical Papers. 2196–2206.
[5]
Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. 2016. R-fcn: Object detection via
region-based fully convolutional networks. In Advances in neural information
processing systems. 379–387.
[6]
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. 2009. Im-
ageNet: A large-scale hierarchical image database.. In CVPR. IEEE Computer
Society, 248–255.
[7] Timothy Dozat. 2016. Incorporating nesterov momentum into adam. (2016).
[8]
M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. 2010.
The Pascal Visual Object Classes (VOC) Challenge. International Journal of
Computer Vision 88, 2 (June 2010), 303–338.
[9]
Matthew S. Gerber. 2014. Predicting crime using Twitter and kernel density
estimation. Decision Support Systems 61 (2014), 115 – 125.
DOI:
http://dx.doi.org/
https://doi.org/10.1016/j.dss.2014.02.003
[10] Ross Girshick. 2015. Fast R-CNN. In Computer Vision (ICCV), 2015 IEEE Interna-
tional Conference on. IEEE, 1440–1448.
[11] Alex Hanna. 2017. MPEDS: Automating the Generation of Protest Event Data. (2017).
[12] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, and others. 2017. Speed/accuracy trade-offs for modern convolutional object detectors. In IEEE CVPR.
[13] Max Kapustin, Jens Ludwig, Marc Punkay, Kimberley Smith, Lauren Speigel, and David Welgus. 2017. Gun Violence in Chicago, 2016. Chicago, IL: University of Chicago Crime Lab (2017).
[14] Katherine A Keith, Abram Handler, Michael Pinkham, Cara Magliozzi, Joshua McDuffie, and Brendan O'Connor. 2017. Identifying civilians killed by police with distantly supervised entity-event extraction. arXiv preprint arXiv:1707.07086 (2017).
[15] Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014).
[16] Jeffrey Lane. 2016. The digital street: An ethnographic study of networked street life in Harlem. American Behavioral Scientist 60, 1 (2016), 43–58.
[17] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European conference on computer vision. Springer, 740–755.
[18] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. 2016. SSD: Single shot multibox detector. In European conference on computer vision. Springer, 21–37.
[19] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and Their Compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 (NIPS'13). Curran Associates Inc., USA, 3111–3119. http://dl.acm.org/citation.cfm?id=2999792.2999959
[20] A. Nellis, J.A. Greene, M. Mauer, and Sentencing Project (U.S.). 2008. Reducing Racial Disparity in the Criminal Justice System: A Manual for Practitioners and Policymakers. Sentencing Project. https://books.google.com/books?id=MQKznQEACAAJ
[21] Citizens Crime Commission of New York City. 2017. E-Responder: a brief about preventing real world violence using digital intervention. (2017). www.nycrimecommission.org/pdfs/e-responder-brief-1.pdf
[22] Desmond Upton Patton, Robert D Eschmann, and Dirk A Butler. 2013. Internet banging: New trends in social media, gang violence, masculinity and hip hop. Computers in Human Behavior 29, 5 (2013), A54–A59.
[23] Desmond U Patton, Jeffrey Lane, Patrick Leonard, Jamie Macbeth, and Jocelyn R Smith Lee. 2017. Gang violence on the digital street: Case study of a South Side Chicago gang member's Twitter communication. new media & society 19, 7 (2017), 1000–1018.
[24] Desmond Upton Patton, Kathleen McKeown, Owen Rambow, and Jamie Macbeth. 2016. Using Natural Language Processing and Qualitative Analysis to Intervene in Gang Violence: A Collaboration Between Social Work Researchers and Data Scientists. arXiv preprint arXiv:1609.08779 (2016).
[25] Ellie Pavlick, Heng Ji, Xiaoman Pan, and Chris Callison-Burch. 2016. The Gun Violence Database: A new task and data set for NLP. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 1018–1024.
[26] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[27] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. 2014. CNN features off-the-shelf: an astounding baseline for recognition. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on. IEEE, 512–519.
[28] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition. 779–788.
[29] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2017. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 6 (2017), 1137–1149.
[30] Christine Schmidt. 2018. Holding algorithms (and the people behind them) accountable is still tricky, but doable. (Mar 2018). http://nie.mn/2ucDuw2 Online; Accessed April 2018.
[31] Karen Sheley. 2017. Statement on Predictive Policing in Chicago. (Jun 2017). http://www.aclu-il.org/en/press-releases/statement-predictive-policing-chicago
[32] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. 2818–2826. DOI:http://dx.doi.org/10.1109/CVPR.2016.308