
ORIGINAL ARTICLE

Finding Patterns with a Rotten Core:

Data Mining for Crime Series with Cores

Tong Wang,1 Cynthia Rudin,1,* Daniel Wagner,2 and Rich Sevieri2

Abstract

One of the most challenging problems facing crime analysts is that of identifying crime series, which are sets of

crimes committed by the same individual or group. Detecting crime series can be an important step in predictive

policing, as knowledge of a pattern can be of paramount importance toward ﬁnding the offenders or stopping the

pattern. Currently, crime analysts detect crime series manually; our goal is to assist them by providing automated

tools for discovering crime series from within a database of crimes. Our approach relies on a key hypothesis that

each crime series possesses at least one core of crimes that are very similar to each other, which can be used to

characterize the modus operandi (M.O.) of the criminal. Based on this assumption, as long as we ﬁnd all of the

cores in the database, we have found a piece of each crime series. We propose a subspace clustering method,

where the subspace is the M.O. of the series. The method has three steps: We ﬁrst construct a similarity graph to

link crimes that are generally similar, second we ﬁnd cores of crime using an integer linear programming approach,

and third we construct the rest of the crime series by merging cores to form the full crime series. To judge whether a

set of crimes is indeed a core, we consider both pattern-general similarity, which can be learned from past crime

series, and pattern-speciﬁc similarity, which is speciﬁc to the M.O. of the series and cannot be learned. Our method

can be used for general pattern detection beyond crime series detection, as cores exist for patterns in many domains.

Key words: crime series detection; subspace clustering; clustering with feature selection; pattern mining; dense

core sets; similarity graph

Introduction

One of the most important problems in crime analysis

is that of crime series detection, or the detection of a set

of crimes committed by the same individual or group.

Criminals follow a modus operandi (M.O.) that char-

acterizes their crime series; for instance, some crimi-

nals operate exclusively during the day, others work

at night, some criminals target apartments for house-

breaks, while others target single family houses. Crime an-

alysts need to identify these crime series within police

databases at the same time as they identify the M.O.

Currently, analysts identify crime series by hand:

They manually search through the data using database

queries trying to locate patterns, which (as Ref. notes)

can be very challenging and time-consuming. From a

computational perspective, crime series detection is a

very difﬁcult problem: It is a clustering problem with

cluster-speciﬁc feature selection, where the set of fea-

tures for the cluster is the M.O. of the criminal(s).

One cannot know the M.O. of the series without deter-

mining which crimes were involved, and one cannot

locate the set of crimes without knowing the M.O.—

the M.O. and the set of crimes need to be determined

simultaneously. Pattern analysis for crime has existed

at least as far back as the 1840s,2 but recently, big

data has changed the whole landscape for crime analy-

sis. To locate crime patterns, analysts now rely on large

amounts of very detailed data about large numbers of

past crimes. The computational problem of ﬁnding

crime series grows exponentially with the numbers of

1 Massachusetts Institute of Technology, Cambridge, Massachusetts.

2 Cambridge Police Department, Cambridge, Massachusetts.

*Address correspondence to: Cynthia Rudin, Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, MA 02139, E-mail: rudin@mit.edu

© Tong Wang et al. 2015; Published by Mary Ann Liebert, Inc. This Open Access article is distributed under the terms of the Creative Commons

License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided

the original work is properly credited.

Big Data

Volume 3 Number 1, 2015

Mary Ann Liebert, Inc.

DOI: 10.1089/big.2014.0021


crimes and features of crimes. Even if a crime analyst

could identify crime series within his/her own depart-

ment database (which is already difﬁcult), these manual

efforts cannot scale, for instance, when neighboring po-

lice departments start to combine databases.

Despite its critical importance for public safety, there

is little in the way of tools to help police. Most predic-

tive policing software has the capability to detect only

general background levels of crime density in time

and space, which are much easier to predict than spe-

ciﬁc patterns of crime. This is called hotspot prediction,

and it requires only time and location data, and a density

estimation algorithm: Crimes are predicted to occur

based on where and when they occurred in the past.

No detailed information about the M.O. of the crimes

is used for hotspot prediction, and hotspots can involve

crimes committed by different offenders (as opposed

to crime series, which have the same offenders). In

fact, at least in the case of Cambridge, Massachusetts,

crime series generally do not take place in hotspots

(see also Ref. 3 for a study of crime series in two other

U.S. cities). Hotspot prediction software is not what

we need for the problem of crime series detection.

The problem of crime series detection is much more

data intensive and computationally harder than hotspot

prediction. To determine the M.O., we require very ﬁne-

grained detailed information about past crimes: where

the offender(s) entered (front door, window, etc.); how

they entered (pried doors, forced doors, unlocked doors,

pushed in air conditioner, etc.), whether they ransacked

the premise, the type of premise, etc. This is high-dimensional

structured data, leading to a computationally

challenging data mining problem of ﬁnding all crimes

that are similar to each other in several ways.

Crime series detection results can be used for many

purposes. First, if a pattern is identiﬁed, investigative re-

sources can be prioritized to focus on the crime series,

to gather and assemble evidence. For instance, video re-

cordings from nearby streets and stores or banks can be

examined for evidence of the offender’s location with

respect to all crimes in the series, and combined with

suspect descriptions from witnesses (if any). Call-detail

records from cell phones can also be used for this pur-

pose, as well as latent ﬁngerprints and tracking infor-

mation from stolen property. Second, if a current

crime series is localized enough to have predictable

times and locations (for instance, a pickpocket targeting

a single café at regular intervals throughout the week),

actions can be taken to stop the pattern. Third, if a cur-

rent crime series has the same M.O. as a past crime

series for which the offender is known, then the suspect

from the older series could be a potential offender for

the current crime series. Fourth, crime series detection

results can be used to study criminal behavior gener-

ally. There is strong evidence that a majority of crimes

are committed by a small number of serial criminals,4

which underscores the importance of identifying and

studying the patterns of serial offenders.

Conversely, we remark that nothing can be done if

police do not know that a pattern is occurring at all.

Without the capability of automatically detecting a spe-

ciﬁc series of crime, it is possible that crime series may

take much longer to identify, or may never be identi-

ﬁed. This is especially problematic for certain types of

crime—for instance, for housebreaks (burglaries) there

is often no suspect information since the crimes take

place when residents are not present. Housebreaks

can be extremely difficult crimes to solve, and nationwide

only 14% of housebreaks are solved.5 In this article,

we aim directly at identifying series in housebreaks

in an automated way.

The main hypothesis in this work is that most crime

patterns have a core of crimes that exemplify the M.O.

of the series. A core might be approximately three or

four crimes that are very similar to each other in

many different ways. According to our hypothesis that

each crime series has a core, if we can locate all small

cores of crime within the dataset, we have thus located

pieces of most crime series. This hypothesis is based on

the intuition of analysts and has the dual purpose of

assisting with computation: We can indeed consider

all reasonable small subsets to calculate whether they

are plausible cores.

Though we cannot determine the M.O. of a series

before we see it, we can characterize parts of the

M.O.s we expect generally—for instance, crimes in a

series are often close in time and space. We deﬁne a

pattern-general similarity that encodes factors that are

generally common to most crime series. It is learned

from past crime series; for instance, proximity in time

and space highly contribute to the pattern-general sim-

ilarity. The pattern-general similarity induces a similar-

ity graph over the set of crimes, where there is an edge

between two crimes when they are similar according to

the pattern-general similarity.

Crimes in a series also have pattern-speciﬁc similar-

ity, where crimes are similar to each other if they all

share the same M.O. The pattern-speciﬁc similarity

and pattern-general similarity (the similarity graph in

particular) are used for detecting cores: A core must


have both sufﬁciently large pattern-general similarity

and pattern-speciﬁc similarity. The cores are found

using an integer linear programming (ILP) formula-

tion. Once the cores are found, we construct the full

crime series by merging overlapping cores. We also

prove that merging cores preserves desired properties.

Our method is a general subspace clustering method

that can be used beyond crime series, and for other com-

pletely different application domains. It is novel in that it

considers both pattern-general and pattern-specific aspects

of patterns, where "pattern-general" means aspects

that are common to many crime series (supervised from

past clusters), and "pattern-specific" means aspects

of a particular crime series (unsupervised). Our method

does not force all examples to be part of a cluster and

can thus accommodate background (non-series) crime.

Our method also can ﬁnd subspace clusters whose sub-

space morphs dynamically over the cluster; an M.O. can

change when an offender becomes more sophisticated,

adapts his M.O. to avoid detection, or when his preferred

method is not available (e.g., he prefers entry through

unlocked rear doors but will push in the air conditioner

if his preferred means of entry is not available).

This project highlights several important aspects of

big data analytics. It highlights an emphasis on human

collaboration within the analysis pipeline, which is a

key challenge for big data (see, for instance, the CCC

white papers6). This project also exemplifies a separate

challenge, namely that of increased complexity. Once

our problem is formulated, solving it requires a combi-

natorially hard optimization problem and an explosion

of variables. We could, for instance, add a ‘‘V’’ for ‘‘Var-

iables’’ to the traditional V’s of big data in order to

capture problems that require massive computation

to solve a problem of increased complexity. The big

data challenge of problems with increased complexity

has arisen in other places (see, for instance, the American

Statistical Association’s white paper on big data7).

Our three methodology sections follow the three

main components of our method: learn a similarity

graph, mine cores, and merge cores. We then show experiments

where our method was tested on the

full housebreak database from the Cambridge Police

Department containing detailed information from

thousands of crimes from over a decade. This method

has been able to provide new insights into true patterns

of crime committed in Cambridge. One observation

that has been revealed in our experiments is that com-

puters do not have the same biases that humans do.

Computers search through the database in a different

way than a human would, and thus can work symbiot-

ically with analysts, showing them avenues to consider

where they would not have normally ventured.

Related work

There are at least three main types of recent approaches

to identifying crimes committed by the same individual

or group (according to Ref. 8).

The ﬁrst approach, sometimes known as pairwise

case linkage, involves identifying whether a pair of

crimes was committed by the same group, where

each pair is considered separately. Some of these works

use weights determined by experts9–12 to weight the

various types of similarities between crimes, and other

works learn the similarity weights from data.13–23 The

problem with considering only pairwise linkages is

that only one similarity measure between crimes is

used. This has a fundamental ﬂaw that it does not

consider M.O.’s of individual crime series. Consider

two crime series with different M.O.’s: one has an

M.O. in which the means of entry is to push in an

air conditioner on a ground ﬂoor apartment (in build-

ings that are not concentrated geographically), and

the M.O. of the other crime series involves breaking

into a single family home at night (not using a consis-

tent entry method) while the residents are present in a

small geographical area. If we are investigating whether

a new crime ﬁts into the ﬁrst series, logically we should

consider whether the suspect entered by pushing in an

air conditioner of an apartment and we should not

consider geographical area. When we are investigating

the second series, we should consider geographical

area, time of day, and whether residents were present;

we should not consider means or location of entry. If

we use a single measure to judge similarity between

any two crimes (as in pairwise case linkage ap-

proaches), it is not possible to make this distinction.

In pairwise linkage approaches, it is not possible to ig-

nore certain types of information about similarity and

pay attention to others, as we claim is necessary for

understanding multiple different M.O.’s. As a result,

these approaches generally ﬁnd that time and space

are the only relevant dimensions for this type of analysis

(see, for instance, Ref. 20), and end up ignoring the

detailed behavioral information.

The second type of approach is called reactive linkage,

where a crime series is discovered one crime at a time,

starting from a seed of one or more crimes (as deﬁned

by Refs. 1,8). Reactive linkage has a similar problem to

pairwise linkage in that when we start to grow a set


of crimes greedily from a small seed of one or two

crimes, we cannot yet know the M.O. of the crime se-

ries. This means that yet again we start from a common

similarity metric between crimes. The greediness of this

approach, coupled with the fact that the distance metric

does not take into account the M.O., could lead to

problematic results. Our past approach to this24,25 did

use a theoretically motivated26 greedy method, but

adapted the distance metric to be closer to the observed

M.O. as more crimes were added to the set. This still

does not solve the problem of greediness, and the result

could depend on which crimes were used as seed crimes.

On the other hand, these greedy methods are very com-

putationally tractable.

In the last type of approach, crime series clustering, all

the clusters are found simultaneously.8,27–35 One of the

earliest approaches we know of for clustering crimes is

that of Dahbur and Muscarello,29 who used a neural network

approach. (This method had some serious flaws

that required extensive heuristic post-processing after

the clusters were created, but aimed at solving the

more general problem of crime clustering.) Most of

these approaches use a form of hierarchical clustering,

which again has the disadvantage that the distance met-

ric between crimes is static and does not reﬂect the M.O.

of any particular crime series. (On the other hand, hier-

archical clustering is very computationally tractable

and could be made to reﬂect pattern-general similar-

ity, though not pattern-speciﬁc similarity.) Some of

these approaches reduce features one at a time through

hypothesis tests, or use basic dimensionality reduction

(multidimensional scaling) techniques before cluster-

ing. This still does not handle pattern-speciﬁc aspects,

and thus, cannot capture the M.O.

A main point of the present work is that in order to

model the M.O., we need to use some form of subspace

clustering. The M.O. for the pattern is precisely the sub-

space; it is the set of dimensions for which crimes in a

particular series should be considered to be similar.

For the ﬁrst crime series in the example above, we

would consider mainly the subspace "means of

entry" involving the entry by pushing in air conditioners,

and we would not heavily consider the dimension

for geographical area. The algorithm must determine

which observations go into which cluster, and which

subspace is relevant for each cluster. Our work relates

to various subﬁelds of clustering (see, for instance,

Ref. 36), including pattern-based clustering,37,38 which

is a semi-supervised approach (unlike ours; we do not

use test data at training time); general subspace clustering

(e.g., Refs. 39,40), which detects all clusters in all subspaces;

and work at the intersection of dense subgraph

mining and pattern mining in feature graphs.41,42

These methods would not be able to take into account

the complexities we handle, for instance, learning the

pattern-general weights, or ﬁnding cores with the char-

acteristics we require. Other work on space-time event

detection is relevant,43,44 where the goal is to detect patterns

such that the frequency of records is higher than

an expected frequency. A relevant method for clustering

with simultaneous feature selection is that of

Guan et al.,45 which is similar to our core detector in

ﬂavor, although feature selection is controlled differ-

ently, and there are no pattern-general aspects.

Reviews of other data mining applications in crime

analysis are those of Chen et al.46 and Thongtae and

Srisuk.47 There are several other works that use machine

learning for various criminology applications.48–50

Model Formulation

Our data consist of entities (crimes) $V$, each of which

has a vector of features. We define $D_j$ as the set of possible

values for feature $j$, $j = 1, \dots, J$. In our specific database

we have features including time, space, whether

the crime occurred on a weekday, location of entry

(front door, back door, ground window, etc.), means

of entry (shoved, pried, forced, etc.), type of premise

(apartment, single-family house, etc.), indicator vari-

able for ‘‘ransacked,’’ suspect information (which is

rarely present), and so on. Although the approach we

introduce below is general and can be applied to prob-

lems beyond crime data mining, we use terminology

speciﬁc to our problem in our exposition.

Define $s_j$ to be a symmetric similarity function on the $j$th feature, $s_j : D_j \times D_j \to [0, 1]$, where $s_j(v_\ell, v_k)$ measures the similarity between crimes $v_\ell$ and $v_k$ in the $j$th feature. In our case most features are categorical, some are ranges, and some are spatial coordinates. For instance, if two crimes have location of entry as "window," then they would get a high location-of-entry similarity. The similarity measures are discussed in depth in our earlier work,24,25 so we will not go into detail here. The average similarity of a set of crimes in feature $j$ is defined as the cohesion of the set:

Definition 1. (Cohesion$_j$) For a set of crimes $V$, the cohesion in the $j$th feature is the mean of pairwise similarities,

$$\mathrm{Cohesion}_j(V) = \frac{1}{|V|(|V|-1)} \sum_{v_\ell, v_k \in V} s_j(v_\ell, v_k).$$


The deﬁning features of a pattern are those with suf-

ﬁciently high cohesion.

Definition 2. (Defining feature) Defining features for $V$ are those that satisfy $\mathrm{Cohesion}_j(V) \ge \theta_j$. The set of defining features for set $V$ is denoted by $\Lambda(V)$.
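Definitions 1 and 2 can be illustrated with a minimal Python sketch. The toy crimes, the two similarity functions, and the threshold values are hypothetical, and the sketch assumes that the mean in Definition 1 runs over ordered pairs of distinct crimes, which is what the normalizer |V|(|V|-1) suggests.

```python
from itertools import permutations


def cohesion(crimes, sim_j):
    """Mean similarity in one feature over ordered pairs of distinct crimes.

    Assumption: a crime is not paired with itself (Definition 1 does not
    spell this out)."""
    n = len(crimes)
    if n < 2:
        raise ValueError("cohesion needs at least two crimes")
    total = sum(sim_j(a, b) for a, b in permutations(crimes, 2))
    return total / (n * (n - 1))


def defining_features(crimes, sims, thresholds):
    """Indices of features whose cohesion reaches the per-feature threshold."""
    return {j for j, (sim_j, threshold) in enumerate(zip(sims, thresholds))
            if cohesion(crimes, sim_j) >= threshold}


# Hypothetical toy data: (location of entry, hour of day).
crimes = [("window", 14), ("window", 15), ("window", 2)]


def same_entry(a, b):
    return 1.0 if a[0] == b[0] else 0.0


def close_hour(a, b):
    return max(0.0, 1.0 - abs(a[1] - b[1]) / 12.0)


features = defining_features(crimes, [same_entry, close_hour], [0.9, 0.9])
```

In this toy set only the location-of-entry feature is a defining feature: all three crimes share the entry point, while the hours of day are too spread out.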

The deﬁning features characterize the M.O. of the

crime series. If several housebreaks happen in the

same neighborhood, around the same time of day,

within the same month, and the location of entry is al-

ways a window, regardless of the differences in other

features, these similarities would indicate that the crimes

could have been committed by the same offender. The

features ‘‘geographic location,’’ ‘‘time of day,’’ ‘‘time be-

tween crimes,’’ and ‘‘location of entry’’ characterize the

M.O. for this particular crime series. When we later de-

ﬁne cores, the pattern-speciﬁc statistic of interest is the

number of deﬁning features of V.

Let us switch from pattern-speciﬁc deﬁnitions to

pattern-general deﬁnitions. We will learn a pattern-

general similarity function from past crime series.

Pattern-general similarity allows us to weight impor-

tant features highly; for instance, crimes that are

spread very far apart in time and space are unlikely

to be a pattern, and thus we will learn from past

crime series that time and space are important features.

We will learn a set of weights $[\lambda_1, \dots, \lambda_j, \dots, \lambda_J]$

from past crime data that will provide the importance

of each feature within a linear combination.

We will search for cores that have high pattern-gen-

eral cohesion, deﬁned as follows:

Definition 3. (Pattern-general cohesion) The pattern-general cohesion of $V$ is the weighted sum of cohesions over the features: $\sum_{j=1}^{J} \lambda_j \mathrm{Cohesion}_j(V)$.

Our cores will also be connected in the similarity

graph. To construct edges for the graph, we ﬁrst deﬁne

a metric to measure the similarity between two crimes,

as follows:

Definition 4. (Pattern-general similarity) The pattern-general similarity is a weighted sum of similarity measures for each feature, using the pattern-general weights $[\lambda_1, \lambda_2, \dots, \lambda_J]$.

$$\gamma(v_\ell, v_k) = \sum_{j=1}^{J} \lambda_j s_j(v_\ell, v_k). \tag{1}$$

We deﬁne the similarity graph to contain edges be-

tween crimes that are sufﬁciently close in the pattern-

general sense.

Definition 5. (Similarity graph) A similarity graph is an undirected graph $G = (V, E)$, where $V = \{v_1, v_2, \dots, v_n\}$ is the node set, $E$ is the edge set, $E = \{\{v_\ell, v_k\} \mid \gamma(v_\ell, v_k) \ge \Delta,\ v_\ell, v_k \in V,\ v_\ell \ne v_k\}$.
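As a rough illustration of Definitions 4 and 5, the following Python sketch builds the edge set of a similarity graph by thresholding a weighted sum of per-feature similarities. The crimes, similarity functions, weights, and cut-off value are hypothetical; in the article the weights and cut-off are learned from past crime series.

```python
from itertools import combinations


def pattern_general_similarity(a, b, sims, weights):
    """Weighted sum of per-feature similarities between two crimes."""
    return sum(w * sim_j(a, b) for w, sim_j in zip(weights, sims))


def similarity_graph(crimes, sims, weights, cutoff):
    """Undirected edges (i, k), i < k, between crimes whose weighted
    similarity is at least the cut-off."""
    edges = set()
    for i, k in combinations(range(len(crimes)), 2):
        if pattern_general_similarity(crimes[i], crimes[k], sims, weights) >= cutoff:
            edges.add((i, k))
    return edges


# Hypothetical toy data: (location of entry, hour of day).
crimes = [("window", 14), ("window", 15), ("door", 3)]


def same_entry(a, b):
    return 1.0 if a[0] == b[0] else 0.0


def close_hour(a, b):
    return max(0.0, 1.0 - abs(a[1] - b[1]) / 12.0)


edges = similarity_graph(crimes, [same_entry, close_hour], [0.4, 0.6], 0.8)
```

Only the first two crimes end up linked here, since the third differs in both entry location and time of day.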

For us to even consider whether $V$ should be a core

of a series, the core needs to be connected in the graph

theoretic sense (and is not composed of two separate

clusters for instance), must have sufﬁcient connectivity,

and must have sufﬁciently high pattern general cohe-

sion. The requirement that a core must be a connected

set means that each crime in a core must be similar in

pattern-general similarity to at least one other crime in

the core.

We learn both the weights $\{\lambda_j\}_j$ and cut-off threshold

$\Delta$ from data, as discussed in Section 3 below, in order to

perform the ﬁrst piece of the method, which is to con-

struct the similarity graph. Note that we could eliminate

the first piece of the method, and eliminate the pattern-general

similarity altogether by setting the threshold $\Delta$

to be very low, and that way we consider only pattern-

speciﬁc similarity; this, however, would not ease compu-

tation or restrict the cores to be similar to those from

other patterns. Imposing a graph structure on the crimes

eases computation, in the sense that we are now only

looking for connected sets of the graph when we mine

for cores in the second part of the method.

The second part of the method is to mine for cores.

When we mine for cores, we cannot simply maximize

the number of deﬁning features and/or the pattern-

general cohesion over all subsets of crime, as it would

favor choosing very small sets of crime as cores. To alleviate

this problem, we specify the size of the core $|V|$

and the number of defining features $d$, and find cores

that maximize the pattern-general cohesion. Once we

ﬁnd the cores, we merge them to ﬁnd the rest of the

pattern. We call patterns formed by merging other pat-

terns together serieslike patterns. An illustration of a

serieslike pattern is shown in Figure 1.

Learning the Similarity Graph

As we discussed, the similarity graph is constructed by connecting pairs of nodes with pattern-general similarity above a threshold $\Delta$. The pattern-general similarity $\gamma$ is defined in (1) as a weighted sum of pairwise similarities in different features, with pattern-general weights $\lambda \in \mathbb{R}^J$. The coefficients $\lambda$ and the threshold $\Delta$ are parameters

that we learn from data. These data consist of 51 histor-

ical crime series that have been identiﬁed as true crime

series by crime analysts.


To learn the weights, we optimize over the historical

training patterns to make them as close as possible to

being connected subgraphs. This means we want each

crime in a historical pattern to be close to at least one

other crime in the same pattern. At the same time,

we want crimes within a historical pattern to be distant

from crimes not in the same pattern. The condition for

crime $\ell$ to be close to at least one other crime within its pattern (with some slack $\epsilon_\ell$) is:

$$\max_{\{k \,:\, k \in \mathrm{pattern}(\ell)\}} \sum_{j=1}^{J} \lambda_j s_j(\ell, k) \ge \Delta - \epsilon_\ell. \tag{2}$$

Conversely, crimes that do not belong to the same

pattern should not be very similar:

$$\sum_{j=1}^{J} \lambda_j s_j(\ell, k) \le \Delta + \xi_{\ell k}, \tag{3}$$

true for all $\ell$ and $k$ such that $k \notin \mathrm{pattern}(\ell)$. Our goal is

to minimize the total weighted slack. Informally, we

compute:

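$$\min_{\lambda, \Delta} \sum_{\substack{\text{crimes in the} \\ \text{same pattern}}} \text{slack} \;+\; C_1 \sum_{\substack{\text{crimes not in the} \\ \text{same pattern}}} \text{slack} \;+\; C_0 \|\lambda\|_0,$$

where $\|\lambda\|_0$ is the $\ell_0$ semi-norm of $\lambda$, which encourages sparsity in $\lambda$. The constant $C_1$ is set very small, so most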

edges that should be present will be present, and we

consider removal of unnecessary edges as a secondary

goal. Removal of these unnecessary edges is where

we will gain a computational beneﬁt later on. The

more edges we eliminate, the fewer connected subsets

we need to evaluate as being possible cores.
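To make the informal objective concrete, the sketch below evaluates it for a fixed choice of weights and cut-off. It assumes, beyond what the text states, that slacks are nonnegative, so each slack is the smallest value that makes its condition hold. The names `S`, `pattern`, `c1`, and `c0` and all numbers are hypothetical.

```python
def weighted_similarity(S, weights, l, k):
    """Weighted sum over features; S[j][l][k] is the similarity in feature j."""
    return sum(w * S_j[l][k] for w, S_j in zip(weights, S))


def slack_objective(S, pattern, weights, cutoff, c1, c0):
    """Within-pattern slack + c1 * cross-pattern slack + c0 * (nonzero weights).

    Assumption: slacks are nonnegative."""
    m = len(pattern)
    within = 0.0
    across = 0.0
    for l in range(m):
        peers = [k for k in range(m) if k != l and pattern[k] == pattern[l]]
        if peers:
            # Crime l only needs its closest peer to reach the cut-off.
            best = max(weighted_similarity(S, weights, l, k) for k in peers)
            within += max(0.0, cutoff - best)
        for k in range(m):
            if pattern[k] != pattern[l]:
                # Crimes in different patterns should stay below the cut-off.
                across += max(0.0, weighted_similarity(S, weights, l, k) - cutoff)
    sparsity = sum(1 for w in weights if w != 0)
    return within + c1 * across + c0 * sparsity


# Three crimes, two features; crimes 0 and 1 form one historical pattern.
S = [
    [[1.0, 0.9, 0.2], [0.9, 1.0, 0.8], [0.2, 0.8, 1.0]],
    [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]],
]
pattern = [0, 0, 1]
value = slack_objective(S, pattern, [1.0, 0.0], 0.7, c1=0.1, c0=0.01)
```

A solver would search over the weights and cut-off to minimize this quantity; the function above only scores one candidate.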

We propose a mixed-integer programming (MIP)

formulation for solving this. In what follows, decision

variables $y_{\ell k}$ are binary, and they select a crime $k$ that is most similar to a given crime $\ell$ within the same pattern. That is, they encode the max from constraint (2).

The formulation is:

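$$\min_{\lambda,\, \Delta,\, \{y_{\ell k}\}_{\ell,k},\, \{b_j\}_j} \;\; \sum_{\ell \in \{1, 2, \dots, m\}} \epsilon_\ell \;+\; C_1 \sum_{\ell \in \{1, 2, \dots, m\}} \; \sum_{\{k \,:\, k \notin \mathrm{pattern}(\ell)\}} \xi_{\ell k} \;+\; C_0 \sum_{j=1}^{J} b_j$$

such that

\begin{align}
\sum_{j=1}^{J} \lambda_j s_j(\ell, k) &\ge (\Delta - \epsilon_\ell) + M (y_{\ell k} - 1) && \forall \ell,\ \forall k \text{ s.t. } k \in \mathrm{pattern}(\ell) \tag{4} \\
\sum_{k \,:\, k \in \mathrm{pattern}(\ell)} y_{\ell k} &= 1 && \forall \ell \tag{5} \\
y_{\ell k} &\in \{0, 1\} && \forall \ell, k \tag{6} \\
\sum_{j=1}^{J} \lambda_j s_j(\ell, k) &\le \Delta + \xi_{\ell k} && \forall \ell,\ \forall k \text{ s.t. } k \notin \mathrm{pattern}(\ell) \tag{7} \\
\sum_{j=1}^{J} \lambda_j &= 1 \tag{8} \\
\lambda_j &\ge 0 && \forall j \tag{9} \\
\lambda_j &\le b_j && \forall j \tag{10} \\
b_j &\in \{0, 1\} && \forall j. \tag{11}
\end{align}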

Constraint (4) comes from (2). It forces $y_{\ell k}$ to be chosen correctly so that if $k$ is the crime closest to $\ell$, then $y_{\ell k}$ will be 1. This is because $\epsilon_\ell$ is minimized within the objective, so $y_{\ell k}$ is necessarily going to correspond to the index where $\epsilon_\ell$ is minimized, and where the similarity is maximized within (4). Constraints (5) and (6) further define the $y_{\ell k}$'s by stating that a crime in the pattern needs only to be connected to one closest neighbor in the pattern (this is the requirement of connectivity), and its entries are binary. Constraint (7) comes from (3). The value $\sum_j b_j$ is the $\ell_0$ norm of $\lambda$. The $b_j$'s are decision variables, where $\lambda_j$ can be nonzero only if $b_j = 1$. This formulation

is linear, and thus using MIP technology there is a guar-

antee on the optimality of the solution. In particular, the

solver will provide the duality gap, and when it is zero,

we know that the optimal solution to the optimization

problem has been attained.

FIG. 1. A d-serieslike pattern consisting of three d-cores of various sizes.


Cores

A $d$-core is a set of crimes that exhibit similarity in a

feature subspace of $d$ dimensions. A $d$-core has $d$ defining

features that are not predetermined. Further,

crimes in a d-core need to be well connected in the

similarity graph. Formally, the deﬁnition is as follows.

Definition 6. ($d$-core) A similarity graph $G = (V, E)$ with density threshold $\alpha$ is called a $d$-core if it satisfies the core constraints:

Pattern specific constraint: the size of the defining feature set of the graph is equal to $d$, $|\Lambda(G)| = d$.

Pattern general constraints: $G$ is connected, and $G$ is dense, $\frac{|E|}{|V|(|V|-1)} \ge \alpha$. That is, the fraction of possible edges in the graph exceeds $\alpha$.

Two parameters, $d$ and $\alpha$, control the property of

the core in a pattern-speciﬁc and pattern-general way,

respectively.

The pattern-general constraints should be thought

of as being much looser than the pattern-speciﬁc con-

straints, as we often include unnecessary edges in the

similarity graph. For the set of crimes to be feasible

in the pattern-speciﬁc sense is much more difﬁcult, as

crimes in the pattern need to be similar to each other

in $d$ separate ways.
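The core constraints of Definition 6 can be checked for a candidate set with a short Python sketch. It assumes that per-feature cohesions of the candidate have already been computed, and that each undirected edge is counted once in the density ratio |E| / (|V|(|V|-1)), so a complete graph has density 0.5 under this reading. The function names and example values are hypothetical.

```python
def is_connected(nodes, edges):
    """Depth-first search over undirected edges restricted to `nodes`."""
    nodes = set(nodes)
    if not nodes:
        return False
    adjacency = {v: set() for v in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    start = next(iter(nodes))
    seen = {start}
    stack = [start]
    while stack:
        v = stack.pop()
        for w in adjacency[v] - seen:
            seen.add(w)
            stack.append(w)
    return seen == nodes


def is_d_core(nodes, edges, cohesions, thresholds, d, alpha):
    """True if the candidate has exactly d defining features, is connected,
    and is dense enough. Assumption: each undirected edge is counted once."""
    n = len(nodes)
    if n < 2:
        return False
    num_defining = sum(1 for c, t in zip(cohesions, thresholds) if c >= t)
    density = len(edges) / (n * (n - 1))
    return num_defining == d and is_connected(nodes, edges) and density >= alpha


nodes = [0, 1, 2]
cohesions = [0.95, 0.30, 0.92]     # hypothetical per-feature cohesions
thresholds = [0.90, 0.90, 0.90]    # hypothetical per-feature thresholds
chain = is_d_core(nodes, {(0, 1), (1, 2)}, cohesions, thresholds, d=2, alpha=0.3)
split = is_d_core(nodes, {(0, 1)}, cohesions, thresholds, d=2, alpha=0.1)
```

The chain 0-1-2 passes all three checks, whereas the second candidate fails because crime 2 is not linked to the others.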

We ﬁnd cores G=(V,E) using an optimization method

to maximize the pattern-general cohesion while satisfy-

ing the core constraints.

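$$\begin{aligned}
\underset{G}{\text{maximize}} \quad & \sum_{j \in J} \lambda_j \mathrm{Cohesion}_j(G) \\
\text{subject to} \quad & |V| = n, \\
& \text{core constraints:}
\begin{cases}
|\Lambda(G)| = d, \\
G \text{ is connected}, \\
\dfrac{|E|}{|V|(|V|-1)} \ge \alpha.
\end{cases}
\end{aligned} \tag{12}$$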

We propose a binary integer linear formulation for the

optimization problem (12). Let $m = |V|$, the number of crimes in the database. Let $n$ be the size of the pattern we want to discover, and we loop through possible values of $n$, re-solving each time. We define an $m \times m$ similarity matrix for each feature, $S_1, \dots, S_J$, with elements $S_j(\ell, k) = s_j(v_\ell, v_k)$, which are precomputed. Let $X$ be the matrix of binary decision variables defining the core. $X(\ell, k)$ is 1 if a pair of crimes $\ell$ and $k$ are in core $G$, which is the same as $X(k, \ell)$, so matrix $X$ is symmetric. On the diagonal, $X(\ell, \ell)$ represents whether crime $\ell$ is in core $G$. We use $d_1, \dots, d_J \in \{0, 1\}$ to indicate whether feature $j$ is a defining feature, or equivalently, whether $\mathrm{Cohesion}_j(G) \ge \theta_j$. Let $E$ be the adjacency matrix for the (pattern-general) similarity graph, where $E(\ell, k) = 1$ if $\{v_\ell, v_k\} \in E$, and $E(\ell, k) = 0$ otherwise. Since the graph is undirected, $E$ is symmetric. We set $E(\ell, \ell) = 1$ for computational simplicity. $(E \circ X)$ is the Hadamard product of matrix $E$ and $X$, where $(E \circ X)(\ell, k) = E(\ell, k) \cdot X(\ell, k)$. $M$ is a large auxiliary parameter for formulating the problem with a big-M formulation, and $\epsilon$ is small. With this notation, the optimization problem (12) can be reformulated:

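$$\max_{X,\, \{d_j\}_j} \;\; \sum_{j} \lambda_j \sum_{\ell, k} (S_j \circ X)(\ell, k) \quad \text{s.t.}$$

\begin{align}
\sum_{\ell} X(\ell, \ell) &= n \tag{13} \\
\frac{1}{n(n-1)} \sum_{\ell, k} (S_j \circ X)(\ell, k) - M d_j &\le \theta_j - \epsilon && \forall j \tag{14} \\
\frac{1}{n(n-1)} \sum_{\ell, k} (S_j \circ X)(\ell, k) - M d_j &\ge \theta_j - M && \forall j \tag{15} \\
\sum_{j \in J} d_j &= d \tag{16} \\
X(\ell, k) &= X(k, \ell) && \forall \ell, k \tag{17} \\
X(\ell, k) &\le X(k, k) && \forall \ell, k \tag{18} \\
X(\ell, \ell) + X(k, k) &\le X(\ell, k) + 1 && \forall \ell, k \tag{19} \\
(\hat{E}_{n-1} \circ X)(\ell, k) &\ge X(\ell, k) && \forall \ell, k \tag{20} \\
X(\ell, k),\ d_j &\in \{0, 1\} && \forall \ell, k, j \tag{21}
\end{align}

where matrix $\hat{E}_{n-1}$ is defined just below.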

Let us derive the objective. Since $X(\ell, k) = 1$ if and only if both crimes $\ell$ and $k$ are in the core (that is, they are in graph $G$ that we discover), we have the following:

$$\frac{1}{n(n-1)} \sum_{\ell, k} (S_j \circ X)(\ell, k) = \mathrm{Cohesion}_j(G). \tag{22}$$
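A small numeric check of identity (22) is sketched below in Python. It assumes, as a convention not stated in the text, that the diagonal of the per-feature similarity matrix is zero so that only pairs of distinct crimes contribute to the sum. The matrix values and the chosen core are hypothetical.

```python
def cohesion_from_matrices(S_j, X, n):
    """Sum of the elementwise product of S_j and X, divided by n(n-1)."""
    m = len(X)
    total = sum(S_j[l][k] * X[l][k] for l in range(m) for k in range(m))
    return total / (n * (n - 1))


def membership_matrix(members, m):
    """X(l, k) = 1 exactly when both l and k belong to the core."""
    return [[1 if l in members and k in members else 0 for k in range(m)]
            for l in range(m)]


# Hypothetical symmetric similarity matrix for one feature.
# Assumption: zero diagonal, so a crime is not compared with itself.
S_j = [
    [0.0, 0.8, 0.1, 0.6],
    [0.8, 0.0, 0.3, 0.4],
    [0.1, 0.3, 0.0, 0.9],
    [0.6, 0.4, 0.9, 0.0],
]
core = {0, 1, 3}
X = membership_matrix(core, m=4)
matrix_value = cohesion_from_matrices(S_j, X, n=len(core))

# Direct mean over the three unordered pairs inside the core.
direct_value = (S_j[0][1] + S_j[0][3] + S_j[1][3]) / 3
```

Both computations give the same number because each unordered pair appears twice in the double sum and the normalizer n(n-1) counts ordered pairs.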

The objective is the pattern-general cohesion. Equation

(13) ensures that the cores we discover are of size n.

Constraints (14), (15), and (16) are the pattern-

specific constraints. Constraint (14) forces $d_j = 1$ when $\mathrm{Cohesion}_j(G) \ge \theta_j$, where the strict inequality is enforced by $\epsilon$. Constraint (15) forces $d_j = 0$ when $\mathrm{Cohesion}_j(G) < \theta_j$. Constraint (16) specifies the number of defining features as $d$. The symmetry of $X$ is enforced by (17).


Constraints (18) and (19) imply $X(\ell,k) = 1$ iff both $X(\ell,\ell)$ and $X(k,k)$ are 1, and 0 otherwise.
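The effect of constraints (17) to (19) can be checked by brute force over the binary values involved. The sketch below is only an illustration; the function name `feasible_pair_values` is hypothetical and not part of the paper's implementation. It lists which values of the off-diagonal entry survive the constraints for each choice of the two diagonal entries.

```python
from itertools import product


def feasible_pair_values(x_ll: int, x_kk: int) -> list[int]:
    """Return every binary X(l,k) allowed by (18) and (19) for given diagonals."""
    allowed = []
    for x_lk in (0, 1):
        # (18) applied in both index orders, which symmetry (17) permits
        upper_ok = x_lk <= x_kk and x_lk <= x_ll
        # (19): X(l,l) + X(k,k) <= X(l,k) + 1
        lower_ok = x_ll + x_kk <= x_lk + 1
        if upper_ok and lower_ok:
            allowed.append(x_lk)
    return allowed


for x_ll, x_kk in product((0, 1), repeat=2):
    print(x_ll, x_kk, feasible_pair_values(x_ll, x_kk))
```

In every case exactly one value remains, and it equals the product of the two diagonal entries, which is the logical AND that the text describes.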

Expression (20) is a pattern-general constraint. Our formulation does not enforce the core to be connected, but it does enforce something weaker, namely that each node in a core of size $n$ is at most $n-1$ steps along the similarity graph from any other node in the core. We handle the connectivity afterward by examining the result to ensure that it is connected and labeling it as infeasible if not. To handle the constraint that each node in the core is at most distance $n-1$ from every other node, we recall that in graph theory, if node $v_k$ is reachable from node $v_\ell$ in exactly $q$ steps, $E^q(k,\ell) > 0$. If node $v_k$ is reachable from node $v_\ell$ in at most $n-1$ steps, it means that at least one of $E^q(k,\ell) > 0$ for $q \le n-1$. We define the following matrix $\hat{E}_{n-1}$, where an element $\hat{E}_{n-1}(k,\ell)$ indicates if node $k$ and node $\ell$ are distance at most $n-1$ steps along the graph.

$$\hat{E}_{n-1}(\ell,k) = \begin{cases} 1 & (E + E^2 + \cdots + E^{n-1})(\ell,k) > 0 \\ 0 & \text{otherwise.} \end{cases}$$

Thus, (20) forces that if $v_\ell$ and $v_k$ are both in the pattern, they must be at most distance $n-1$ along the graph.
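The matrix $\hat{E}_{n-1}$ can be precomputed from the adjacency matrix by summing matrix powers and thresholding. The following is a minimal pure-Python sketch under the assumption that the graph is small enough for dense lists; the function name `reachability_matrix` is hypothetical.

```python
def reachability_matrix(E: list[list[int]], n: int) -> list[list[int]]:
    """Return E_hat_{n-1}: 1 where (E + E^2 + ... + E^(n-1))(l,k) > 0, else 0."""
    m = len(E)
    power = [row[:] for row in E]  # current power of E, starting at E^1
    total = [row[:] for row in E]  # running sum of the powers
    for _ in range(n - 2):         # adds E^2, ..., E^(n-1)
        power = [
            [sum(power[i][t] * E[t][j] for t in range(m)) for j in range(m)]
            for i in range(m)
        ]
        total = [[total[i][j] + power[i][j] for j in range(m)] for i in range(m)]
    return [[1 if total[i][j] > 0 else 0 for j in range(m)] for i in range(m)]


# Path graph 0-1-2-3 with E(l,l) = 1 on the diagonal, as in the text.
path = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
print(reachability_matrix(path, 3))  # nodes 0 and 3 are three steps apart
```

For a core of size 3 on this path graph, the entry for the pair (0, 3) is 0, so constraint (20) would prevent crimes 0 and 3 from appearing in the same core.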

The pattern-general density constraint is handled

similarly to the connectivity constraint, where feasibil-

ity is checked for each solution, and infeasible solutions

are removed.

The integer program finds one solution at a time. In order to avoid finding a solution that was found previously, we introduce a constraint for each previous solution found. Suppose in the $t$-th run, the crimes in the solution are $Q_t$, that is, $X(k,k) = 1$ if $k \in Q_t$, and $X(k,k) = 0$ otherwise. The constraint we add before running the $(t+1)$-th time is

$$\sum_{k \in Q_t} X(k,k) \le n - 1. \tag{23}$$

This constraint will exclude the current solution

from the feasible region, and we will obtain a different

solution in the next run if any feasible solutions remain.

Since all of the matrices are symmetric, in practice we

keep only the upper (or lower) triangle, including the

diagonal, to compute all of the sums.
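The repeated-solve loop with exclusion constraint (23) can be sketched as follows. Brute-force enumeration stands in for the ILP solver here, which is an assumption made purely for illustration; `enumerate_cores`, `score`, and `is_feasible` are hypothetical names, not the paper's code.

```python
from itertools import combinations


def enumerate_cores(score, is_feasible, m: int, n: int, max_runs: int = 100):
    """Repeatedly find the best size-n subset, excluding earlier solutions.

    Brute force replaces the ILP solver (illustrative assumption).
    """
    cuts = []        # one Q_t per previous run
    solutions = []
    for _ in range(max_runs):
        best = None
        for subset in combinations(range(m), n):
            chosen = frozenset(subset)
            # Constraint (23): sum over k in Q_t of X(k,k) <= n - 1
            if any(len(chosen & q) > n - 1 for q in cuts):
                continue
            if not is_feasible(chosen):
                continue
            if best is None or score(chosen) > score(best):
                best = chosen
        if best is None:  # no feasible solutions remain
            break
        solutions.append(best)
        cuts.append(best)
    return solutions


# Toy run: 4 crimes, cores of size 2, score is the sum of the indices.
found = enumerate_cores(sum, lambda s: True, m=4, n=2)
print(found)
```

Each run returns a subset that differs from all earlier ones, and the loop stops once the cuts have removed every feasible subset.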

Merging Cores

By our main hypothesis, the vast majority of crime se-

ries contain a core. This means that by ﬁnding all cores,

we would have located the vast majority of all crime se-

ries. We now grow the rest of the crime series from the

cores by merging them together. One advantage of

merging is that it allows the pattern to dynamically

change, as the deﬁning features from the merged

cores are not always equal to each other. Consider a

burglar’s M.O. with a shifting means of entry. At ﬁrst

he enters through unlocked doors, then he starts to

use bodily force to open doors, and later he learns to

use a screwdriver to pry the door open. His full pattern

thus consists of several smaller d-cores. This suggests a

more ﬂexible deﬁnition of pattern than a simple core.

We provide such a deﬁnition below.

Definition 7. (d-serieslike pattern) A graph $G = (V, E)$ is called a d-serieslike pattern with defining feature set $P(G)$ of size $d$ if it satisfies:

Pattern general constraint: $G$ is connected.

Pattern specific constraint: Each node $u$ in $G$ is contained in at least one subgraph $G' \subseteq G$ that is a $d'$-core with defining features that include set $P(G)$, that is, $P(G) \subseteq L(G')$, $d \le d'$.

Note that if a graph is a d-serieslike pattern, it is also a $(d-1)$-serieslike pattern, a $(d-2)$-serieslike pattern, and so on. The pattern specific constraints for d-serieslike

patterns are looser than those for d-cores. A d-serieslike

pattern may not be a d-core. On the other hand, a d-

core is a special case of a d-serieslike pattern. Before

we proceed, we must ensure that merging is justiﬁed.

Theorem 1. (The set of serieslike patterns is closed under merging.) Suppose $G_1$ is a $d_1$-serieslike pattern and $G_2$ is a $d_2$-serieslike pattern. If $G_1 \cap G_2 \neq \emptyset$, then $\hat{G} = G_1 \cup G_2$ is a d-serieslike pattern, with defining features $P(G_1) \cap P(G_2)$, $d = |P(G_1) \cap P(G_2)|$.

Proof. First, since $G_1$ and $G_2$ are connected and $G_1 \cap G_2 \neq \emptyset$, the union of them is also connected, that is, $\hat{G}$ satisfies the pattern general constraint. Then for all nodes $u \in G_1$, $\exists$ a $d_1$-core $G^1_u$ such that $u \in G^1_u$ and $P(G_1) \cap P(G_2) \subseteq P(G_1) \subseteq L(G^1_u)$, and for all nodes $u \in G_2$, $\exists$ a $d_2$-core $G^2_u$ such that $u \in G^2_u$ and $P(G_1) \cap P(G_2) \subseteq P(G_2) \subseteq L(G^2_u)$. So, either way, the defining feature set includes $P(G_1) \cap P(G_2)$. This means that for any node $\hat{u} \in \hat{G}$, $\exists$ a core $\hat{G}_u$ such that $P(G_1) \cap P(G_2) \subseteq L(\hat{G}_u)$. $\square$

This leads directly to the following:

Corollary 1. Suppose $G_1$ is a $d_1$-serieslike pattern, $G_2$ is a $d_2$-serieslike pattern, ..., $G_n$ is a $d_n$-serieslike pattern, and $G_1 \cup \ldots \cup G_n$ is connected,

$$|P(G_1) \cap P(G_2) \cap \ldots \cap P(G_n)| = d.$$

Then $G_1 \cup \ldots \cup G_n$ is a d-serieslike pattern.

These properties lead to the following breadth ﬁrst

search algorithm for mining d-serieslike patterns. We

start with the cores that we found using the integer

program. These cores are candidates for merging. We

also maintain an active pattern set that contains the d-

serieslike patterns that we are not done constructing.

We keep the d-serieslike patterns that we are done con-

structing in a maximal pattern set. To start, the active

pattern set contains all of the d-cores we found. For

each active pattern, we iterate through the candidates

to see if they meet the merging criteria provided just

below. If a merge is possible, and if the merged set

had not been previously created, we append the merged

pattern to the active pattern set and continue iterating

through the candidates. If there are no candidates that

can be merged with the active pattern at all, then the

active pattern is maximal, and it is placed in the maximal pattern list. The merging criteria for $G_1 \cup G_2$ to form a d-serieslike pattern $\hat{G}$ are:

$G_1 \cap G_2 \neq \emptyset$

$|P(G_1) \cap P(G_2)| \ge d$.

The merging algorithm is formulated in Algorithm 1.

Algorithm 1: Merging Cores

    INPUT: $d$, cores, each with $\ge d$ defining features
    candidate list $\leftarrow$ cores
    active set $\leftarrow$ cores, each with defining features
    maximal set $\leftarrow \emptyset$
    while active set $\neq \emptyset$ do
        $G_{\text{current}} \leftarrow$ any element in active set
        $P_{\text{current}} \leftarrow$ defining features from $G_{\text{current}}$
        isMaximal $\leftarrow$ TRUE
        for all $G_j \in$ candidates, $G_j \not\subseteq G_{\text{current}}$ do
            if $G_{\text{current}} \cap G_j \neq \emptyset$, $|P(G_{\text{current}}) \cap L(G_j)| \ge d$ then
                $\hat{G} \leftarrow G_{\text{current}} \cup G_j$
                $P(\hat{G}) \leftarrow P(G_{\text{current}}) \cap L(G_j)$
                if $\hat{G}$ does not exist in active set or maximal set then
                    isMaximal $\leftarrow$ FALSE
                    append $\hat{G}$ to active set
                end if
            end if
        end for
        if isMaximal == TRUE then
            remove $G_{\text{current}}$ from active set, put into maximal set
        end if
    end while
    OUTPUT: maximal set
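A runnable sketch of the merging procedure is given below. It is an interpretation rather than the authors' implementation, and it makes three labeled assumptions: a pattern is represented as a pair of frozensets (crimes, defining features); the current pattern is always removed from the active queue once processed, so the loop terminates; and a pattern counts as maximal only when no candidate at all can be merged with it, following the prose description. The name `merge_cores` is hypothetical.

```python
def merge_cores(cores, d):
    """Breadth-first merging of cores into maximal d-serieslike patterns.

    cores: list of (crimes, features) pairs, both frozensets.
    """
    candidates = list(cores)
    active = list(dict.fromkeys(cores))  # queue, deduplicated, order preserved
    seen = set(active)
    maximal = []
    while active:
        crimes, feats = active.pop(0)    # assumption: always dequeue the current pattern
        is_maximal = True
        for c_crimes, c_feats in candidates:
            if c_crimes <= crimes:       # candidate adds no new crime
                continue
            shared = feats & c_feats
            # Merging criteria: overlapping crimes and at least d shared features
            if crimes & c_crimes and len(shared) >= d:
                is_maximal = False
                merged = (crimes | c_crimes, shared)
                if merged not in seen:   # skip patterns created earlier
                    seen.add(merged)
                    active.append(merged)
        if is_maximal:
            maximal.append((crimes, feats))
    return maximal


core_a = (frozenset({1, 2, 3}), frozenset({"geo", "days", "entry"}))
core_b = (frozenset({3, 4, 5}), frozenset({"geo", "days", "time"}))
core_c = (frozenset({7, 8, 9}), frozenset({"geo", "days", "entry"}))
patterns = merge_cores([core_a, core_b, core_c], d=2)
print(patterns)
```

In this toy input, cores A and B share crime 3 and two features, so they merge into one pattern over crimes 1 to 5, while core C shares no crime with the others and is returned unchanged.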

Experiments

Our data set was provided by the Crime Analysis Unit

of the Cambridge Police Department in Massachusetts.

It has 7,067 housebreaks that happened in Cambridge

between 1997 and 2011, containing 51 hand-curated

patterns contained within the 4,864 crimes between

1997 and 2006. (Patterns from 2007 to 2012 were not

assembled at the time of writing.) Crime attributes

include geographic location, date, day of week, time

frame, location of entry, means of entry, an indicator

for ‘‘ransacked,’’ type of premise, an indicator for

whether residents were present, and suspect and victim

information. The 51 crime series identiﬁed by police

contain an average of 12.1 crimes each, with the largest

series containing 59 crimes, and the smallest series con-

taining 2 crimes. These crimes span an average period

of 42 days, with the shortest series taking place within 1

day and the longest series taking 451 days. Data were

processed using the similarity functions $\{s_j\}_j$ discussed in our previous work,$^{24,25}$ where each pairwise feature is mapped into a number between 0 and 1. These similarity measures are p-values, and they consider the

baseline frequency of each possible outcome for the

categorical variables. For instance, most crimes are committed when residents are not present. If two crimes were committed where one had residents present and the other had residents that were not present, then the similarity score for "residents present" is zero. If both crimes were committed when the residents were not present, the similarity score for "residents present" would not be high (in particular, it is $1 - p^2_{\text{not present}}$, where $p_{\text{not present}}$ is the proportion of crimes in the database where residents were not present), whereas if two crimes were committed with residents present, the similarity would be much higher ($1 - p^2_{\text{present}}$).

The similarity score for time frames is complicated be-

cause it takes into account the distribution of times

when crimes are more frequently committed. We took

the 51 hand-curated patterns and divided them ran-

domly into four subsets (folds) with sizes 12 or 13 pat-

terns each. We used three of the four folds to learn the

pattern-general weights and tested on the remaining

fold for the experiments discussed below.
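The categorical similarity described above (zero on a mismatch, $1 - p^2$ on a match) can be sketched as follows. The function names and the toy proportions (90% "not in", 10% "in") are hypothetical and chosen only to show the asymmetry; they are not values from the Cambridge data.

```python
from collections import Counter


def baseline_frequencies(values: list[str]) -> dict[str, float]:
    """Proportion of each categorical outcome in the database."""
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


def categorical_similarity(a: str, b: str, frequencies: dict[str, float]) -> float:
    """Return 0 when the values differ, otherwise 1 - p^2 for the shared value."""
    if a != b:
        return 0.0
    p = frequencies[a]  # baseline proportion of the shared value
    return 1.0 - p * p


# Hypothetical toy database: residents absent in 9 of 10 crimes.
freq = baseline_frequencies(["not in"] * 9 + ["in"])
print(categorical_similarity("not in", "not in", freq))
print(categorical_similarity("in", "in", freq))
print(categorical_similarity("in", "not in", freq))
```

Two crimes that share the rare outcome score close to 1, while two crimes that share the common outcome score much lower, which matches the behavior described in the text.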

Baselines

As this problem is fundamentally a clustering prob-

lem, we compare with several varieties of hierarchi-

cal agglomerative clustering and incremental nearest

neighbor approaches. For these baselines, we use sev-

eral different schemes to iteratively add discovered

crimes, starting from pairs of nodes with high similarity $\gamma$, which is a weighted sum of the attribute similarities:

$$\gamma(C_i, C_k) = \sum_{j=1}^{J} \hat{\lambda}_j s_j(C_i, C_k).$$

Unlike our method where the weights are learned, the weights $\hat{\lambda}$ for the baselines were provided by crime analysts based on domain expertise, similar to several other works.$^{10,11}$

Hierarchical agglomerative clustering (HAC) begins

with each crime as a singleton cluster, and iteratively

merges the clusters based on the similarity measure be-

tween clusters. Nearest neighbor classiﬁcation (NN)

ﬁrst selects pairs of crimes with high similarity and

then iteratively grows a cluster by adding the nearest

neighbor crime to the cluster.

HAC and NN were used with three different crite-

ria for cluster–cluster or cluster–crime similarity:

single linkage (SL), which considers the most similar

pair of crimes; complete linkage (CL), which consid-

ers the most dissimilar pair of crimes; and group

average (GA), which uses the averaged pairwise similarity.$^{51}$ When the nearest neighbor algorithm is used with the $S_{GA}$ measure defined below with weights provided by crime analysts, it is similar to the Bayesian Sets algorithm and how it is used for set expansion.$^{52,53}$

$$\begin{aligned}
S_{SL}(G_1, G_2) &:= \max_{v_k \in G_1, \, v_\ell \in G_2} \gamma(v_k, v_\ell) \\
S_{CL}(G_1, G_2) &:= \min_{v_k \in G_1, \, v_\ell \in G_2} \gamma(v_k, v_\ell) \\
S_{GA}(G_1, G_2) &:= \frac{1}{|G_1||G_2|} \sum_{v_k \in G_1} \sum_{v_\ell \in G_2} \gamma(v_k, v_\ell).
\end{aligned} \tag{24}$$
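The three cluster similarity criteria in (24) translate directly into code. In the sketch below, `gamma` is any pairwise similarity function supplied by the caller; the toy `gamma` used in the example is hypothetical and stands in for the analyst-weighted sum of attribute similarities.

```python
from itertools import product


def single_linkage(g1, g2, gamma):
    """S_SL: similarity of the most similar cross-cluster pair."""
    return max(gamma(a, b) for a, b in product(g1, g2))


def complete_linkage(g1, g2, gamma):
    """S_CL: similarity of the most dissimilar cross-cluster pair."""
    return min(gamma(a, b) for a, b in product(g1, g2))


def group_average(g1, g2, gamma):
    """S_GA: mean similarity over all cross-cluster pairs."""
    return sum(gamma(a, b) for a, b in product(g1, g2)) / (len(g1) * len(g2))


def toy_gamma(a: int, b: int) -> float:
    """Hypothetical similarity that decays with index distance."""
    return 1.0 / (1.0 + abs(a - b))


cluster_1, cluster_2 = [0, 1], [3, 5]
print(single_linkage(cluster_1, cluster_2, toy_gamma))
print(complete_linkage(cluster_1, cluster_2, toy_gamma))
print(group_average(cluster_1, cluster_2, toy_gamma))
```

The same three functions serve both baselines: HAC applies them to pairs of clusters, and the nearest neighbor method applies them to a cluster and a single-crime list.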

Evaluation metrics

We evaluate performance at two levels: pattern-level and object-level.

Pattern-level precision and recall

We evaluate the quality of the core detector using

pattern-level precision and recall. The d-cores are

smaller as dbecomes larger. The cores are used for

discovering larger merged patterns. Thus we evaluate

the accuracy of the core ﬁnder in its detection ability;

if a real pattern is missed completely by our core de-

tector, there is no way to recover from this in order

to detect it. If a core covers more than one pattern,

this is also a bad seed for further mining, since it

would generate misleading deﬁning features that do

not characterize any real patterns. Thus, we call

cores that cover one and only one real pattern good

cores. The pattern-level precision and recall are both

deﬁned using good cores. Ndenotes the number of

cores we discover.

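$$\text{P-Precision (cores)} = \frac{\sum_{i=1}^{N} 1(\text{core } i \text{ is good})}{N} \tag{25}$$

$$\text{P-Recall (cores)} = \frac{\sum_{i=1}^{N} 1(\text{core } i \text{ is good})}{|P|}. \tag{26}$$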

Note that pattern-level precision should be large, as

each real pattern should contain many cores, inﬂating

the reported precision values.

Object-level precision and recall

We evaluate the full pipeline for generating serieslike

patterns using object-level precision and recall. To do

this, for each pattern discovered, we determine how

close it is to one of the real patterns. If the discovered

pattern overlaps only one real pattern, then we call

this the dominating pattern and evaluate precision

and recall with respect to crimes in that pattern. If

the serieslike pattern overlaps more than one real pat-

tern, we assign the dominating pattern to be the real

pattern possessing the most crimes that overlap with

our discovered pattern. Note that it is possible for the

recall not to grow with the size of the discovered pat-

tern, as the dominating real pattern could change as

the discovered pattern grows larger. The deﬁnitions

of object-level precision and recall for a d-serieslike

pattern G=(V,E) are as follows:

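$$\text{O-Precision}(G) = \frac{\sum_{\ell=1}^{|V|} 1(\ell \in \text{dominating pattern})}{|V|} \tag{27}$$

$$\text{O-Recall}(G) = \frac{\sum_{\ell=1}^{|V|} 1(\ell \in \text{dominating pattern})}{|V_{\text{dominating pattern}}|} \tag{28}$$

where $|V_{\text{dominating pattern}}|$ is the number of crimes in the dominating real pattern.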
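Equations (27) and (28), together with the rule for choosing the dominating pattern, can be sketched as below. Two details are assumptions of this sketch rather than statements from the paper: ties between real patterns are broken by taking the first one, and a discovered pattern that overlaps no real pattern returns `None`. The function name is hypothetical.

```python
def object_level_scores(discovered: set[int], real_patterns: list[set[int]]):
    """Return (precision, recall) against the dominating real pattern."""
    overlaps = [len(discovered & pattern) for pattern in real_patterns]
    if not overlaps or max(overlaps) == 0:
        return None  # assumption: undefined when nothing overlaps
    # Dominating pattern: the real pattern sharing the most crimes (first on ties)
    dominating = real_patterns[overlaps.index(max(overlaps))]
    hits = len(discovered & dominating)
    return hits / len(discovered), hits / len(dominating)


scores = object_level_scores({1, 2, 3, 4}, [{1, 2, 3, 10, 11}, {4, 20}])
print(scores)
```

Here the discovered pattern shares three crimes with the first real pattern and one with the second, so the first dominates, giving precision 3/4 and recall 3/5.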

Computational gain from similarity graph

The ﬁrst step in our method is to learn the similarity

graph. The similarity graph provides a computational

gain in that it creates constraints on possible cores, re-

ducing the feasibility region of the ILP. For this similar-

ity graph, recall that we desire crimes in the same real

pattern to be connected to each other. We call edges

connecting crimes that belong to the same pattern

good edges. If we have constructed the similarity graph

well, the similarity graph should have a higher percent-

age of good edges than if we had simply used the full

graph consisting of all possible edges. If we remove a

few good edges in the process, this is not problematic


as long as the true patterns are still connected in the sim-

ilarity graph—this will be assessed when we assess the

quality of the cores and the full pipeline next.

Table 1 shows the percentages of good edges in both

the similarity graph and the full graph for four test

folds (the data were divided into four folds, and each

was used in turn as the test fold). The learning method

tends to reduce the number of unnecessary edges by a

factor of 7 or 8 in each of the test folds, as shown in the

third column (which is the ﬁrst column divided by the

second column). This reduction substantially reduces

computation for the core ﬁnder.

Pattern-general weights

The pattern-general weights come from the learning

step for the similarity graph. In Figure 2 we report

the mean over the test folds of the pattern-general

weights we discovered. The highest weights are similar-

ity in distance, number of days apart, suspect informa-

tion, whether residents are present, and means of entry

(e.g., pried, forced, cut screen).

Time windowing

Consider solving the ILP for ﬁnding cores on data from

4,864 housebreaks. Note that if we were to search for patterns of size 10 among 1,000 crimes, this would mean investigating $\binom{1000}{10} \approx 2.634 \times 10^{23}$ possible subsets. Further, CPLEX would need to handle $1{,}000^2$ constraints (18) and (19) of the optimization problem.

Luckily it is unlikely that a crime series would possess

a core of size 10, but still we need to ﬁnd ways to reduce

computation.

Because the pattern-general weight on closeness in

time is so high, we determined that we would be un-

likely to miss true cores if we considered windows of

time that include at least 200 crimes. We thus indexed

the housebreak records in chronological order and

created overlapping windowed blocks of 200 crimes

each, where neighboring blocks have an overlap of

100 crimes. Therefore we solve ILPs among crime subsets $\{1, \ldots, 200\}, \{100, \ldots, 300\}, \ldots, \{4700, \ldots, 4864\}$. In each subset, we input the number of defining

features dand core size n, and then iteratively run the

ILP to get all feasible solutions by adding the constraint

(23) after each iteration to avoid returning repeated

solutions.
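The windowing scheme and the subset count quoted above can be reproduced with a few lines. The sketch uses 0-based half-open index ranges, which is a convention chosen here for illustration and differs slightly from the 1-based sets written in the text; `time_windows` is a hypothetical name.

```python
import math


def time_windows(total: int, size: int = 200, overlap: int = 100) -> list[tuple[int, int]]:
    """Half-open index ranges [start, end) over chronologically sorted crimes."""
    step = size - overlap
    windows = []
    for start in range(0, total, step):
        end = min(start + size, total)
        windows.append((start, end))
        if end == total:  # the last block reaches the final crime
            break
    return windows


windows = time_windows(4864)
print(len(windows), windows[0], windows[-1])

# Number of size-10 subsets among 1,000 crimes, as quoted in the text.
print(f"{math.comb(1000, 10):.3e}")
```

Each window is then passed to the core finder separately, so the ILP never sees more than 200 crimes at once.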

Evaluation of mining cores

We chose performance evaluation metrics from informa-

tion retrieval, and for some of these metrics, we need to

rank the discovered cores by a scoring function. This scor-

ing function represents how certain we are that these

cores are real. We use a scoring function that is a weighted

version of pattern-general cohesion and the (pattern-

speciﬁc) number of deﬁning features, as we desire cores

that are both tight in the pattern-general sense and in

the pattern-speciﬁc sense. Here is the score function:

$$\text{Score}(G) = \sum_{j=1}^{J} \lambda_j \text{Cohesion}_j(G) + \frac{1}{6} d. \tag{29}$$

Series usually have about six deﬁning features, so the

choice of 1/6 tends to balance the two terms, weighing

the pattern-speciﬁc term slightly higher. We ordered

the discovered cores in decreasing order of the scores.

Note that the ﬁrst evaluation below does not require

these scores, but the second and third do.
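The scoring function (29) and the ranking step can be sketched as follows. The weights, cohesion values, and core labels in the example are hypothetical placeholders, not learned values from the paper.

```python
def core_score(weights: list[float], cohesions: list[float], num_defining: int) -> float:
    """Score(G): weighted pattern-general cohesion plus d / 6."""
    general = sum(w * c for w, c in zip(weights, cohesions))
    return general + num_defining / 6.0


# Hypothetical cores: (label, per-feature cohesion, number of defining features d)
toy_weights = [0.5, 0.5]
toy_cores = [
    ("core A", [0.8, 0.6], 6),
    ("core B", [0.9, 0.9], 7),
    ("core C", [0.4, 0.5], 8),
]
ranked = sorted(
    toy_cores,
    key=lambda core: core_score(toy_weights, core[1], core[2]),
    reverse=True,
)
print([label for label, _, _ in ranked])
```

Sorting in decreasing order of the score gives the ranked list of cores that the second and third evaluations consume.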

Table 1. Test Results of Weights Learning Algorithm

| Test fold | Good edges % in similarity graphs | Good edges % in complete graphs | Reduction factor | Training time |
|---|---|---|---|---|
| 1 | 1.84 | 0.26 | 7.1 | $3.64 \times 10^{3}$ s |
| 2 | 2.42 | 0.33 | 7.3 | $9.58 \times 10^{3}$ s |
| 3 | 3.34 | 0.42 | 8.0 | $6.20 \times 10^{3}$ s |
| 4 | 2.95 | 0.34 | 8.7 | $5.41 \times 10^{3}$ s |

FIG. 2. Pattern-general weights.

1) Cores with different d. We expect that discovered patterns with more defining features are more likely to be true crime series. Figure 3 shows how the precision increases with the number of defining features d. In this figure, we consider only cores of the same size (three crimes). There are no overlapping cores between the three bars, as each core is used once with its exact number of defining features d, which is 6, 7, or 8. The number of discovered cores for d = 6 is 1072, for d = 7 is 215, and for d = 8 is 36. There were too few cores with more than eight defining features to reliably calculate precision. The reported numbers of cores are totals from all test folds.

2) Cores with different size n. We used the scoring function (29) as a filter to pick the best

1,000 cores, from each of the sizes 3, 4, and 5,

and discarded the other discovered cores. Larger

cores have higher chances of hitting a pattern,

since there are more crimes in the core; however,

they also have higher chances of hitting more

than one real pattern. As shown in Figure 4,

cores of size 4 have a much higher pattern-level

precision than cores of size 3, but cores of size

5 do not have noticeable gains over size 4. This

is because the increased probability of hitting a

real pattern cancels with the increased probabil-

ity of hitting more than one real pattern.

3) P-precision–P-recall curve. We generated a full

list of cores of size 3 with dbetween 6 and 8, and

ranked the cores according to their scores. As we

moved down the list, we evaluated pattern-level

precision and recall at each step. We also did

the same procedure with the baseline iterative

nearest neighbor method used for generating

cores of size 3, using all of the similarity measures

in (24). (Note that for HAC we cannot control

the size of cores for evaluation.) The precision-

recall curves for all four methods averaged over

the test folds are plotted in Figure 5. It is clear

that our core ﬁnder is substantially better than

the baselines, though that is not surprising

given that it searches globally for the best cores.

Evaluation for mining serieslike patterns

We evaluated the quality of our full pipeline and the

baseline methods as follows. After all the serieslike pat-

terns were discovered, we evaluated the average object-

level precision and recall for all the patterns and over all

the test folds, plotted as a point on Figure 6. For HAC,

we simply iterated it, stopping at a threshold where re-

call was approximately equivalent to our method, and

again reported average object-level precision and recall

on Figure 6. For the nearest neighbor method, after

each element was added to a growing pattern, we eval-

uated precision and recall to trace out a precision-recall

curve. All three metrics in (24) were used for HAC and

nearest neighbors. We note that for the same level of

recall, the precision attained by our method was quite

a bit higher than that of other methods.

There is still a lot of room for improvement. Currently, with precision on the order of 53%, we capture approximately 18% of the crimes identified by analysts. That is, when our method returns a crime, it is a real crime in a series about 53% of the time. We are returning about one-fifth of the crimes at these settings, so our method currently is conservative: it will not claim that a crime is in a series unless it is reasonably certain. This can help ensure that analysts do not overlook crimes that are very likely to be in a particular series. Note that our ground truth consists of crime series that are hand-labeled by analysts, and these labels are not perfect. For instance, it is entirely possible that we return crimes that are in a series that the police had not previously considered. We might opt to change the settings so that the method returns more crimes as being potentially part of the series (giving lower precision but higher recall). This might be changed by making the core finder less conservative, finding a way to incorporate crimes that are not in cores, or possibly by working with domain experts to improve the database used for evaluation. We note that the baseline methods work surprisingly well and are themselves reasonable options.

FIG. 3. Pattern-level average precision of d-cores of different sizes. (x-axis: number of defining features d, 6 to 8; y-axis: pattern-level precision.)

FIG. 4. Pattern-level precision of cores with different sizes. (x-axis: size of core sets, 3 to 5; y-axis: precision for best 1000 core sets.)

FIG. 5. Average pattern-level precision vs. recall. (x-axis: pattern-level recall; y-axis: pattern-level precision; curves: NN SL, NN CL, NN GA, Core Sets.)

FIG. 6. Object-level precision and recall. (x-axis: object-level recall; y-axis: object-level precision; series: NN SL, NN CL, NN GA, HAC SL, HAC CL, HAC GA, d-serieslike patterns.)

FIG. 7. The locations of crimes in the first series.

Case Studies

We performed a blind test, where we aimed to detect crime patterns from 2007 to 2012, for which we do not have pattern data. The results were analyzed by hand by crime analysts.

Case study 1

One particularly interesting crime series includes 10

crimes from November 2006 to March 2007. Figure 7

shows geographically where these crimes were located.

Table 2 provides some of the details about the crimes

within the series.

From the (pattern-general) similarity graph, we iso-

lated the subgraph containing the 10 crimes, which is

diagrammed in Figure 8. Crimes 1 to 5 are well con-

nected as a subset, and crimes 6 to 10 are well con-

nected as another subset. From only the similarity

graph, the two subsets do not seem very related except

for a single edge between crimes {5, 6} connecting

them; however, this is only the pattern-general part

of the story.

We used the integer linear program (12) to discover

cores of size 3 with at least 6 deﬁning features. Table 3

lists the cores and their deﬁning features in this series,

where a check mark means the feature is a deﬁning fea-

ture and a circle means it is not. These cores show how

the crimes are similar to each other in a pattern-speciﬁc

way. The next step is merging the cores. The deﬁning

feature set P(G) was chosen to include six features,

which are geographic location, days apart, location of

entry, the ransacked indicator, time of day, and day of

the week. One of the cores, core 16 in Table 3, was not included because geographic location is not a defining feature for that core; the rest of the cores were

merged. (The same set of crimes is in the merged pattern

regardless of whether core 16 was used for the merge.)

As these data were reconsidered by crime analysts, we found out that when these crimes were analyzed back in 2006–2007, they were viewed as two unrelated patterns, one at the end of 2006, crimes 1 to 5, and one at the beginning of 2007, crimes 6 to 10. The connection between these two subsets of crime is very subtle, and there is over a month gap between the two patterns, so it did not occur to the crime analysts to link them. Their intuition agrees completely with the similarity graph, as the two subsets are weakly connected only by one edge; however, recall that this only describes the pattern-general similarity, that is, what one would expect from a generic pattern without considering a specific M.O. On examination of the cores, not only are they correlated in six features, but five of the cores (core indices 11, 12, 14, 15, 16) contain crimes from both of the subsets, which is strong evidence that the two subsets should be merged together. It is particularly interesting that the core consisting of crimes 5, 6, and 7 spanned the two subsets, where these crimes share the unusual feature that residents were present during the break-in.

Table 2. Example 1 of a Serieslike Pattern with d = 6

| No. | Date | Location of entry | Means of entry | Premises | Ransacked | Residents | Time of day | Day | Suspect | Victim |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 11/8/06 | Basement door | Unknown | Unknown | No | Not in | 10:45–15:00 | Wed | Null | 1 F |
| 2 | 11/8/06 | Front door | Pried | Unknown | No | Not in | 8:00–18:30 | Wed | Null | 1 M |
| 3 | 11/16/06 | Front door | Shoved/forced | Unknown | No | Not in | 9:00–17:00 | Thur | Null | 1 M |
| 4 | 12/7/06 | Front door | Pried | Unknown | No | Not in | 9:00–17:00 | Thur | Null | 1 F |
| 5 | 12/22/06 | Front door | Pried | Unknown | No | In | 11:48 | Fri | Null | 1 M |
| 6 | 2/1/07 | Front door | Shoved/forced | Unknown | No | In | 14:45 | Thur | 3 Males | 1 F |
| 7 | 2/15/07 | Front door | Unknown | Aptment | No | In | 12:00–13:30 | Thur | Null | 2 F |
| 8 | 3/5/07 | Front door | Shoved/forced | Aptment | No | Not in | 12:22–14:56 | Mon | Null | White M |
| 9 | 3/5/07 | Front door | Broke | Aptment | No | Not in | 12:22–14:56 | Mon | Null | 1 F & 1 M |
| 10 | 3/8/07 | Front door | Pried | Aptment | No | Not in | 12:50–13:30 | Thur | Null | 1 M |

F, female; M, male.

FIG. 8. Similarity graph for the first crime series.

We wondered why the offenders changed their M.O.

to move a few blocks north since they committed

crimes 1–6 in the same area. What may have happened

is that the criminals left near the end of December (just

before the holidays), and returned in February to com-

mit housebreak number 6 in the same area as 1–5;

however, they were witnessed committing crime 6 (sus-

pect information in crime 6 reads ‘‘3 males’’). This may

have spooked the offenders, causing them to alter their

M.O. by moving north to commit crimes 7–10. Ana-

lysts now believe that these two series were actually a

single series, and that the suspect information from

crime 6 can be carried through to all the crimes in

the discovered series.

This is a good example to show how crime patterns

can be composed of cores and exhibit similarity both in

a pattern-general way and a pattern-speciﬁc way. It

shows how we can use both aspects to mine patterns.

This is a pattern that would be very difﬁcult for a

crime analyst to ﬁnd: the M.O. changes over time,

there was a long break in the middle of the series,

and there was nothing deterministic (e.g., ﬁngerprints)

linking these crimes.

Table 3. Cores and Their Deﬁning Features for Example 1

Cores Crimes

Geo

loc

Days

apart

Loc of

entry

Means

of entry Premises Ransacked Residents

Time

of day Day Suspect Victim

1123XX X OO XOXX OO

3125XX X OO XOXX OO

4134XX X OO XOXX OO

5135XX X OO XOXX OO

6145XX X OO XOXX OO

7156XX X OO XOXX OO

8234XX X OO XOXX OO

9235XX X OO XOXX OX

10 2 4 5 XX X X OXOXX OO

11 2 4 6 XX X OO XOXX OO

12 2 5 6 XX X OO XOXX OO

13 3 4 5 XX X OO XOXX OO

14 3 5 6 XX X OO XOXX OO

15 4 5 6 XX X OO XOXX OO

16 5 6 7 OXX OO XXXXOO

17 6 8 9 XX X OO XOXX OO

18 6 8 10 XX X OO XOXX OO

19 6 9 10 XX X OO XOXX OO

20 7 8 9 XX X OXX OXX OO

21 7 8 10 XX X OXX OXX OO

22 7 9 10 XX X OXX OXX OO

23 8 9 10 XX X OXX OXX OO

Table 4. Data from the Second Crime Series

| No. | Date | Location of entry | Means of entry | Premises | Ransacked | Residents | Time of day | Day | Suspect | Victim |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1/25/07 | Front Door | Punched/Popped | Apartment | No | Not in | 10:20–12:00 | Thur | null | 1 F |
| 2 | 1/25/07 | Unknown | Cut Screen | Apartment | No | Not in | 8:45–14:30 | Thur | null | 1 M |
| 3 | 1/25/07 | Unknown | Pried | Apartment | No | Not in | 9:10–21:00 | Thur | null | 1 F |
| 4 | 1/25/07 | Front door | Unlocked | Apartment | No | In | 13:00–13:30 | Mon | null | 1 F |
| 5 | 1/29/07 | Front door | Key | Apartment | No | Not in | 14:52–14:52 | Mon | null | 2 M & 3 F |
| 6 | 1/29/07 | Unknown | Unknown | Apartment | No | Not in | 12:00–12:00 | Mon | 1 M | 1 M |
| 7 | 1/29/07 | Unknown | Unknown | Apartment | No | Not in | 15:00–15:00 | Mon | null | 2 M |

Case study 2

Table 4 and Figure 9 show a more typical pattern

in 2007 discovered by our method. The crimes were

committed on two dates in late January 2007, most of

them in the same building. According to the Cam-

bridge police, they arrested a suspect while he was com-

mitting the last crime in the series and conﬁrmed that

he did commit crimes 2–7. Note that the record for the

fourth crime records the means of entry as "key." This

is based only on the claim of a witness that the offender

possesses a master key. If the offender did possess a

master key to the apartments in the building, it

would explain why the means of entry was unknown

for other crimes in the series—the means of entry

would thus have been very difﬁcult to determine for

earlier crimes where residents were not present.

Case study 3

Our method has also been used to help ensure the ﬁdel-

ity of the historical database of past crimes. We ran our

method using data collected prior to 2006 to see

whether we would be able to make discoveries.

FIG. 9. The locations of crimes in the second series.

Table 5. The Third Serieslike Pattern with d = 5

| No. | Date | Location of entry | Means of entry | Premises | Ransacked | Residents | Time of day | Day | Suspect | Victim |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2/9/05 | Ground window | Shoved/forced | Apartment | No | In | 1:47 | Wed | 1 White M | 1 F |
| 2 | 2/9/05 | Rear door | Shoved/forced | Single-family House | No | In | 9:50 | Wed | 1 Black M | 1 M & 1 F |
| 3 | 2/15/05 | Ground window | Broke | Apartment | Yes | Not in | 7:00–13:30 | Tue | Null | 2 M |
| 4 | 2/21/05 | Front door | Key | Unknown | No | Not in | 7:10–10:00 | Mon | Null | 1 F |
| 5 | 2/23/05 | Front door | Pried | Apartment | No | Not in | 7:10–16:00 | Wed | Null | 2 M |
| 6 | 2/23/05 | Front door | Pried | Apartment | No | Not in | 7:00–14:00 | Wed | Null | 2 M |
| 7 | 2/23/05 | Front door | Pried | Apartment | No | Not in | 7:45–17:25 | Wed | Null | 2 F |
| 8 | 2/28/05 | Rear door | Unknown | Apartment | No | Not in | 20:55 | Mon | 1 White M | 1 F |

Table 5 shows details of the crimes within a 2005 pattern discovered by both the police and our algorithm. Our algorithm and the crime analysts agreed on six of the eight crimes in the series but disagreed on two: crimes 3 and 4. The crime analysts identified these crimes as part of the pattern, but our algorithm did not. Our algorithm provides reasons why these crimes should be excluded from the pattern: they are not close to the other crimes in the pattern-general sense, and they are not connected to the other crimes in the series within the similarity graph, as depicted in Figure 10. Neither crime is contained in any core, as shown in Table 6. In particular, the map in Figure 11 shows that these two crimes are geographically far from the other crimes. Since geographic closeness contributes heavily to pattern-general similarity, crimes 3 and 4 are already unlikely to be part of the same series. Beyond that, other aspects of crimes 3 and 4 differ from the rest of the pattern (see Table 5). In crime 4, the means of entry was a key, unlike the other crimes, in which entry was forced or pried. According to police narratives, the fourth crime had a 91-year-old victim, who reported that $35 was stolen from her purse by someone who used a key to enter the apartment; this indicates that the crime was not part of the series, and it raises the possibility that the crime never actually occurred. In crime 3, the apartment was ransacked, whereas none of the other locations were. The narrative for the third crime is very generic: someone broke in through the living room window, stole jewelry and loose change, and exited through the front door. It is not clear that this crime possesses any distinguishing characteristics that would identify it as part of the series. When examining the crime series retrospectively, the police agreed that crimes 3 and 4 likely should be excluded from the pattern (note that this pattern has not been officially verified through an arrest).

FIG. 10. Similarity graph for the third crime series.

FIG. 11. The locations of crimes in the third series.
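The coverage argument above can be checked mechanically. The following minimal Python sketch transcribes the eleven cores listed in Table 6 as sets of crime numbers and confirms which of the eight analyst-identified crimes belong to at least one core. The variable names (cores, analyst_series, covered, core_counts) are illustrative and are not part of the method itself; the sketch only restates the table and does not reproduce the integer linear programming step that finds the cores.

```python
# Cores from Table 6, each written as the set of crime numbers it contains.
cores = [
    {1, 2, 8}, {2, 5, 6}, {2, 5, 7}, {2, 5, 8}, {2, 6, 7}, {2, 6, 8},
    {2, 7, 8}, {5, 6, 7}, {5, 6, 8}, {5, 7, 8}, {6, 7, 8},
]

# The eight crimes that the analysts placed in the series (Table 5).
analyst_series = set(range(1, 9))

# A crime is "covered" if it belongs to at least one core.
covered = set().union(*cores)
uncovered = analyst_series - covered

# Number of cores in which each crime participates.
core_counts = {
    crime: sum(crime in core for core in cores)
    for crime in sorted(analyst_series)
}

print("covered by some core:", sorted(covered))
print("in no core:", sorted(uncovered))
print("cores per crime:", core_counts)
```

Under this transcription, crimes 3 and 4 are the only ones that appear in no core, while crimes 2 and 8 each appear in seven of the eleven cores.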

Conclusions

The problem of detecting crime series is both subtle and difficult. In cases where crime series detection is trivial, meaning that there is identifying information about the criminal (e.g., fingerprints or DNA), sophisticated modeling techniques are not needed; the solution is direct in those cases. The more subtle cases, where crime series might otherwise remain hidden in a sea of data, are where data mining methods can shine. Our automated series detection methods are able to detect crime patterns within big data sets where a human could not. Detecting patterns among crime data is critical to effective policing. When a pattern is identified, police can implement effective strategies to prevent future crimes, solve past crimes, identify and capture offenders, and ensure offenders are investigated and prosecuted fully for the crimes for which they are responsible. We envision that methods such as ours could be used very broadly for crime series detection, and would be particularly useful across regions or districts, where human crime analysts cannot manually handle the vast amounts of data from several different regions or data sources in order to detect patterns.

Acknowledgment

This work was supported by a grant from MIT Lincoln

Laboratories.

Author Disclosure Statement

No competing financial interests exist.

Table 6. Cores and Their Defining Features for Example 4

Cores | Crimes | Geographic location | Days apart | Location of entry | Means of entry | Premises | Ransacked | Residents | Time of day | Day | Suspect | Victim
1 | 1, 2, 8 | O | X | X | O | X | X | O | O | X | O | O
2 | 2, 5, 6 | O | X | X | O | X | X | O | X | X | O | O
3 | 2, 5, 7 | O | X | X | O | X | X | O | X | X | O | O
4 | 2, 5, 8 | O | X | X | O | X | X | O | O | X | O | O
5 | 2, 6, 7 | O | X | X | O | X | X | O | X | X | O | O
6 | 2, 6, 8 | O | X | X | O | X | X | O | O | X | O | O
7 | 2, 7, 8 | O | X | X | O | X | X | O | O | X | O | O
8 | 5, 6, 7 | X | X | X | X | X | X | O | X | X | O | X
9 | 5, 6, 8 | X | X | X | O | X | X | O | O | X | O | O
10 | 5, 7, 8 | X | X | X | O | X | X | O | O | X | O | O
11 | 6, 7, 8 | X | X | X | O | X | X | O | O | X | O | O

Cite this article as: Wang T, Rudin C, Wagner D, Sevieri R (2015)

Finding patterns with a rotten core: data mining for crime series with

cores. Big Data 3:1, 3–21, DOI: 10.1089/big.2014.0021.

Abbreviations Used

M.O. = modus operandi

HAC = hierarchical agglomerative clustering

NN = nearest neighbor

SL = single linkage

CL = complete linkage

GA = group average

ILP = integer linear programming
