Network Discovery Using
Wide-Area Surveillance Data
Steven T. Smith, Andrew Silberfarb, Scott Philips, Edward K. Kao, and Christian Anderson*
MIT Lincoln Laboratory; 244 Wood Street; Lexington, MA 02420
{ stsmith, drews, scott.philips, edward.kao, christian.anderson }@ll.mit.edu
Abstract—Network discovery of clandestine groups and their
organization is a primary objective of wide-area surveillance
systems. An overall approach and workflow to discover a fore-
ground network embedded within a much larger background,
using vehicle tracks observed in wide-area video surveillance
data is presented and analyzed in this paper. The approach
consists of four steps, each with its own specific algorithms:
vehicle tracking, destination detection, cued graph exploration,
and cued graph detection. Cued graph exploration on the
simulated insurgent network data is shown to discover 87% of the
foreground graph using only 0.5% of the total tracks or graph’s
total size. Graph detection on the explored graphs is shown to
achieve a probability of detection of 87% with a 1.5% false
alarm probability. We use wide-area, aerial video imagery and
a simulated vehicle network data set that contains a clandestine
insurgent network to evaluate algorithm performance. The pro-
posed approach offers significant improvements in human analyst
efficiency by cueing analysts to examine the most significant parts
of wide-area surveillance data.
Keywords: Network discovery, graph detection, wide-area
surveillance, tracking, graph sampling, spectral detection.
I. PROBLEM STATEMENT
Network discovery of clandestine groups and their organiza-
tion is a primary objective of wide-area surveillance systems.
Discovering such networks hidden within a sea of normal
activity is a very challenging problem [16]. Building reliable,
high-confidence graphs representing networks of actors within
the field of view is difficult because the dynamic links between
network nodes have varying reliability, and because the total
number of links—the size of the underlying network—is very
large. With these realities in mind, an overall approach and
workflow to discover a foreground network embedded within
a much larger background using wide-area surveillance data
is presented and analyzed in this paper. The input to this
process is surveillance data, and the output is a semi-automated
estimate of the foreground network of interest represented as
a graph whose vertices or nodes represent geographical sites
within the field of view, and whose edges represent vehicle
tracks between the source and destination nodes. The methods
developed in this paper to address this problem are general
enough to apply to a variety of measurement modes. The
paper will focus on aerial video data; an example image is
shown in Figure 1. The problem is one of discrete detection
*This work is sponsored by the United States Department of Defense under
Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions
and recommendations are those of the author and are not necessarily endorsed
by the United States Government.
Figure 1. Example image of a wide-area surveillance video from the AFRL
CLIF data set [1].
and estimation: estimate a graph within a scene given a
collection of time-sequenced imagery and detect a foreground
subgraph within the estimated graph. The detected subgraph
represents the discovered foreground network in the presence
of a larger background network. The distinction between
“foreground” and “background” networks is problem-specific
and necessarily depends upon external cues.
Advances in sensor and digital storage technology, as well
as significant reductions in the cost of both, enable the collection of large amounts of surveillance data. For example, video imagery of a 10 km-by-10 km scene can yield terabytes of
data per hour. Exploiting these large data sets demands inten-
sive processing capabilities, highly efficient, low-complexity
algorithms, and the ability to focus on the subset of the data
that is most relevant. The challenge in exploiting this data
is not in the frontend sensor data or processing, but in the
backend determination of which small fraction of the data
is relevant to human analysts. Because analyst resources are
limited and expensive, automated or semi-automated tools that
cue the analysts offer the potential for significant performance
gains in the task of constructing foreground graphs from
surveillance data.
Given a time series of wide-area surveillance imagery, the
14th International Conference on Information Fusion
Chicago, Illinois, USA, July 5-8, 2011
978-0-9824438-3-5 ©2011 ISIF
Figure 2. Workflow for network discovery from wide-area surveillance data: vehicle detection and tracking, destination detection and clustering, graph exploration, and graph detection.
frontend problem is to detect the moving objects within the
field of view, typically individuals or vehicles, then associate
and track these movers to provide frame-to-frame state esti-
mates of their position and dynamics. Because the objective
is to form a graph whose vertices and edges depend upon the tracks themselves, the accuracy of the graph exploration and network detection steps is highly sensitive both to errors in the identification of track destinations and to tracking errors such as breaks and swaps. Misidentification of a track
destination leads directly to incorrect association to other sites
while the correct site is potentially ignored. Therefore, such a
system demands very low probabilities of site/track misassoci-
ation and track errors because they propagate and accumulate through the exploration and detection steps. In practice, track
breaks and crossings are a problem, especially in target dense
environments. Fully automated solutions, though desirable,
suffer from the practical problems of incorrect associations.
Tracking is currently implemented as a semi-automated pro-
cess that requires analysts to examine and repair any tracks
deemed to be relevant to the construction of the foreground
graph. Prioritization of which tracks analysts should examine
then becomes a key part of the workflow involved in network
discovery.
This paper is organized into sections that describe each
part of the proposed workflow (Figure 2): the overall ap-
proach (Section II), vehicle tracking (Section III), destination
detection and clustering (Section IV), cued graph exploration
and prioritization (Section V), and cued graph detection (Sec-
tion VI).
II. APPROACH AND WORKFLOW
We propose a semi-automated approach to the problem
of network discovery that allocates human and computer
resources at the tasks for which they are best suited. The
huge volume of video data is first processed to extract tracks
and determine destinations. External cues identify the part of
the scene that is of interest. This allows the human analyst
to explore a graph that contains the foreground network and
ultimately to detect this network within the explored graph.
Distinct algorithms are used for each step in this workflow,
which comprises four basic steps: tracking movers, destination detection and clustering, cued graph exploration with
prioritization, and cued graph detection. Tracking involves
detecting movers within the field of view and tracking these
using a feature-aided association algorithm/tracker.
Error-free tracker performance is not assumed or required,
as analysts will be used to assess track quality to ensure
that high-confidence vertices and edges are added to the
graph, thereby avoiding the accumulation and propagation
of errors in graph construction. Once tracks are available,
their destinations are determined, and these destinations are
clustered into hypothesized sites. Ultimately, a network will
be constructed between geographical sites within the
scene; these sites are defined by the output of the destination
clustering method described in Section IV, or predefined using
existing knowledge, or defined by an analyst using external
information. This processing provides the raw ingredients for
the exploration of the graph, and the analyst may promote
these sites and tracks as vertices and edges in the candi-
date foreground network. As more nodes are added to the
graph, these nodes are considered candidate foreground nodes
because of their association with the cued foreground node.
A graph exploration algorithm is run to prioritize each track
emanating from the new nodes. In the simplest case, without
explicitly modeling the foreground network, this exploration
algorithm simply computes the distance between the node a
track emanates from and the cued foreground node. More
sophisticated models of the foreground network may use time,
spatial proximity, network topology, site and track features
to compute track priority. As analysts promote more vertices
and edges to the constructed graph, a community detection
algorithm can be used to label each vertex as part of the
foreground subgraph.
The process of graph exploration yields a network contain-
ing some fraction of the foreground network, yet does not
make the final distinction between foreground and background
sites. This binary decision—graph detection—is the final step
of the process. Note that from the perspective of hypothesis
testing, the difference between graph exploration and graph
detection is that graph exploration yields a decision about
which specific vertices and edges should be added to the graph
or not, whereas graph detection yields a decision about which
of the graph vertices belong to the foreground network.
The metrics used to evaluate performance in this process
are related to standard detection metrics such as the receiver
operating characteristic (ROC) [15]. For graph exploration, the
objective is to uncover as many of the foreground sites as
possible while minimizing the number of tracks examined and
background sites investigated. A superior graph exploration
algorithm discovers the same number of foreground vertices
but with a lower number of tracks or background vertices
examined (Figures 8 and 9 in Section V). As usual, detection
performance evaluation requires knowledge of the true fore-
ground network, which is available either from knowledge of
a simulation, experimental setup, or, ideally, a nontrivial real-
world example thoroughly examined by experienced analysts.
Classically, the ROC is determined by three key operating
Figure 3. Simulated insurgent network graph comprised of 4,478 locations
and 116,720 tracks. The foreground subgraph is shown using red nodes and
edges, and the background graph is shown using blue nodes and gray edges.
For clarity, only a partial graph is shown here.
parameters: the probability of detection (PD), probability of
false alarm (PFA), and signal-to-noise ratio (SNR), and algo-
rithms are typically evaluated by comparing PDs at a constant
false alarm rate (CFAR), or by comparing PFAs at a specific
PD, all at a fixed SNR. In the context of graph exploration,
the percentage of the true foreground network sites uncovered
corresponds to PD, the number of background sites uncovered
corresponds to PFA, and the number of tracks required to
construct this fraction of the graph corresponds to the SNR.
Note that the percentage of foreground sites uncovered is
necessarily monotonic with the number of tracks explored,
just as PD is monotonic with classical SNR.
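The exploration analogues of PD, PFA, and SNR described above can be traced directly from an exploration run. The following sketch (not from the paper; function and variable names are illustrative) computes, at each step, the fraction of foreground sites uncovered and the number of background sites examined:

```python
def exploration_curves(visited_sites, foreground, background):
    """Trace PD/PFA-style curves for graph exploration.

    visited_sites: site ids in the order exploration uncovered them
    foreground, background: sets of true foreground/background site ids
    Returns per-step fraction of foreground uncovered (PD analogue)
    and count of background sites examined (PFA analogue); the step
    index itself plays the role of the SNR-like track budget."""
    pd_curve, bg_curve = [], []
    fg_found, bg_found = set(), set()
    for site in visited_sites:
        if site in foreground:
            fg_found.add(site)
        elif site in background:
            bg_found.add(site)
        pd_curve.append(len(fg_found) / len(foreground))
        bg_curve.append(len(bg_found))
    return pd_curve, bg_curve
```

By construction the PD-analogue curve is monotonically nondecreasing in the number of tracks explored, matching the observation in the text.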
Two independent datasets are used to evaluate performance
in this paper. The first consists of video imagery acquired
by a wide-area airborne sensor over a metropolitan area as
part of a collection called Project Bluegrass. Vehicle tracks
generated using this data were used in the destination detection
algorithms described in Section IV. The second dataset is
comprised of simulated vehicle motion data made available by
the National Geospatial-Intelligence Agency (NGA). This data
is derived from a scripted scenario that contains a clandestine
insurgent network and will be called the simulated insurgent
network data (Figure 3). The simulated data covers a 48-hour
time period and consists of approximately 116,720 vehicle
tracks between 4,478 locations made by 4,623 individual
actors. Of these, 31 locations and 22 actors are part of
the insurgent network. The simulated tracks in this data are
perfectly accurate and unambiguous. Consequently, there are
no complications arising from imperfect tracks, and destination
detection and clustering are not necessary. The insurgent
network data is therefore appropriate for evaluating graph
exploration and detection algorithms (Sections V and VI).
III. VEHICLE TRACKING
Tracking of many vehicles in wide-area video scenes is
necessary to perform graph construction; however, accurate
video tracking by analysts is a manually intensive process,
requiring an analyst to focus on individual vehicles and
follow them from start to finish. This would seem to be a
task ideally suited for automation; however, current tracking
algorithms yield insufficiently accurate results to produce
reliable graphs. Tracks for network construction must connect
two destinations correctly, i.e., they must track vehicles from
source to destination. Vehicle tracking algorithms often make
mistakes in tracking vehicles through obscurations or in heavy
traffic environments. In these cases, automated algorithms
often swap tracks between vehicles, introducing false connec-
tions between unassociated destinations. Furthermore, the fact
that tracks are often composed of hundreds of independent
detections provides many chances for errors to occur. Even an
algorithm with 99% accuracy would likely make one or more
errors per track given the large number of opportunities for
error; therefore, current automated methods cannot be relied
upon to provide accurate, fully automated tracks for graph
construction. However, the automated tracks are very useful in
semi-automated tracking because relevant but imperfect tracks
can be repaired by analysts. Tracking algorithm specifics and
performance evaluation is beyond the scope of this paper.
Nonetheless, imperfect tracks are also useful for destination
detection, as described in Section IV.
IV. DESTINATION DETECTION
Construction of vehicle movement graphs requires a map-
ping from the behavior of real world vehicle motions to the
nodes and edges of a graph. Edges are generally defined as
tracks that link two destination points, e.g., a house and a
gas station. Nodes are then the set of discrete destinations in
the underlying vehicle movement network. Because vehicles
arriving at the same destination do not stop at identical
locations, automated detection of clusters of stopping vehi-
cles is needed to infer vehicle destinations. Additionally, the
automated algorithm provides information about destination
type, including the size of the destination and the number of
stops associated with it. This additional contextual information
can be used to better exploit the vehicle movement data.
Throughout this section “stop” refers to any time a vehicle
under track is not moving. Hence any start, stop, or pause of
a tracked vehicle is declared a stop for the purpose of graph
construction.
The most natural way to define a destination is to have
a human look at a map and classify areas into destinations,
such as parking lots and driveways. Any regions that are not
classified as some destination are then referred to as transient
regions, e.g., roads. There are some dual use regions, such as
roadsides, whose use may depend upon the day, time of day or
other extraneous factors. Ignoring this complication, we tasked
human analysts to create destination truth sets based on video
imagery from the Bluegrass data set.
The goal of this section is to reproduce the manually
classified destinations as closely as possible using solely
automated techniques. Manual segmentation uses contextual
data which may not be available to a computer program.
Instead, the automated algorithm uses the motion data from
nearby vehicles. A destination is a place where groups of
vehicles stop for prolonged periods of time.
Implementation of an automated algorithm requires two
steps. First, a way to discriminate destination stops, i.e., stops
occurring at a destination, is needed. Often, vehicles will stop
at intersections or for traffic related reasons; while these stops
can have a long duration, they are not of interest in declaring
destinations. Given the set of all destination stops, we need a
way of identifying the discrete destinations that are visited by
these stops. That is, we must identify driveways and parking
lots, rather than generic stopping regions. These problems are
addressed in order.
Discrimination of destination stops from transient stops is
automatically performed using hypothesis testing with con-
textual information. Specifically, a computer can analyze the
behavior of nearby traffic and compute the probability that a
stop at a given position is a destination stop or a transient
stop. If much of the traffic previously observed to be traveling
near a position has been traveling at high speed, then it is
unlikely that a stop at that position is a destination stop.
Conversely, if almost all of the traffic near that position
has high decelerations, travels slowly, and stops frequently,
a car stopping at that position is likely to have arrived at its
destination. Technically, we use contextual data from observed
traffic to create a Gaussian-mixture prior with fixed variance
for the probability that a car stopping at position x has reached
a destination,
P(H_d | x) = Σ_{y∈Y_d} N(|y − x|; σ) / Σ_{y∈Y} N(|y − x|; σ),   (1)

where H_d is the destination hypothesis, Y_d is a set of likely destination stops, and Y is the set of all stops. In the current algorithm, we use the tracker's internal state at termination of the track to choose the set Y_d. The features of the specific stop under test, including duration and abruptness, collectively Θ, are then used to update this prior based on the specifics of the stop to arrive at the posterior probability that a given stop is at a destination,

P(H_d | Θ, x) = P(Θ | H_d) P(H_d | x) / Σ_{H∈{H_d, H_t}} P(Θ | H) P(H | x),   (2)

a consequence of Bayes' rule and the conditional independence of Θ and x given H_d; H_t is the transient hypothesis, d denotes a destination stop, and t denotes a transient stop.
A likelihood test is then used to classify all stops into either
transients or destinations.
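As an illustrative sketch of Eqs. (1) and (2) — with hypothetical function names, and noting that the Gaussian normalization constant cancels in the ratio of Eq. (1), so an unnormalized kernel suffices — the prior and posterior might be computed as:

```python
import math

def destination_prior(x, dest_stops, all_stops, sigma):
    """Spatial prior P(H_d | x) of Eq. (1): ratio of Gaussian kernel mass
    contributed by likely destination stops Y_d to that of all stops Y."""
    def kernel(y):
        d = math.dist(x, y)
        return math.exp(-d * d / (2.0 * sigma * sigma))
    num = sum(kernel(y) for y in dest_stops)
    den = sum(kernel(y) for y in all_stops)
    return num / den if den > 0 else 0.0

def destination_posterior(lik_dest, lik_transient, prior_dest):
    """Bayes update of Eq. (2): combine the spatial prior with the
    stop-feature likelihoods P(Theta | H_d) and P(Theta | H_t)."""
    num = lik_dest * prior_dest
    den = num + lik_transient * (1.0 - prior_dest)
    return num / den if den > 0 else 0.0
```

A stop would then be declared a destination stop when the posterior exceeds a likelihood-test threshold.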
Figure 4. Graphical model of the clustering algorithm. The hyperparameters m, σ_0, and θ specify how the clusters, k, are generated. The cluster parameters µ_k, Σ_k, and p_k then specify how the individual data points x_n are generated, based on which cluster z_n the points belong to.
The above algorithm was tested on the Bluegrass data
against manually declared stop locations. The probability of
detecting destination stops was 54%, at a false alarm rate
of 1% on all track ends. Stops were missed either from an
inability to track persistently or failure of the hypothesis test.
Note that this is an initial implementation of the algorithm
which we intend to improve. Specifically, contextual information was only learned from nearby stops, whereas information from nearby moving vehicles would seem to provide additional useful information (e.g., stopping in the middle of a busy road is contraindicated).
The identified destination stops must now be clustered into
discrete destinations. There is not a single correct way to
perform this clustering, as the edges of parking lots need not be
well defined. Here the goal will be to obtain a clustering which
closely matches manual clustering results on the Bluegrass
data, and which scales gracefully as more data is added and
as the area under consideration is increased.
To perform this clustering, we first construct a simple
generative model for destinations and then use inference
techniques to optimize the model parameters which include the
locations and sizes of the destinations. A graphical diagram
of this model is presented in Figure 4. In the simple model
we approximate a destination as a Gaussian distribution of
destination stops. In other words, given a destination's mean, µ_k, and its covariance, Σ_k, the stops associated with this destination will be Gaussian distributed as N(µ_k, Σ_k). The
means of these Gaussians are assumed to be distributed
uniformly over the area of interest, and the variances according
to an inverse Wishart distribution. Each destination will also
have an associated rate of vehicle stops. One expects this rate
to vary between destinations as more vehicles stop in a parking
lot than in a driveway. Ignoring the total rate of vehicle stops
at all destinations, we only consider the probability p_k that a given stopping vehicle will stop at a specific destination k. By
only modeling this probability we allow the algorithm to scale
gracefully as the total number of stops considered increases.
This probability of a stop being at a given destination is
modeled using a Dirichlet process prior [5].
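The Dirichlet process prior on p_k can be illustrated through its sequential (Chinese restaurant process) form: stop i joins an existing destination k with probability proportional to its current stop count n_k, and opens a new destination with probability proportional to the concentration parameter θ. A minimal sketch (illustrative, not the paper's implementation):

```python
import random

def crp_assignments(n_stops, theta, seed=0):
    """Draw stop-to-destination assignments from the sequential
    (Chinese restaurant process) form of the Dirichlet process prior."""
    rng = random.Random(seed)
    counts = []        # stops per destination opened so far
    assignments = []
    for i in range(n_stops):
        # existing destination k with prob n_k/(i + theta);
        # a new destination with prob theta/(i + theta)
        r = rng.uniform(0, i + theta)
        acc = 0.0
        for k, nk in enumerate(counts):
            acc += nk
            if r < acc:
                counts[k] += 1
                assignments.append(k)
                break
        else:
            counts.append(1)
            assignments.append(len(counts) - 1)
    return assignments
```

The rich-get-richer behavior of this prior matches the intuition in the text that stop rates vary widely between destinations (parking lots versus driveways).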
This generative model allows us to jointly optimize the
number of destinations along with their locations, widths, and
densities. Starting from Bayes’ rule, we can integrate out the
nuisance parameters and obtain a closed-form solution for the
probability that any given set of destinations is correct [8].
The cost of fitting a given dataset partitioned as {x} to this model is then C = −log P({x}), which depends upon the model's hyperparameters m, σ_0, and θ. This cost can be expressed with m = 3 as

C({x}, σ_0, θ) = Σ_k [ ((n_k + 2)/2) log det(σ_0² I + n_k Σ_k) + 2 log Γ(n_k) + log θ ],   (3)
where Σ_k is the sample covariance matrix of the points in cluster k, n_k is the number of points in cluster k, and Γ(n) = (n − 1)! is the gamma function. We use a variational Bayesian expectation maximization (VBEM) algorithm [2], [8] to minimize the cost for the observed data, which produces discrete destination clusters specified by p_k, µ_k, and Σ_k, denoting the density, location, and width of each destination, respectively. The EM algorithm is initialized by randomly selecting points and placing them into clusters of size σ_0. The iterations of the EM algorithm then allow the points to move between clusters (i.e., changing their membership z_n), allow clusters to merge, and allow new clusters to be formed by single points. Each iteration reduces the cost. When the cost no longer changes, the algorithm terminates, and the parameters of the resulting clusters then specify the discrete destinations.
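A direct transcription of the per-cluster term of the cost in Eq. (3) might look as follows, here specialized to 2-D stop locations (a sketch with hypothetical names; using the population-form sample covariance is an assumption):

```python
import math

def cluster_cost(points, sigma0, theta):
    """Per-cluster term of Eq. (3) with m = 3, for 2-D points:
    ((n_k + 2)/2) log det(sigma0^2 I + n_k Sigma_k)
      + 2 log Gamma(n_k) + log theta."""
    n_k = len(points)
    # population-form sample covariance Sigma_k of the cluster
    mx = sum(p[0] for p in points) / n_k
    my = sum(p[1] for p in points) / n_k
    sxx = sum((p[0] - mx) ** 2 for p in points) / n_k
    syy = sum((p[1] - my) ** 2 for p in points) / n_k
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n_k
    # det(sigma0^2 I + n_k Sigma_k) for the 2x2 case
    a = sigma0 ** 2 + n_k * sxx
    d = sigma0 ** 2 + n_k * syy
    b = n_k * sxy
    det = a * d - b * b
    return (n_k + 2) / 2 * math.log(det) + 2 * math.lgamma(n_k) + math.log(theta)

def total_cost(clusters, sigma0, theta):
    """Total cost C({x}, sigma0, theta): a sum of independent cluster scores."""
    return sum(cluster_cost(c, sigma0, theta) for c in clusters)
```

Because the total is a sum of independent per-cluster terms, reassigning a point during an EM iteration only requires re-scoring the two clusters involved, which is what makes the procedure scale gracefully.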
Note that the cost function of Eq. (3) has several desirable
properties. First, it is the sum over independent cluster scores,
ensuring that the clustering procedure will scale gracefully as
clusters whose centers are far away from a given cluster will
not perturb its score. Second, the clusters in the model do not
interact, so that the parameters of the cluster may be estimated
solely from the points assigned to that cluster. Finally, the
cluster score has a minimum at a number of clusters that is
substantially less than the number of data points (except for
small σ_0). This property provides a parsimonious description
which we would expect, given the limited number of parking
lots and driveways present in the world.
There are three key parameters needed to specify the clustering score, σ_0, θ, and m, corresponding to the initial cluster width, the cluster density, and the strength of the prior on cluster width, respectively. The results are somewhat insensitive to the initial prior strength, and so we arbitrarily set m = 3. The cluster density, θ, is set by requiring that if two points are separated by a distance of 2σ_0, then the score for clustering them together should equal the score for clustering them separately. Thus, if two isolated points are closer together than 2σ_0 they will form a single cluster, while if they are further apart they will form two distinct clusters. Finally, the base width parameter σ_0 can be tuned to choose the clustering size. Large values of σ_0 will result in overclustering, while small values will induce significant underclustering. In practice, we choose
Figure 5. Example clustering of an intersection in Albuquerque, NM. Yellow
boxes are notional manual clustering. Blue circles are notional automated
clusters (one sigma ellipses). Red dots are the notional stops/data points that
are clustered. Actual clustering was evaluated using Bluegrass data.
Figure 6. Information coverage (IC) plot of the performance of the clustering algorithm [7]. The y-axis (information completeness) indicates the mutual information between the automated and manual destination declarations normalized by the total entropy. The x-axis (false information ratio) is the conditional entropy of the automated output given the manual destinations, again normalized by total entropy. The free parameter in the IC plot is σ_0, and the starred operating point is at σ_0 = 30 m.
σ_0 to be around 30 m, the approximate separation between
houses. This choice allows driveways for adjacent houses
to be clustered separately, while still permitting parking lots
to be clustered together. However, two immediately adjacent
driveways could be clustered together, and large parking lots
may be broken up if the distribution of cars is insufficiently
dense or insufficiently uniform.
Figure 5 shows an example clustering result on vehicle destinations around an urban intersection. Quantitative evaluation
of the automated clustering performance is done by comparing
it to the human clustering on the Bluegrass data set, shown
in Figure 6. The comparison is done using an information-theoretic metric [7]. A performance curve is traced out by varying σ_0. At the optimal point, with false alarms and missed
detections weighted equally, the automated clustering captures
above 95% of the information from manual clustering; the
amount of false information introduced by the automated
clustering is less than 2% of the total information needed to
identify destinations. As noted, the optimal initial cluster size
is about the size of a residence, and comparable results are
obtained at nearby initial sizes. This result is expected, as the
initial cluster size is only a suggestion and the actual cluster
width is learned by the algorithm.
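The completeness and false-information quantities plotted in Figure 6 can be sketched from entropy counts over the two clusterings. The names here are hypothetical, and normalizing by the joint entropy is one plausible reading of "total entropy" in the text, not a statement of the exact metric in [7]:

```python
import math
from collections import Counter

def info_coverage(auto_labels, manual_labels):
    """Information-coverage-style metrics for two clusterings of the same
    stops: mutual information I(A;M) and conditional entropy H(A|M),
    both normalized by the joint entropy H(A,M)."""
    n = len(auto_labels)
    pa = Counter(auto_labels)                    # automated cluster sizes
    pm = Counter(manual_labels)                  # manual cluster sizes
    pj = Counter(zip(auto_labels, manual_labels))  # joint counts
    def H(counts):
        return -sum(c / n * math.log(c / n) for c in counts.values())
    h_joint = H(pj)
    mi = H(pa) + H(pm) - h_joint                 # I(A;M)
    cond = h_joint - H(pm)                       # H(A | M)
    return mi / h_joint, cond / h_joint
```

A perfect match between automated and manual clusterings yields completeness 1 and false-information ratio 0, the upper-left corner of the IC plot.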
V. CUED GRAPH EXPLORATION
After vehicle destinations have been identified and clus-
tered into geographical sites of interest, a method is needed
to connect these sites together through vehicle tracks. Sites
may be specified by either the destination clustering method
just described, or by an analyst using external information. By
identifying relationships between sites, it is possible to build a
network containing the subnetwork associated with the original
cue. As discussed in Section III, automatic tracking of cars in
dense urban environments is subject to many types of errors.
This means that any connection between two locations will
require a semi-automated approach with a human in the loop.
The challenge with any semi-automated approach is that
while humans can reduce the problem of track errors, the time
required to hand-verify every track throughout an entire city
is prohibitively expensive. Therefore, an approach to prioritize
analyst tasking is needed to focus effort on following the
vehicles most relevant to constructing the foreground network.
In the case of community detection, relevant vehicles are
vehicles most likely to be part of the community of interest.
Though not the focus of this paper, alternate schemes could
consider relevance from an information theoretic approach,
whereby relevant vehicles are ones that provide maximum
information content for discriminating the foreground from
the background network.
The proposed algorithm is a cued graph exploration ap-
proach whereby a location of interest is identified (a cue) and
a graph grown beginning at the cue location. This cue location
is assumed to be known a priori and represents a site used
by the foreground community. Beginning with a cue allows
analysts to focus their attention on vehicles in a local region
(in graph space) around a known foreground location.
The graph exploration algorithm is based on the breadth-
first search (BFS) algorithm [13]. In breadth-first search, a
graph is initially formed by following all vehicles departing
or arriving at the cue location (node). Then for each location
(node) found in step one, all vehicles are followed again. The
priority assigned to exploring edge E in the graph is therefore
given by
Priority(E) = min_{v ∈ v(E)} d(v, V_c),   (4)
Figure 7. Graphs constructed by following vehicles from one location to
another. The left figures show vehicle movement overlayed on the aerial
imagery while the right figures show the same movement represented as a
graph. From top to bottom each figure shows the graph at various stages of
exploration using BFS. The foreground subgraph is shown using red nodes,
and the background graph is shown using gray vertices. The cue vertex is
shown in yellow.
where d is the standard graph distance between vertices, v(E)
are the vertices of edge E, and V_c is the cued vertex. This
procedure is repeated until some fixed number of tracks are
explored. Nodes at the same distance away from the cue are
explored in random order and vehicles departing any given
node are followed in random order.
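The cued breadth-first exploration just described can be sketched as follows, with hypothetical names; each followed track (edge) is charged one unit of the analyst's budget, whether or not it leads to a new node:

```python
from collections import deque

def cued_bfs(adj, cue, max_tracks):
    """Breadth-first exploration from a cue node: follow every track
    out of the cue, then out of each newly found node, until the track
    budget is spent. adj maps node -> list of neighboring nodes."""
    dist = {cue: 0}          # hop distance of each discovered node
    queue = deque([cue])
    tracks_used = 0
    while queue and tracks_used < max_tracks:
        node = queue.popleft()
        for nbr in adj.get(node, []):
            if tracks_used >= max_tracks:
                break
            tracks_used += 1              # each followed track costs effort
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist, tracks_used
```

The `dist` values returned are exactly the graph distances used as edge priorities in Eq. (4).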
Figure 7 shows three graphs at various stages of exploration
using BFS on the simulated insurgent network data. Note that
while the vehicles may traverse long distances in physical
space, all locations discovered are within one, two, or three
hops from the cue node.
While BFS is good at exploring a neighborhood of nodes
surrounding a cue node, the final graph in Figure 7 demon-
strates a major drawback of this approach. The vast majority
of time in the final graph is spent exploring tracks leaving a
handful of high degree nodes. Under a fixed time constraint,
this would not be an efficient use of human resources. This
observation that BFS can be biased towards high degree nodes
has also been observed in a number of other studies [3], [9].
In order to combat the bias towards high degree nodes,
a degree-weighed BFS approach is implemented. In degree-
weighted BFS, nodes at the same distance from a cue node
are explored in order of their degree, with low degree nodes
explored first and high degree nodes explored last. Additional
models of node relevancy may also be used, depending upon
the observable information available for each node and track.
Because nodes represent clustered track destinations, it is
possible to estimate the degree of each node by counting the
number of destinations in each cluster, thereby providing an
estimate of node degree before the exploration stage.
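Because node degree can be estimated before exploration from the destination clusters, the degree-weighted ordering reduces to a sort of each frontier layer. The sketch below is illustrative; the cluster mapping is a hypothetical stand-in for the destination-detection output.

```python
def estimate_degrees(clusters):
    """Estimate node degree before exploration.

    clusters is a hypothetical output of the destination-detection
    step: a mapping from each node (clustered location) to the list of
    track destinations assigned to it.  The cluster size serves as a
    proxy for the node's true graph degree.
    """
    return {node: len(dests) for node, dests in clusters.items()}

def order_frontier(frontier, estimated_degree):
    """Order same-distance frontier nodes: low estimated degree first."""
    return sorted(frontier, key=lambda v: estimated_degree[v])
```

In the BFS loop, replacing the random shuffle of each layer with this sort yields degree-weighted BFS.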
The exploration strategies are compared using two metrics: one measures the time required to explore the graph (human resources), and the other measures efficiency at uncovering the foreground relative to the background (ROC analysis). Three search strategies, random walk, BFS, and degree-weighted BFS, are compared using each metric. The random walk strategy provides a baseline, representing a human analyst following random vehicles beginning at the cue node.
Figure 8 shows the percentage of the foreground network
found as a function of the number of tracks examined, which
is a proxy for the amount of human time required to uncover a
certain percentage of the foreground network. Figure 8 shows
that a local search in graph space such as BFS uncovers more of the foreground network faster than an undirected random search. Better still is degree-weighted BFS, which finds 80% of the foreground in approximately 250 tracks, while standard BFS requires nearly 800, a significant threefold savings in human resources.
Figure 9 shows the percentage of foreground network found
against the percentage of the background network found.
This is similar to a traditional ROC curve, indicating PD/PFA performance regardless of the human time required. The results in Figure 9 are consistent with those of Figure 8, with degree-weighted BFS outperforming the other search methods.
VI. CUED GRAPH DETECTION
Network detection is the final step of the network discovery workflow. It provides discrimination between the foreground
and background network on the graph constructed from the
previous steps. The output of the detection step determines
the final overall performance of the system.
Intuitively, connectivity among nodes in the foreground network is stronger than the connectivity between background nodes and the foreground network. Discrimination based on this structural difference has been demonstrated using spectral methods by Newman [14]. Similarly, we perform detection
based on the projection of a graph into the subspace spanned
by a few eigenvectors of its modularity matrix B,
B = A − kk^T / (2|E|),  (5)
Figure 8. Percentage of foreground network found as a function of vehicle tracks examined. Random search is shown in thin blue, BFS in medium green, and degree-weighted BFS in thick red.
Figure 9. Percentage of foreground network found as a function of percentage of background network found. Random search is shown in thin blue, BFS in medium green, and degree-weighted BFS in thick red.
where A is the observed adjacency matrix, k is the vector of
node degrees, and |E| is the total number of edges. The modularity matrix can be interpreted as the difference between the observed and the expected number of edges between any pair of nodes. Performing the eigendecomposition B = UΛU^T provides eigenvectors U as candidate bases. Corresponding
to the algorithm described by Miller et al. [10]–[12], the
principal eigenvectors that have the largest components along
the dimension of the cue node are selected to form the
detection subspace. Nodes, mapped to this subspace, are then
detected by thresholding on their Euclidean norm. By varying
the detection threshold, a receiver operating characteristic (ROC) curve is generated for each constructed graph with a varying number of explored tracks (Figure 10).
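A minimal sketch of this detection step, assuming an unweighted symmetric adjacency matrix and a known cue index. Searching among the ten largest-eigenvalue eigenvectors for those most aligned with the cue is an illustrative simplification, not the exact selection rule of Miller et al. [10]–[12].

```python
import numpy as np

def cued_detection_scores(A, cue, n_vectors=2):
    """Score nodes by their norm in a cue-aligned modularity subspace.

    A is a symmetric 0/1 adjacency matrix; cue is the index of the cue
    node.  Returns one score per node; sweeping a threshold over these
    scores traces out a detection ROC curve.
    """
    k = A.sum(axis=1).astype(float)        # node degrees
    B = A - np.outer(k, k) / k.sum()       # modularity matrix, Eq. (5); k.sum() = 2|E|
    eigvals, U = np.linalg.eigh(B)         # B = U diag(eigvals) U^T
    # Principal (largest-eigenvalue) eigenvectors as candidate bases.
    principal = U[:, np.argsort(eigvals)[::-1][:10]]
    # Keep the vectors with the largest components along the cue's dimension.
    chosen = np.argsort(np.abs(principal[cue, :]))[::-1][:n_vectors]
    subspace = principal[:, chosen]
    return np.linalg.norm(subspace, axis=1)  # Euclidean norm per node
```

Nodes whose score exceeds the chosen threshold are declared foreground.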
For each of the detection ROCs, the top right corner of
the curve represents the detection performance if all explored
Figure 10. Detection ROC on graphs with varying numbers of explored tracks, using degree-weighted breadth-first search. The curves indicate detection performance on graphs with 150 (thin blue), 600 (medium green), and 1000 (thick red) tracks explored.
nodes are declared as the foreground. Starting at the smallest
graph with only 150 explored tracks, 35% of the foreground
network is missed because the entire network cannot be
reached with this relatively small number of tracks. The
optimal performance is seen when 600 tracks (i.e., 0.5% of
the total tracks) are explored, where the detector achieves 87%
PD and 1.5% PFA. At this sampling level, the constructed
graph includes the majority of the foreground network as well
as additional information on the network topology, leading to
better discrimination in the detection step. Beyond 600 tracks,
the overall performance plateaus so that additional exploration
does not yield greater performance. Our detection analysis
highlights the advantage of coupling the detection step with the
exploration step, which achieves higher overall performance.
The detection step provides false alarm mitigation because
many of the explored nodes are actually part of the background
network. The exploration step provides optimal inclusion of
the foreground network while minimizing the size of the
constructed graph, reducing the analyst workload needed for
graph construction.
VII. CONCLUSIONS
This paper presents an end-to-end approach to network discovery from wide-area surveillance data. A traffic
network graph is constructed in a semi-automated fashion
where human analysts are aided with automated algorithms
such as tracking, destination detection, site clustering, graph
exploration, and graph detection. We demonstrate efficient
graph exploration starting from a cue node and good final
detection performance on the simulated insurgent network
dataset comprising 4,478 locations and 116,720 tracks. The
degree-weighted breadth-first search model for node relevancy
is shown to uncover 80% of the foreground network in
approximately 250 tracks, while standard BFS requires nearly 800, a significant threefold potential savings
in human resources. Detection performance on the simulated
foreground graph is shown to be 87% PD and 1.5% PFA.
The strength of this approach centers on the ability to focus
analysis on a very small part of the immense video data
stream. While improvements can be made to every step of the
processing chain, we believe this approach provides a novel and promising paradigm for clandestine network discovery.
REFERENCES

[1] AFRL CLIF 2007 dataset over Ohio State University. <https://www.sdms.afrl.af.mil/datasets/clif2007>.
[2] M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
[3] L. Becchetti, C. Castillo, D. Donato, and A. Fazzone. "A comparison of sampling techniques for web graph characterization," in Proc. Workshop on Link Analysis (LinkKDD), Philadelphia, PA, 2006.
[4] D. M. Blei, A. Ng, and M. I. Jordan. "Latent Dirichlet allocation," Journal of Machine Learning Research 3:993–1022, 2003.
[5] T. S. Ferguson. "A Bayesian analysis of some nonparametric problems," Annals of Statistics 1:209–230, 1973.
[6] C. Godsil and G. Royle. Algebraic Graph Theory. New York: Springer-Verlag, 2001.
[7] R. S. Holt, P. A. Mastromarino, E. K. Kao, and M. B. Hurley. "Information theoretic approach for performance evaluation of multi-class assignment systems," in Proc. Signal Processing, Sensor Fusion, and Target Recognition XIX (SPIE), ed. Ivan Kadar, Orlando, FL, April 2010.
[8] D. Koller and N. Friedman. Probabilistic Graphical Models. Cambridge, MA: The MIT Press, 2009.
[9] M. Kurant, A. Markopoulou, and P. Thiran. "On the bias of BFS (Breadth First Search)," in Proc. 22nd Intl. Teletraffic Congress (ITC), Amsterdam, Netherlands, 2010.
[10] B. A. Miller, M. S. Beard, and N. T. Bliss. "Eigenspace analysis for threat detection in social networks," to appear, Fusion, 2011.
[11] B. A. Miller, N. T. Bliss, and P. J. Wolfe. "Toward signal processing theory for graphs and other non-Euclidean data," in Proc. IEEE Intl. Conf. Acoustics, Speech and Signal Processing, pp. 5414–5417, 2010.
[12] B. A. Miller, N. T. Bliss, and P. J. Wolfe. "Subgraph detection using eigenvector L1 norms," in Proc. Neural Information Processing Systems (NIPS), Vancouver, Canada, 2010.
[13] M. E. J. Newman. Networks: An Introduction. Oxford University Press, 2010.
[14] M. E. J. Newman. "Finding community structure in networks using the eigenvectors of matrices," Phys. Rev. E 74(3), 2006.
[15] H. L. Van Trees. Detection, Estimation, and Modulation Theory, Part 1. New York: John Wiley and Sons, 1968.
[16] J. Xu and H. Chen. "The topology of dark networks," Comm. ACM 51(10):58–65, 2008.