
Visualization Databases for the Analysis of Large Complex Datasets

Saptarshi Guha

Statistics Dept.

Purdue Univ.

West Lafayette, IN

Paul Kidwell

Statistics Dept.

Purdue Univ.

West Lafayette, IN

Ryan Hafen

Statistics Dept.

Purdue Univ.

West Lafayette, IN

William S. Cleveland

Statistics Dept.

Computer Science Dept.

Purdue Univ.

West Lafayette, IN

Abstract

Comprehensive visualization that preserves the information in a large complex dataset requires a visualization database (VDB): many displays, some with many pages, and with one or more panels per page. A single display using a specific display method results from partitioning the data into subsets, sampling the subsets, and applying the method to each sample, typically one per panel. The time of the analyst to generate a display is not increased by choosing a large sample over a small one. Displays and display viewers can be designed to allow rapid scanning, and often it is not necessary to view every page of a display. VDBs, already successful just with off-the-shelf tools, can be greatly improved by a rethinking of all areas of data visualization in the context of a database of many large displays.

1 Introduction

Large, complex datasets have some of the following properties, often all: a large number of records; many variables; complex data structures not readily put into a tabular form of cases by variables; intricate patterns and dependencies in the data that require complex models and methods of analysis. Our goal, despite the complexity, should be comprehensive study that does not lose important information contained in the data.

Appearing in Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS) 2009, Clearwater Beach, Florida, USA. Volume 5 of JMLR: W&CP 5. Copyright 2009 by the authors.

Nothing serves comprehensive analysis better than data visualization, the only practical way to absorb a large amount of information. This principle has been accepted and practiced for decades. Consider this regression model: the mean of a numeric response is assumed linear in three numeric explanatory variables, and the errors are assumed i.i.d. N(0, σ²). Suppose there are 100 observations of the response and each of the explanatory variables, 400 numeric values altogether. To check the linearity and normality of the model, it is common practice to make, at the very least, a collection of standard displays [1, 2]: a scatterplot matrix of the four variables (1200 plotted points); three partial residual plots, one for each explanatory variable (300 points); three conditioning plots of the response against an explanatory variable conditional on the other two, one for each explanatory variable (300 points); residuals against fitted values (100 points); absolute residuals against fitted values (100 points); a normal quantile plot (100 points); three conditioning plots with residuals in place of the response, one for each explanatory variable (300 points). The number of plotted points is 2400, and each point encodes two numeric values, so 4800 values are displayed. The number of graphed numeric values is thus 12 times the number of numeric values in the data.
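As a check on the arithmetic, the tally of plotted points and displayed values can be sketched; the display names and panel counts simply restate the list above.

```python
# Tally of the standard regression diagnostics described above; panel
# counts follow the text (100 observations, 4 variables, 400 data values).
n = 100
points_per_display = {
    "scatterplot matrix (12 off-diagonal panels)": 12 * n,
    "partial residual plots (3)": 3 * n,
    "conditioning plots of the response (3)": 3 * n,
    "residuals vs fitted": n,
    "absolute residuals vs fitted": n,
    "normal quantile plot": n,
    "conditioning plots of residuals (3)": 3 * n,
}
points = sum(points_per_display.values())
values = 2 * points           # each plotted point encodes two numeric values
ratio = values / 400          # relative to the 400 numeric values in the data
print(points, values, ratio)  # 2400 4800 12.0
```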

In an effort to achieve comprehensive analysis of a large dataset with billions or trillions of observations, we obviously cannot achieve a factor of 12 in data displays. But we can make a large number of displays to pursue comprehensive analysis, many with a large number of pages, each of which can have many panels. The total number of pages might be measured in thousands or tens of thousands. We call the displays a visualization database, or VDB. The name applies equally well to the displays of the above regression example, so the concept of a visualization database is not new.

The difference between the large and the small datasets is that we must partition the data into subsets, sample the subsets, and apply a visualization method to each sample. We subject each sampled subset to the same comprehensive analysis as the small dataset. Of course, the sampling frame must be chosen in a way that characterizes the data. Backing us up are numeric methods that can often be run on all subsets, or on a sample much larger than that used for making displays, that help in making good sampling-frame choices.

We are very optimistic because we have had substantial success in many data analysis and methodological projects, invoking just a few basic concepts for VDBs, a new distributed computing environment described later, new methods for display design and sampling, and off-the-shelf initial solutions for many VDB components. VDB performance can be further increased by a rethinking of all areas involved in data visualization, including the following: methods of display design that enhance pattern perception to enable rapid page scanning; automation algorithms for basic display elements such as the aspect ratio, scales across panels, line types and widths, and symbol types and sizes; methods for subset sampling; viewers designed for multi-panel, multi-page displays that scale across different amounts of physical screen area.

This article discusses the basic concepts for a VDB, its hardware and software components, and the development of methods and algorithms that can improve the performance of VDBs. Readers are encouraged to look at a Web site prepared in coordination with this article that houses a number of VDB managers, some evolving because they involve current projects [3]. We will make reference in this article to three of the VDBs, named surveillance, joke, and connection, that involve the analysis of three datasets.

The surveillance data are the daily counts of chief complaints from emergency departments (EDs) of the Indiana Public Health Emergency Surveillance System (PHESS) [4]. The complaints are divided into eight classifications; one is respiratory. Data for the first EDs go back to November 2004, and new EDs have come online continually since then. There are now 76 EDs in the system. Respiratory complaints for the 30 EDs with the most data are analyzed in [5], and surveillance is the VDB for this analysis.

The Jester project [6, 7] has collected ratings, on a scale of −10 to 10, of 100 jokes from 73,421 raters from April 1999 to May 2003. Our joke dataset has the 14,116 raters who evaluated all jokes. Visualization methods are revealing the properties of the data as a guide to building a statistical model that will allow prediction of the ratings of an individual from a set of 10 gauge jokes.

Much Internet communication consists of connections between two hosts that send packets back and forth, each 1500 bytes or less. The TCP protocol manages the connections for many applications such as web page delivery (http), email (smtp), and encrypted remote login (ssh). Our connection dataset is packet traces for TCP collected on a subnet of the Purdue Statistics Department and organized by connection. The traffic monitor sees all traffic between the two VLANs that make up the subnet, and between the subnet and the outside. The data for each packet are the arrival timestamp and certain information from the TCP and IP headers: source and destination IP addresses (anonymized), ports, and sequence numbers; size of the payload in the packet; values of the flags SYN, FIN, PSH, RST, and ACK; and the ACKed sequence number. Trace collection has been carried out on four separate days for a total of 96 hr. There are 749,128 connections; the binary version of the raw data is 146 gigabytes; this is converted to a distributed Linux flat-file database of 190 gigabytes with 1.49 billion rows, where each row corresponds to a packet. The packets are organized by connection because the research topic is an approach to network security based on analysis of connection properties as a function of time and the logical network topology.

2 Partitions, Analysts, Computing

2.1 Partitions

An important strategy of comprehensive analysis for a large complex dataset is to partition it into small subsets in one or more ways, and apply numeric methods and visualization methods to each of a sample of subsets. Section 1 discussed the large number of plotted points per observation that is common practice for small datasets. Achieving comprehensive analysis of a large dataset requires preserving this for the subsets, analyzing each in detail. The sampling can vary by method; it is common to have numeric methods applied to more subsets than visualization methods. In some cases, sampling can be exhaustive: all subsets. Two non-exhaustive sampling methods, representative and importance, are discussed in Section 3.

Partitioning can be carried out in many different ways. Often, we start with a core partition that arises naturally from the structure of the raw data. This is a soft concept but useful nevertheless. The subsets of the core are then often further partitioned by variables other than those that defined the core.
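The core-then-refined partitioning can be sketched as follows; the record layout (connection id, hour, byte count) and the refining variable are invented for illustration, not taken from the paper.

```python
from collections import defaultdict

def partition(records, key):
    """Group records into subsets keyed by key(record)."""
    subsets = defaultdict(list)
    for r in records:
        subsets[key(r)].append(r)
    return dict(subsets)

# Invented records: (connection_id, hour, nbytes).
records = [(c, h, 100 * c + h) for c in range(3) for h in range(4)]

# Core partition arising from the structure of the raw data (by connection),
core = partition(records, key=lambda r: r[0])
# then each core subset further partitioned by another variable (hour parity).
refined = {cid: partition(subset, key=lambda r: r[1] % 2)
           for cid, subset in core.items()}

print(len(core), sorted(refined[0]))  # 3 [0, 1]
```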


2.2 The Data Analyst

The time for a data analyst to create a display for a single subset by writing commands in the environment used for computing with the data can vary from very small to large. But once the commands are written, a reasonable programming environment results in a negligible additional command-time difference between small and large samples. So in this regard, a large visualization database is not significantly more costly than a small one.

The data analyst spends more time looking at a large sample than at a small sample. To understand the data as a whole, there needs to be a requisite number of subsets, and this can be quite large. But studying displays and thinking about the data, and not the programming language, is time well spent. It does not have to be an undue burden. The implication of a very large number of displayed subsets is that the display resulting from an application of a visualization method will take up a very large amount of virtual screen space, far larger than the available physical screen space, and so must be viewed sequentially. Some displays might be scanned entirely; for others, a small fraction of the pages might suffice. In work described briefly in Section 3, we are developing methods of display design that invoke principles of visual perception to enhance pattern perception and enable rapid page scanning. We are also developing viewers for the sequential task that can reduce the analyst time substantially.

2.3 Computing

Partitioning, because it leads to embarrassingly parallel computation, can benefit immensely from distributed computing environments such as RHIPE [8], a recent merging of the R interactive environment for data analysis [9] and the Hadoop distributed file system and compute engine [10]. This benefits all methods used in the analysis project, both numeric and visual.
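RHIPE itself runs R code over Hadoop; as a language-neutral illustration of why per-subset work is embarrassingly parallel, here is a sketch using a local worker pool. The subsets and the summary statistic are invented; in the real setting the subsets live in a distributed file system and the work is spread over many machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Invented subsets: (key, list of values).
subsets = [(k, list(range(k, k + 5))) for k in range(8)]

def summarize(subset):
    # A numeric method applied independently to one subset: no subset
    # depends on any other, which is what makes this embarrassingly parallel.
    key, xs = subset
    return key, sum(xs) / len(xs)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(summarize, subsets))

print(results[0], results[7])  # 2.0 9.0
```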

3 Sampling, Trellis, Three VDBs

In representative sampling, survey variables are defined that measure properties of the subsets. A subset sampling frame is chosen to encompass the multidimensional space of the survey variables and to spread out the points in the space in a uniform way by some definition. In importance sampling, samples have values of the variables that lie in a specific region of importance of the multidimensional space of the variables.

Partitioning for data visualization has been going on for small datasets for some time and is the stimulus for trellis display, a framework for visualization that provides conditioning plots: displays of a set of variables conditional on the values of other variables [11]. The trellis framework works well for the partitioning of large complex datasets, creating display documents with pages, and with panels on each page in a rectangular array. Often, each panel shows a core subset, but in some cases a single core subset can spread out across many panels when additional variables partition the core further.

3.1 Joke Dataset

Two core partitions have been used, by joke and by rater, resulting in 100 and 73,421 subsets respectively. For the by-joke partition, the sampling for all numeric and visualization methods is exhaustive. For the by-rater partition, sampling for numeric methods is exhaustive, and sampling for visualization methods is representative. All models explored for the data have a rater location effect because there are hard graders and easy graders; the estimates of the location effect are one of our survey variables. One diagnostic display of any model is a normal quantile plot of residuals for each rater. Our representative sampling frame for this method is 4000 raters selected so that the rater-effect estimates are as close to uniformly spaced as possible from the minimum to the maximum estimate.
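One way to realize "as close to uniformly spaced as possible" is to lay an even grid from the minimum to the maximum estimate and take the nearest estimate to each grid point. This is a sketch under that assumption; the paper does not specify the exact selection algorithm, and a real frame would pick raters without replacement.

```python
import bisect

def representative_sample(estimates, n):
    """Pick n of the sorted estimates, each nearest to a point of an even
    grid on [min, max] (duplicates possible in this simplified version)."""
    xs = sorted(estimates)
    lo, hi = xs[0], xs[-1]
    chosen = []
    for i in range(n):
        target = lo + (hi - lo) * i / (n - 1)
        j = min(bisect.bisect_left(xs, target), len(xs) - 1)
        if j > 0 and target - xs[j - 1] < xs[j] - target:
            j -= 1                      # left neighbor is closer
        chosen.append(xs[j])
    return chosen

# Integer stand-ins for rater-effect estimates:
print(representative_sample(list(range(-200, 201)), 5))
# [-200, -100, 0, 100, 200]
```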

Figure 1 is the first page of a 100-page, 4000-panel trellis display of the representative sample for the normal quantile method. Each panel is a normal quantile plot of the residuals for one rater and a model fitted to a logistic transformation of the ratings rescaled to the interval [0, 1]. The line on each plot goes through the two quartile points of the display. As we go left to right, bottom to top, and through the pages of the display, there is an increase in the estimates of the rater location effect, shown in the lower right of each panel. The strip labels, which could have shown these values, have been eliminated to save space. The model in this case is very simple: additive main effects of jokes and raters, and error terms with identical normal distributions with mean zero.

The resultant-vector banking aspect ratio (see Section 5) of a quantile plot is 1, a quantile plot can fit in a relatively small space, and the aspect ratios of our screens are 0.625, so we make 4000 plots: 100 pages, each with 8 columns and 5 rows of panels, and each page nearly fills our smallest physical screen since 5/8 = 0.625 (see Section 4).
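The layout arithmetic can be sketched as a toy calculation (not the paper's algorithm): with square panels, a page of r rows and c columns has aspect ratio r/c, so the reduced fraction 5/8 gives the smallest grid matching a 0.625 screen.

```python
from fractions import Fraction

n_panels = 4000
screen_aspect = Fraction(5, 8)      # height/width = 0.625

# Smallest grid of square panels whose page aspect matches the screen:
rows, cols = screen_aspect.numerator, screen_aspect.denominator
per_page = rows * cols              # 40 panels per page
pages = n_panels // per_page        # 100 pages
print(rows, cols, per_page, pages)  # 5 8 40 100
```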


[Figure 1 image: 8 x 5 grid of panels; x axis Normal Quantile, y axis Residual Rating (logistic scale); the number in the lower right of each panel is the rater-effect estimate, increasing from −1.78 to −0.93 across the page.]

Figure 1: First page of a 100-page trellis display of normal quantile plots by rater. There are 8 columns and 5 rows of panels per page.

For this plot, the 100 pages can be visually scanned in a few minutes because our visual systems can effortlessly detect departures of the plotted points on a panel from the line. Across the plot, as on the first page, the line follows the pattern of the points, which means the normal distribution is a good approximation of the errors.

3.2 Surveillance Dataset

For the surveillance data, the core partitioning is by emergency department (ED). Because the dataset size is moderate, the partition sampling is exhaustive for both numeric and visualization methods. Developing models and testing results depended heavily on the visualization methods that populated the VDB, many applied to each ED separately, but also some showing results on different panels across EDs.

Figure 2 shows the top two panels of Page 1 of a 90-page, 540-panel trellis display; in the full display there is 1 column and 6 rows on each page. Figure 3 shows the top two panels of Page 2. In the project, new methods for modeling respiratory counts for each ED are based on STL, the nonparametric seasonal-trend numeric decomposition procedure. Square-root counts for each ED are decomposed into interannual, yearly-seasonal, day-of-the-week, and random-error components. Using this decomposition method, a new synoptic-scale (days to weeks) numeric outbreak detection method was developed. The STL detection method, and a version using a lesser amount of data, STL(90), were tested along with four existing and widely known methods: GLM, C1, C2, C3. Each method was tested on each ED count series. An outbreak occurrence was added on a particular day, the outbreak methods applied, and detect or non-detect within 14 days recorded. This was done for each ED on each day separately, starting with the 366th day of data. There are 3 different outbreak magnitudes (2, 1.5, and 1) and 30 EDs (with anonymized names such as Act). With 6 detection methods, 3 outbreak magnitudes, and 30 EDs, there are 540 = 6 × 3 × 30 outbreak test sequences across time. Each panel shows one test sequence for one method, one magnitude, and one ED.
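The design of the testing experiment can be sketched as follows. The detector below is an invented stub, not STL, STL(90), GLM, C1, C2, or C3, and the ED names and count series are toy stand-ins.

```python
from itertools import product

methods = ["STL", "STL(90)", "GLM", "C1", "C2", "C3"]
magnitudes = [2, 1.5, 1]
eds = ["ED%02d" % i for i in range(30)]   # stand-ins for anonymized names like Act

sequences = list(product(methods, magnitudes, eds))
assert len(sequences) == 540              # 6 x 3 x 30 test sequences

def detect_within_14_days(series, start_day, magnitude):
    # Stub detector: flag if any of the 14 days starting at start_day
    # exceeds the series mean by `magnitude` units.
    mean = sum(series) / len(series)
    return any(x > mean + magnitude for x in series[start_day:start_day + 14])

# One toy test sequence: inject an outbreak, then test each day from day 366 on.
series = [10.0] * 800
series[500] += 3.0
detections = [detect_within_14_days(series, d, 2)
              for d in range(366, len(series) - 14)]
print(sum(detections))  # the 14-day windows starting on days 487..500 see the spike
```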

The diagnostic display shown partially in Figures 2 and 3 reveals the effect of the seasonal component on detection performance. The strip labels to the left of each panel show the outbreak method, magnitude, and ED. Outbreak method changes the fastest: each page has 6 panels showing the 6 methods for one combination of magnitude and ED. Magnitude changes next fastest; on page one, at the top of the figure, the value is 2; on page two, at the bottom of the figure, it is 1.5; and on page three (not shown), it is 1. ED changes the slowest, so pages one to three are Act, pages four to six are the next ED, and so forth.

The curve formed by the bottoms of the blue lines and the tops of the red lines on each panel is the all-data STL seasonal component; this component is used because C1, C2, and C3 do not involve a decomposition, and of the three remaining methods, which do use a decomposition, the all-data STL component does the best job of tracking the yearly seasonal pattern. Each vertical line emanating from the curve shows the result of starting an outbreak on the day at which the line is drawn and determining whether detection occurred on or before the 14th day of the outbreak; red is detected, and blue is not detected.

Much can be seen from this display about detection performance and how it changes with the behavior of the yearly seasonal component. All methods, as expected, detect more frequently for magnitude 2 than 1.5. STL is the best performer. STL failure to detect occurs most frequently during periods of decline in the seasonal component, because the STL seasonal on the last day of data does not keep up with the decline, which reduces the apparent ramping up of the outbreak.

3.3 Connection Dataset

The core partitioning is by connection, so there are 749,128 subsets. Analysis associated with the connection VDB focuses on a numeric statistical method: a rules-based statistical algorithm (RBSA) that classifies each packet of a connection as a client keystroke from an ssh connection or not. The algorithm uses the timestamps, packet payload sizes, and flags in both directions to carry out the classification. The goal for network security is to classify the whole connection as interactive ssh or not. The algorithm is very accurate at the packet level but does have some misclassifications; we found that classifying the connection as in-


[Figure 2 image: two panels with strip labels C1 / 2 / Act and STL / 2 / Act; x axis Time (years), 2006 to 2008.]

Figure 2: Top two panels of six of Page 1 of a 90-page, 540-panel trellis display.

[Figure 3 image: two panels with strip labels C1 / 1.5 / Act and STL / 1.5 / Act; x axis Time (years), 2006 to 2008.]

Figure 3: Top two panels of six of Page 2 of a 90-page, 540-panel trellis display.