Variable Binned Scatter Plots
Ming C. Hao, Umeshwar Dayal, Ratnesh K. Sharma, Daniel A. Keim, Halldór Janetzko
Hewlett Packard Laboratories, Palo Alto, CA; University of Konstanz, Germany
The scatter plot is a well-known method of visualizing pairs of continuous variables. Scatter plots are
intuitive and easy to use, but often have a high degree of overlap which may occlude a significant portion of the
data. To analyze a dense non-uniform dataset, a recursive drill-down is required for detailed analysis. In this
paper, we propose variable binned scatter plots to allow the visualization of large amounts of data without
overlapping. The basic idea is to use a non-uniform (variable) binning of the x and y dimensions and to plot all
data points that are located within each bin into the corresponding squares. In the visualization, each data point
is then represented by a small cell (pixel). Users are able to interact with individual data points for record level
information. To analyze an interesting area of the scatter plot, the variable binned scatter plots with a refined
scale for the subarea can be generated recursively as needed. Furthermore, we map a third attribute to color to
obtain a visual clustering. We have applied variable binned scatter plots to solve real-world problems in the
areas of credit card fraud and data center energy consumption to visualize their data distributions and cause-
effect relationships among multiple attributes. A comparison of our methods with two recent scatter plot
variants is included.
Keywords: Variable Binned Scatter Plots, Correlations, Patterns, Cause-Effect, Data Distribution
Scatter plots are one of the most powerful tools for data analysis in daily business operations. Analysts face the
challenge of understanding underlying data and finding important relationships from which to draw conclusions,
such as answering questions on how one variable is affected by another. For example, in credit card fraud
analysis, business analysts want to know fraud impact factors (i.e., amount, count, and region) and distributions.
Data center managers want to find the cause-effect of resource consumption to increase energy savings. A
scatter plot of power consumption against temperature can show the impact between these two variables for
administrators to improve cooling efficiency.
Scatter plots are widely used, intuitive, and easy to understand. However, scatter plots often have a high number
of overlapping data points. When there are many data points and significant overlap, scatter plots become less
useful. Besides, in analyzing a very dense dataset, the current scatter plots are too coarse and recursion in certain
areas is needed. For example, the traditional scatter plot in Figure 1A shows 70,465 fraud observations, but only
about 200 distinct data points are visible in the scatter plot, which may mislead the user in judging the density of
the data. In addition, users cannot distinguish the details in the bottom-left corner. There are several approaches
that can be used when this occurs (for details see section 2). But the difficulties still remain, especially when
visualizing very large multi-dimensional high density data sets. Current scatter plots do not provide a complete
picture of the data regarding:
Detailed information at record level.
Overlapping data points.
Distribution and clusters in the high density areas.
In order to visualize the data distributions and discover cause-effect between attributes (variables), we have to
solve the overlap problem. Our solution is the variable binned scatter plot. First, the dataset is binned into
proper value ranges. Then, we use density estimation with distortion techniques to place overlapping data points
that fall within each bin into the corresponding square. The bin size is variable and is computed from the data
density. The degree of variation is optimized based on the number of overlapping data points and the available
Further, our variable binned scatter plot uses the value of a third attribute as the color of the data points. With
the color, data points can be classified into different groups (clusters). This feature is especially useful in placing
overlapping data points by certain categories (e.g., sales regions). Overlapping data points are sorted by the
value of the third attribute and then placed together to form clusters. Binned scatter plots can be extended into a
variable binned scatter plot matrix to display pair-wise relations between multiple attributes.
Each data point is represented by a pixel [10, 11]. Because a pixel is the smallest element on the screen, large
volumes of data points can be displayed in a single view. Variable binned scatter plots are interactive. Analysts
can rubber-band an interesting area and zoom into detailed information. Using a recursive drill down, users are
able to select one or multiple bins in the variable binned scatter plot to generate a new variable binned scatter
plot with refined data value range of x-axis and y-axis for detailed analysis.
Variable binned scatter plots have been applied with success to real-world credit card fraud analysis and data
center thermal management applications. Both applications use variable binned scatter plots to visualize data
distributions and impacts among various factors.
This paper is structured as follows: Section 2 provides an overview of related work. Section 3 introduces the
basic idea of variable binned scatter plots and four basic techniques: pixel cell-based representation, binning, data
point placement and grouping, and recursive scatter plot generation. In section 4, we present application
examples in which real-world data are used to demonstrate the effectiveness of our techniques. An evaluation of
the strengths and weaknesses of our approach versus other variants is presented in section 5.
2. RELATED WORK
The scatter plot is a well-known data analysis method to show how much one variable is affected by another.
Overlap is always a problem in visualizing high density data sets using scatter plots. In 1984, Cleveland 
introduced sunflowers to draw overlapping points and superposition of smoothing methods for enhancing the x-
and y-axes in scatter plots. Cleveland’s ideas are great improvements of scatter plots, but they do not solve the
overlap problem. In 1999, Lee Wilkinson  suggested the usage of semi-transparency to make overlapping
data points partially visible.
In the book by Antony Unwin et al. , a number of interesting visualization techniques were introduced
regarding scatter plots, such as drawing overlapping points with slightly bigger sizes and reducing the x and y axes
by certain factors. JMP 8 Software  generates scatter plots with nonparametric density contours and marginal
distributions to show where the data is most dense. Each contour line in the curved shape encloses 5% of the
data. Carr  uses a hexagonal-shaped symbol whose size increases monotonically as the number of
observations in the associated bin increases, and HexBin scatter plots  determine the brightness value of each
HexBin cell depending on the number of data points in the cell. All three techniques, Unwin’s distortion, Carr’s
binning and the HexBin visualization techniques, are close to the method presented in this paper. Bowman’s
smooth contour scatter plot [2, 3] applies smoothing techniques for data analysis. Bachthaler’s continuous scatter
plots  are different from the above scatter plots. Continuous scatter plots are used for visualizing spatially
continuous input data instead of discrete data values.
The above approaches provide excellent methods for data correlation analysis. However, analysts are not able to
see and access all data points, especially if the third variable mapped to color is of high importance. In this
paper, we introduce the variable binned scatter plot technique with recursive visual analytics. We combine the
best features of the above methods (e.g., binning and zooming) with distortion to find the best placement for the
overlapping data points to enable analysts to quickly discover distribution and clusters. Also, we use color to
visualize the third attribute, while the previous approaches use color to represent density. This feature helps the
users to quickly identify patterns and clusters.
3. OUR APPROACH
3.1 Basic Idea of Variable Binned Scatter Plots
Binning is an efficient approach  to reduce the complexity of large volumes of multi-dimensional data by
dividing the plot area into a number of value ranges. We introduce the concept of the variable binned scatter plot
to manage large volumes of data that overlap. Variable binned scatter plots are derived from traditional scatter
plots to address the issue of overlapping. Variable binned scatter plots group data in a two-dimensional space
based on the densities of pairs of variables. Each data point (xi, yi, zi), for i from 1 to n, consists of a
pair of variables, x and y, and a third variable z. A scatter plot of xi against yi shows the relationship between x and y, and the color
shows the third variable z. Variable binned scatter plots employ color (zi) to cluster related data points.
Figures 1A to 1D illustrate the progression from traditional scatter plots to variable binned scatter plots. The
scatter plot in Figure 1A has many overlapping points. There is no indication as to which data points are
overlapping, potentially resulting in a misleading data representation. In Figure 1B, the overlapping data points
use the color to denote the number of data points which have the same (xi, yi) position, but the plot area is too
dense to see all the colored data points. In the scatter plot in Figure 1C, the first bin is slightly enlarged resulting
in less overlap. More data points become visible, but it is still not possible to see all data points with their
distributions and patterns.
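The overlap count that Figure 1B maps to color can be sketched in a few lines. This is only an illustration: the function name `overlap_counts` and the sample records are hypothetical and are not taken from the fraud dataset.

```python
from collections import Counter


def overlap_counts(points):
    """Count how many records share each exact (x, y) position."""
    return Counter((x, y) for x, y, _z in points)


# Hypothetical records: (fraud count, fraud amount, sales region)
records = [(1, 500, 6), (1, 500, 6), (1, 500, 3), (2, 900, 1), (5, 2500, 2)]

counts = overlap_counts(records)
visible = len(counts)             # distinct positions a plain scatter plot can show
hidden = len(records) - visible   # records drawn underneath another record
print(visible, hidden)
```

In a traditional scatter plot only the `visible` positions can be seen, which is why the density of the data is easily misjudged.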
The variable binned scatter plot in Figure 1D uses a non-uniform (variable) binning of the x and y dimensions and
plots all the data points that fall within each bin into the corresponding square area. These square areas are
scaled to allow each data point to be shown without any overlap. The relative position of a data point within a
bin is retained as accurately as possible. Users are now able to visualize all the data points without losing
information. Users are able to visualize the impact between two variables accurately and quickly, and without
misrepresentations of the data. Variable binned scatter plots enhance the traditional scatter plots in analyzing
very large and dense datasets.
3.2 Construction of Variable Binned Scatter Plots
Figure 2 illustrates a pipeline on how to construct a variable binned scatter plot using the following techniques:
1) Use of pixel cells to represent data points in a binned scatter plot
Variable binned scatter plots use the smallest element on the screen, such as a pixel, to represent a data
point. Analysts are able to view large volumes of data points in a single display. Data points of binned
scatter plots are interactive. Analysts can zoom in on a data point to view specific data attributes.
Intelligent visual queries  are also provided for analysts to select a focused area in a scatter plot.
Automated analysis methods are then applied to identify characteristics of the selected data as well as their
relationships to other attributes and data points.
Figure 1: From Traditional Scatter Plots to Variable Binned Scatter Plots without Overlapping
(x-axis: fraud count, y-axis: fraud amount, color: # of overlapping data points / fraud $ amount)
2) Binning & Distortion
Based on Carr et al.’s definition , binning is an approximation for density of the joint distribution of
two variables. Our variable binned scatter plots use non-uniform bin sizes to display high density areas.
A bin contains the data points which have their (x, y) -coordinates within defined x and y value ranges.
The binning of the x- and y-axes is determined according to the data value ranges which are computed
from the incoming data and their density distribution. When the data size grows, the bin size will be
enlarged accordingly. Therefore, there are no overlapping data points in the variable binned scatter
plots. The following illustrates the overall binning algorithm using a non-uniform graphical density
Determine the density distribution and value ranges in x and y directions.
Assign the number of bins in x and y directions and compute the bin size based on data
density distributions and value ranges.
Determine bin width according to the total window width divided by the number of bins on the x-axis.
Determine bin height for each row according to the maximum number of data points of all
bins in the corresponding row.
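The four steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the text does not say how bin edges are derived from the density distribution, so quantile-based edges are assumed here, and the row height rule (one pixel per data point, enough pixel rows for the fullest bin in the row) is one possible reading of the last step. All function names are hypothetical.

```python
import math
from bisect import bisect_right


def quantile_edges(values, n_bins):
    """Bin edges that hold roughly equal numbers of points (assumed density rule)."""
    s = sorted(values)
    inner = [s[(k * len(s)) // n_bins] for k in range(1, n_bins)]
    return [s[0]] + inner + [s[-1]]


def bin_index(edges, v):
    """Index of the bin containing v; the last bin includes its upper edge."""
    return min(bisect_right(edges, v) - 1, len(edges) - 2)


def bin_layout(points, nx, ny, window_width_px):
    """Return (x_edges, y_edges, counts, bin_width_px, row_heights_px)."""
    # Steps 1 and 2: value ranges and density-driven bin edges (assumption: quantiles)
    x_edges = quantile_edges([p[0] for p in points], nx)
    y_edges = quantile_edges([p[1] for p in points], ny)
    counts = [[0] * nx for _ in range(ny)]
    for p in points:
        counts[bin_index(y_edges, p[1])][bin_index(x_edges, p[0])] += 1
    # Step 3: equal bin widths from the window width
    bin_width = max(1, window_width_px // nx)
    # Step 4: each row is tall enough for its fullest bin, one pixel per point
    row_heights = [max(1, math.ceil(max(row) / bin_width)) for row in counts]
    return x_edges, y_edges, counts, bin_width, row_heights


# Hypothetical data: 100 points on a diagonal, a 10-pixel-wide window, 2 x 2 bins
pts = [(i, i) for i in range(100)]
x_edges, y_edges, counts, width, heights = bin_layout(pts, 2, 2, 10)
```

With this rule every bin offers at least as many pixel cells as it has data points, which is the property that removes overlap.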
In our current application, the bins on the x-axis use equal widths based on the window size. The bins
on the y-axis have different heights according to the maximum number of data points within the bins in
the row. For example, in a fraud dataset, a data point P with (x, y) = (172K, 35M) is positioned within
the bin (20K-525.9K, 1M-66.55M) where x=20K-525.9K and y=1M-66.55M in Figures 3 and 5.
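Locating the bin of a data point is a search over the sorted bin edges, sketched below. The edge lists are assumptions for illustration: the values 1, 100, 20K and 525.9K (x) and 1, 1K, 1M and 66.55M (y) appear as bin boundaries in the text, while the intermediate edges are hypothetical.

```python
from bisect import bisect_right

# Assumed bin edges; intermediate values (1_000 on x, 10_000 on y) are hypothetical
X_EDGES = [1, 100, 1_000, 20_000, 525_900]
Y_EDGES = [1, 1_000, 10_000, 1_000_000, 66_550_000]


def find_bin(edges, value):
    """Return the (low, high) value range of the bin that contains value."""
    if value < edges[0] or value > edges[-1]:
        raise ValueError("value lies outside the binned range")
    i = min(bisect_right(edges, value) - 1, len(edges) - 2)
    return edges[i], edges[i + 1]


# Data point P from the example: (x, y) = (172K, 35M)
p_bin = (find_bin(X_EDGES, 172_000), find_bin(Y_EDGES, 35_000_000))
print(p_bin)  # ((20000, 525900), (1000000, 66550000))
```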
Figure 3: From Traditional Scatter Plots with Overlapping to Variable Binned Scatter Plots without Overlapping
Figure 2: A Variable Binned Scatter Plot Construction Pipeline
3) Placement and Grouping
Variable binned scatter plots place the non-overlapping data points according to their x and y
coordinates within the corresponding bins. The overlapping data points are sorted according to the
value of the third attribute to form groups in two-dimensional space. The placement algorithm uses the
available space around the already occupied data points to compute the best location for the data points
that would otherwise be overlapping in a traditional scatter plot. Data points with the same x and y
coordinates are sorted and placed in a nearby neighborhood according to the similarity of the third attribute.
Figure 3A illustrates 11 data points with the same (x, y) coordinates. The data point P is overlapped by
the data points P1 through P10. Overlap causes two problems in visualizing data distributions and
patterns: (1) the number of overlaid data points is unknown and (2) the value of overlaid data points is
not visible. Figure 3B shows how to place the overlapping data points to form a square group around
the data point P, ordered by the values of the third variable. If a neighboring position is already occupied
(i.e., two data points compete for the same position), then the bin axes are proportionally
enlarged, pushing the already placed data points away along the x (toward the right) and y (toward the
top) directions. As a result of this placement process, a red square for sales region 6 is constructed
from the 11 overlapped data points.
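The grouping of coincident data points around P can be illustrated with a simplified placement routine. Note the substitution: instead of enlarging the bin and pushing occupied points away as described above, this sketch simply takes the next free cell in growing square rings around the original position. The names `ring_offsets` and `place_group` are hypothetical.

```python
def ring_offsets(max_radius):
    """Yield (dx, dy) offsets in growing square rings around (0, 0)."""
    yield (0, 0)
    for r in range(1, max_radius + 1):
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                if max(abs(dx), abs(dy)) == r:
                    yield (dx, dy)


def place_group(center, points, occupied, max_radius=50):
    """Place points that share one (x, y) position into free cells near center.

    points: list of (record_id, z); occupied: set of taken cells (updated in place).
    Points are sorted by z, so similar third-attribute values end up adjacent.
    """
    cx, cy = center
    placed = {}
    cells = ring_offsets(max_radius)  # shared iterator: each cell is tried once
    for rec_id, _z in sorted(points, key=lambda p: p[1]):
        for dx, dy in cells:
            cell = (cx + dx, cy + dy)
            if cell not in occupied:
                occupied.add(cell)
                placed[rec_id] = cell
                break
        else:
            raise RuntimeError("no free cell within max_radius")
    return placed


# Nine hypothetical records from sales region 6 at the same position (10, 10)
group = [("P%d" % i, 6) for i in range(9)]
cells = place_group((10, 10), group, set())
```

Nine coincident records fill the center cell and its eight neighbors, giving a 3 x 3 square of pixel cells that would all receive the color of region 6.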
4) Recursive Variable Binned Scatter Plots
Users can select one or more adjacent bins to construct a subset of the variable binned scatter plot. The
algorithm uses the data points from the selected bins to generate a new variable binned scatter plot for
the selected data points with a new binning adapted to the data distribution of the selected data. The (x,
y) coordinates are defined by the value range of the data points from the selected bins. The color
remains the same. The axis scale is refined based on the new data ranges. Figure 4
shows a sequence of recursive variable binned scatter plots (plots #1, #2 and #3) in a credit card fraud
analysis application (section 4.1), generated by the user rubber-banding a high density area (x=1-100,
y=2K-10K, color=region). In variable binned scatter plot #2, users are able to detect the correlation
between fraud amount and fraud count using the refined scale with fraud amounts in the range 2K,
2.2K, 2.5K, …, 10K. The analyst can further select another group of bins in plot #2 to compare fraud
distribution in the refined data ranges, such as x=2k-5k and y=1-20. The result is shown in plot #3.
Figure 4: A Credit Card Fraud Analysis Variable Binned Scatter Plot
(x-axis: Fraud Count, y-axis: Fraud Amount, Color: Region 1-6)
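One drill-down step of the recursive variable binned scatter plot can be sketched as a filter followed by a recomputation of the value ranges. The function `drill_down` and the sample records are hypothetical; the new binning itself would then be computed from the returned ranges.

```python
def drill_down(points, x_range, y_range):
    """Return the points inside the selected area and their refined value ranges.

    points: list of (x, y, z); the third attribute z (color) is left unchanged.
    """
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    subset = [p for p in points if x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi]
    if not subset:
        return subset, None, None
    xs = [p[0] for p in subset]
    ys = [p[1] for p in subset]
    return subset, (min(xs), max(xs)), (min(ys), max(ys))


# Hypothetical records: (fraud count, fraud amount, sales region)
records = [(5, 2_200, 6), (40, 2_500, 3), (80, 9_000, 6), (150, 4_000, 1), (10, 500, 2)]

# Plot #2: rubber-band the area x=1-100, y=2K-10K
plot2, x_rng2, y_rng2 = drill_down(records, (1, 100), (2_000, 10_000))
# Plot #3: select a smaller area inside plot #2
plot3, x_rng3, y_rng3 = drill_down(plot2, (1, 50), (2_000, 5_000))
```

Because each step works only on the records of the previous selection, the scale of every new plot adapts to the data actually shown.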
4.1 Credit Card Fraud Analysis
Fraud is one of the major problems faced by many companies in the banking, insurance, and telephony
industries. Large volumes of dollars in fraudulent transactions are processed yearly on credit card payments.
Transforming raw transaction data into valuable business intelligence to support fraud analysis will save
companies millions of dollars. Fraud analysis specialists require visual analytics tools that help them to better
understand fraud behavior, geographical locations, and correlated factors as well as identify exceptions.
Typical questions in fraud analysis are:
Q1. What is the fraud distribution and which are the most correlated attributes?
Q2. Are there any outliers and what are their causes?
Q3. Which sales regions and purchase amount have the most fraud?
Plot #1 in Figure 4 shows a binned scatter plot with 70,465 fraud records. Analysts use it to analyze fraud
distributions and correlations among different attributes (i.e., amount, count, and region) to answer the first
question. In a variable binned scatter plot, each fraud data point is represented by a pixel. Because there are no
overlapping data points in a variable binned scatter plot, analysts are able to visualize fraud distribution at each
data point along the x and y directions. The binning of the x and y direction is determined according to the fraud
amount and fraud count. The color of a data point represents the sales region (1-6) where the fraud occurred.
Plot #1 shows that the fraud amount increases almost linearly with fraud count. The fraud amount is highly
impacted by the fraud count. However, there is an outlier at the top left bin (x=1-100, y=1M-66.55M) with a
low fraud count of 5 but a very high fraud amount of $28.107M that occurred in region 6 (red). To answer the
second question, analysts can drill down to the credit card payment, which might indicate a potential problem or error.
Data points that would be overlapped in traditional scatter plots are now represented as clusters distributed
inside a bin. Bin (x=1-100, y=1-1K) has three large clusters from regions 1-5 (orange, yellow, and green).
In order to answer the third question on finding which sales region in which fraud amount range (bin) has the
most fraud, we optimize data point placement, so that data points from the same sales regions (colors) are placed
together (the technique is described in Section 3) for analysts to see the fraud regional distribution. There are
three large clusters (orange, yellow, and green) in bin (x=1-100, y=1-1K). Sales region 1 (green) has the lowest
fraud amount and count. The smallest cluster is sales region 6 (red) but with the highest fraud amount and count.
To find which fraud amount and region have the most fraud, analysts can first select a bin, such as bin (x=20K-525.9K,
y=1M-66.55M) and then recursively drill down to generate the binned scatter plots #2 and #3 with refined
scales. From plot #2 and plot #3, analysts can learn that most of the fraud came from region 6 (red) and the
purchase amount is above $50M. Using the above information, the company is able to place strict controls on
certain sales regions and purchase amounts, such as sales region 6 and purchase amounts above $1M.
4.2 Data Center Thermal Monitoring
Cooling is the major operational cost in a data center. The chiller consumes over 600 KW of power in order to
keep a normal temperature for the daily IT load in a data center with 500 racks and 11 air conditioning units.
Chillers consume power to extract heat from the warm water and provide cold water to the air conditioning units
to keep the data center temperature cool. Visual monitoring of the utilization of chillers and power consumption
and their impacts on temperatures can greatly reduce operating expenses and equipment downtime.
Questions of frequent concern to a data center service manager are:
Q1. What is our daily power consumption? How do we optimize the cooling system performance?
Q2. How is the chiller operating? Are there any problems?
Q3. What are the cause-effects of the power consumption on temperature?
To answer the above three questions, we have used variable binned scatter plots to enable administrators to
visualize the relationships and impacts among these three thermal factors: temperature, power consumption, and
chiller utilization. Figure 5A shows a time series scatter plot. Each data point is represented by a pixel and
defined with three attributes (x, y, color) where the x-axis represents the time line from 9/3 to 9/8. The y-axis
represents temperature. The color of the data point represents the power consumption which runs the chiller,
from low (green) to medium (yellow, orange) to high (red). With variable binned scatter plots, the overlapping
data points are sorted and placed close to their original locations as described in Section 3.
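The mapping of the third attribute to color can be sketched as a lookup over sorted cut points. The cut points below are assumptions: the text only states that green is used under 250 KW, yellow and orange over 300 KW, and red above 500 KW, so the boundary at 250 KW and the yellow/orange split at 400 KW are hypothetical.

```python
from bisect import bisect_right

# Assumed cut points in KW; 400 (yellow vs. orange) is a hypothetical split
CUTS = [250, 400, 500]
COLORS = ["green", "yellow", "orange", "red"]


def power_color(kw):
    """Map a power consumption reading (KW) to a color class, low to high."""
    return COLORS[bisect_right(CUTS, kw)]


# Hypothetical readings for early morning, late morning, afternoon, and peak hours
readings = [200, 320, 450, 520]
print([power_color(kw) for kw in readings])  # ['green', 'yellow', 'orange', 'red']
```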
5A: Data Center Temperature Time Series Visual Analysis using Variable Binned Scatter Plots
(x-axis: days 9/3 to 9/8, y-axis: Temperature °F, color: Power Consumption KW)
5B: Power Consumption Visual Analytics using Recursive Binned Scatter Plots
Generated from the three rubber-band areas in Figure 5A
(x-axis: days 9/3 to 9/5, y-axis: Temperature °F, color: Power Consumption KW)
5C: Visual Analysis of Hourly Temperature and Power Cost-Effect using Recursive Binned Scatter Plots
Generated from the 9/3 time series in Figure 5A
(x-axis: day 9/3, y-axis: Temperature °F, color: Power Consumption KW)
Figure 5: Data Center Thermal Monitoring Variable Binned Scatter Plots
Figure 5A shows that on weekdays (9/3, 9/4, and 9/5), the power consumption is higher during the day and early
evening (more orange and red) than during the early morning and late evening (mostly green). This result helps
the administrators to optimize cooling system performance. During the weekend (9/6 and 9/7), less power is used
(mostly green and yellow). However, an exception is found on 9/8 (Monday): the temperature remains high
(above 62 °F) even though the power consumption is above 500 KW between 10 am and 8 pm.
Recursive Drilldown in Variable Binned Scatter Plots
To answer the second question on how the chiller is operating, the analysts can rubber-band the three high
power consumption areas (on days 9/3 to 9/5, hours 13-18, temperature 52 °F - 55 °F) to generate a recursive plot
from the top graph in Figure 5B.
Figure 5B shows the generated scatter plot, which has a different distance scale. The plot contains three
different power consumption patterns (A, B, and C). In order to keep the high temperature under 54.6 °F, the
administrator has to turn on both chillers with power consumption above 500 KW. Administrators can quickly
find that pattern B consumes more power than patterns A and C (more red data points). In addition,
administrators notice that pattern C has a different power consumption pattern from patterns A and B. Pattern C
has fewer data points under 53 °F. Administrators can use this information to identify the impact of energy efficiency
measures within the data center.
To answer the third question on the cause-effects of the power consumption on temperature, the administrator
can select the entire bin on day 9/3 as shown in Figure 5A and generate a recursive variable binned scatter
plot. The resulting plot is shown in Figure 5C.
The power consumption is higher during the day and early evening (more orange and red) than during the early
morning and late evening (mostly green and yellow). This result helps the administrators to optimize cooling
system performance. From this observation, administrators are able to use less power (green, under 250 KW) in
the early morning and late evening and then gradually increase the power (yellow and orange, over 300 KW)
between 10 am and noon. During the peak hours of 2 pm to 6 pm in particular, power could be increased to more than
500 KW (red) for the chiller to bring the temperature below 55 °F.
There are many well-known variations of the traditional scatter plot that try to solve the overlap problem of
scatter plots. The HexBin  and smoothed contour scatter plots [9, 10] are two recent variants which are also
available in the R statistics software. We address the question: “Can the HexBin and smooth contour scatter
plots achieve the same results as our variable binned scatter plot?”
Figures 6A to 6D show the HexBin and smoothed contour scatter plots with the same number of fraud records
(70,465) and the same data center resource consumption data (43,204) as the variable binned scatter plot shown
in Figures 6E and 6F. An evaluation of the strengths and weaknesses of the three approaches follows.
The strengths of the variable binned scatter plot include:
1. Variable binned scatter plots show fraud and thermal distribution in the high density areas, marked by
the dashed rectangle (Figures 6E and 6F) more clearly than either the HexBin scatter plot (Figures 6A and
6B) or the smooth contour scatter plot (Figures 6C and 6D). Smoothed contour scatter plots show
linearly increasing overlaps with different shades, which is better than HexBin scatter plots.
In most applications, the majority of data points occur in the high density areas. Both variants require
an extra step of zooming into these areas (i.e., dashed rectangles). The variable binned scatter plots
provide a big picture of the entire distribution without additional drilldown. Furthermore, the variable
binned scatter plot can quickly identify clusters as well as reveal hidden structures in the dense areas.
2. Variable binned scatter plots map the value of a third attribute to color in order to visualize the extra
dimension by clustering data points as shown in Figure 6F (i.e., power KW). In the HexBin scatter plot,
it is not possible to use color to represent a different attribute at the same time. Variable binned scatter
plots have one more dimension to use than HexBin and smooth contour scatter plots, allowing the third
attribute to be visible in the same scatter plot.
3. Since the data is aggregated in HexBin and smooth contour scatter plots, it is not possible for users to
interact with a data point for detailed information. All data points in variable binned scatter plots are
accessible and readily viewable.
4. Variable binned scatter plots provide a recursive drill down capability which allows analysts to view
detailed information by using refined scales.
Figure 6: Evaluation of HexBin (6A, 6B) and Smoothed Contour (6C, 6D) Scatter Plots
with Variable Binned Scatter Plots (6E, 6F)
The weaknesses of the variable binned scatter plot include:
1. The HexBin and smooth contour scatter plots show a better trend line than the variable binned scatter
plot. A trend line is visible with the variable binned scatter plots, but requires users to follow the bins
2. The HexBin and smooth contour scatter plots use different shading to visualize data density while the
variable binned scatter plots introduce some distortion to visualize all data points.
In summary, both HexBin and smooth contour scatter plots are able to provide a quick overview of data density
and correlations. Variable binned scatter plots visualize the entire data distribution and also allow access to each
individual data point for users to retrieve information at the record level. These three variants of scatter plots
complement one another.
In this paper, we introduce variable binned scatter plots with recursive drill down for a visual analysis of data
distributions and cause-effect of multi-dimensional data in addition to correlation between pairs of variables.
Variable binned scatter plots resolve the overlapping issue and allow users to visually analyze large data sets at
the record level. A minimal distortion is introduced to provide space for the overlapping data points. We are
able to use color for a third attribute of the data to help analysts quickly identify patterns and clusters. The
recursive drill down capability of the variable binned scatter plots provides detailed information with refined
scales for analysts to further analyze important areas in a plot. An evaluation of the recent HexBin and smooth
contour scatter plots demonstrates the benefits of using variable binned scatter plots to reveal clusters and
distribution in a high density area. Our future work will be in the area of visual prediction using scatter plots to
study the trends.
The authors wish to thank Meichun Hsu for her encouragement, and thank Alex Zhang, Manish Marwah and
Cullen E. Bash for providing comments and suggestions.
[1] Bachthaler, S. and Weiskopf, D. Continuous Scatterplots. IEEE Transactions on Visualization and Computer
Graphics, Vol. 14, No. 6, November/December 2008.
[2] Bowman, A. W. and Azzalini, A. (1997). Applied Smoothing Techniques for Data Analysis: the Kernel
Approach With S-Plus Illustrations. Oxford University Press, Oxford.
[3] Bowman, A. W. and Azzalini, A. (2003). Computational aspects of nonparametric smoothing with
illustrations from the sm library. Computational Statistics and Data Analysis, 42, 545–560.
[4] Carr, D. B., Littlefield, R. J., Nicholson, W. L., and Littlefield, J. S. Scatterplot Matrix Techniques for
Large N. Journal of the American Statistical Association, 82, 424-436, 1987.
[5] Cleveland, W. S. and McGill, R. The Many Faces of a Scatterplot. Journal of the American Statistical
Association, Vol. 79, No. 388 (Dec. 1984), pp. 807-822.
[6] Hao, M., Dayal, U., Keim, D. A., and Morent, D. Intelligent Visual Analytics Queries. IEEE Symposium on
Visual Analytics Science and Technology, pp. 91-98, 2007.
[7] Hao, M. and Dayal, U. Method for visualizing graphical data sets having a non-uniform graphical density for
display. US patent number 7,046,247, issued in May, 2006.
[8] HexBin Scatter Plot released by R System in January, 2009, https://stat.ethz.ch/pipermail/r-help/2009,
[9] JMP 8 Software. www.jmp.com/software, new 64-bit computers and visual analytics tools.
[10] Keim, D. A. Designing Pixel-oriented Visualization Techniques: Theory and Applications. IEEE
Transactions on Visualization and Computer Graphics (TVCG), Vol. 6, No. 1, pp. 59-78, January-
[11] Keim, D. A., Kriegel, H. P., and Ankerst, M. Recursive Pattern: A Technique for Visualizing Very Large
Amounts of Data. Proc. IEEE Visualization, pp. 279-286, 1995.
[12] Unwin, A., Theus, M., and Hofmann, H. Graphics of Large Datasets, Springer, NY, 2006, pp. 39-193.
[13] Wilkinson, L. The Grammar of Graphics, New York, Springer, 1999. Singh, Mala, MrExcel, Ohio, USA.