An Empirical Study of Profiling Strategies for
Released Software and their Impact on Testing Activities
Sebastian Elbaum and Madeline Hardojo
Department of Computer Science and Engineering
University of Nebraska - Lincoln
ABSTRACT
An understanding of how software is employed in the field
can yield many opportunities for quality improvements. Pro-
filing released software can provide such an understanding.
However, profiling released software is difficult due to the
potentially large number of deployed sites that must be pro-
filed, the extreme transparency expectations, and the re-
mote data collection and deployment management process.
Researchers have recently proposed various approaches to
tap into the opportunities and overcome those challenges.
Initial studies have illustrated the application of these ap-
proaches and have shown their feasibility. Still, the promise of the proposed approaches, and the tradeoffs between overhead, accuracy, and potential benefits for the testing activity, have barely been quantified. This paper aims to over-
come those limitations. Our analysis of 1200 user sessions
on a 155 KLOC system substantiates the ability of field data
to support test suite improvements, quantifies different ap-
proaches previously introduced in isolation, and assesses the
efficiency of profiling techniques for released software and
the effectiveness of their associated testing efforts.
Categories and Subject Descriptors: D.2.5: Testing
General Terms: Experimentation, Reliability, Verification
Keywords: Profiling, instrumentation, software deployment,
testing, empirical studies.
1. INTRODUCTION
Software test engineers cannot predict, much less exer-
cise, the overwhelming number of potential scenarios faced
by their software. Instead, they allocate their limited re-
sources based on assumptions about how the software will
be employed after release. Yet, the lack of connection be-
tween in-house activities and how the software is employed
in the field can lead to inaccurate assumptions, resulting in
decreased software quality and reliability over the system’s
lifetime. Even if estimations are initially accurate, isolation
from what happens in the field leaves engineers unaware of
future shifts in user behavior or variations due to new envi-
ronments until too late.
Approaches integrating in-house activities with field data
appear capable of overcoming such limitations. These ap-
proaches must profile field data to continually assess and
adapt quality assurance activities, considering each deployed
software instance as a source of information. The increasing
software pervasiveness and connectivity levels of a constantly growing pool of users, coupled with these approaches, offers
a unique opportunity to gain a better understanding of the
software’s potential behavior.
Early commercial efforts have attempted to harness this
opportunity by including built-in reporting capabilities in
deployed applications that are activated in the presence of
certain failures (e.g., Software Quality Agent from Netscape [18], Traceback [13], and the Microsoft Windows Error Reporting
API). More recent approaches, however, are designed not
only to leverage the deployed software instances through-
out their execution, but also to consider the levels of trans-
parency required when profiling users’ sites, the manage-
ment of instrumentation across the deployed instances, and
issues that arise from large-scale data collection. For example:
• The Perpetual Testing project produced the residual
testing technique to reduce the instrumentation based
on previous coverage results [23, 25].
• The EDEM prototype provided a semi-automated way to collect user-interface feedback from remote sites when the observed usage does not meet an expected criterion [11].
• The Gamma project introduced an architecture to dis-
tribute and manipulate instrumentation across deployed
software instances [10, 22].
• The Skoll project presented an architecture and a set of tools to distribute different job configurations across the deployed user sites [27].
Although these efforts present reasonable conjectures, we
have barely begun to quantify their potential benefits and
costs. Most publications have illustrated the application of
isolated approaches, introduced supporting infrastructure, or explored a technique's feasibility under particular scenarios (e.g., [9, 14]). (Previous empirical studies are summarized in Section 2.5.) Given that feasibility has
been shown, we must now quantify the observed tendencies,
investigate the tradeoffs, and explore whether the previous
findings are valid at a larger scale. This paper presents a
family of three empirical studies that quantify the efficiency
and effectiveness of profiling strategies for released software.
The studies also assess several techniques that employ field
data to drive test suite improvements and identify factors
that can affect the potential gains.
In the following section, we abstract the essential attributes
of existing profiling techniques for released software to or-
ganize them in strategies and summarize the results of pre-
vious empirical studies. Section 3 describes the research
questions, object preparation, design and implementation,
metrics, and potential threats to validity. Section 4 presents
results and analysis. Section 5 provides additional discus-
sion and conclusions.
2. PROFILING STRATEGIES FOR RELEASED SOFTWARE
Researchers have enhanced the efficiency of profiling tech-
niques through several mechanisms: (1) performing up-front
analysis to optimize the amount of instrumentation required, (2) sacrificing accuracy by monitoring entities of higher granularity or by sampling program behavior [7, 8], (3) encoding the information to minimize memory and storage requirements [24], and (4) repeatedly targeting the entities that need to be profiled. The enumerated mechanisms
gain efficiency by reducing the amount of instrumentation
inserted in a program through the analysis of the properties
it exhibited in a controlled in-house environment.
Profiling techniques for released software [10, 11, 23, 27],
however, must consider the efficiency challenges and the new
opportunities introduced by a remote and potentially large
user pool. This paper investigates three profiling strategies
designed to work on released software. The strategies ab-
stract the essential elements of existing techniques, helping to provide an integrated background and facilitating
formalization, analysis of tradeoffs, and comparison of ex-
isting techniques and their implementation.
The next section describes the full profiling strategy, which constitutes our baseline; the following three sections describe
the profiling strategies for released software, and Section 2.5
summarizes the related empirical studies.
2.1 Full Profiling
Given a program P and a class of events to monitor C,
this approach generates P′ by incorporating instrumentation code into P to enable the capture of all events in C
and a transfer mechanism T to transmit the collected data
to the organization at the end of each execution cycle (e.g.,
after each operation, when a fatal error occurs, or when the
user terminates a session).
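As a concrete illustration of this strategy (ours, for illustration only, and not the instrumentation used in the study; the decorator, event names, and collection URL below are hypothetical), a deployed instance could wrap every event in C and invoke the transfer mechanism T when the execution cycle ends:

import atexit
import json
import urllib.request

SESSION_EVENTS = []    # data collected during the current execution cycle

def profile(event_name):
    # Instrumentation code: wrap a function so every call is captured.
    def wrap(fn):
        def instrumented(*args, **kwargs):
            SESSION_EVENTS.append(event_name)
            return fn(*args, **kwargs)
        return instrumented
    return wrap

def transfer():
    # Transfer mechanism T: ship the session data back to the organization.
    payload = json.dumps(SESSION_EVENTS).encode()
    req = urllib.request.Request("https://collector.example.org/sessions",  # placeholder URL
                                 data=payload,
                                 headers={"Content-Type": "application/json"})
    try:
        urllib.request.urlopen(req, timeout=5)
    except OSError:
        pass  # data transfer must never disrupt the user's work

atexit.register(transfer)   # end of execution cycle: the user terminates the session

@profile("save_message")    # under full profiling, every event in C is wrapped this way
def save_message(text):
    return len(text)

In practice, as noted above, the transfer could equally be triggered after each operation or when a fatal error occurs rather than at process exit.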
Capturing all the events at all user sites demands exten-
sive instrumentation, increasing program size and execution
overhead (reported to range between 10% and 390% [2, 16]). However, this approach also provides the maximum amount of data for a given set C, serving as a baseline for the analysis of the other three techniques specific to released software.
We investigate the raw potential of field data through the
full technique in Section 4.1.
2.2 Targeted Profiling
Before release, software engineers develop an understand-
ing about the program behavior. Factors such as software
complexity and schedule pressure limit the levels of under-
standing they can achieve. This situation leads to different
levels of certainty about the behavior of program compo-
nents. For example, when testers validate a program with
multiple configurations, they may be able to gain certainty
about a subset of the most used configurations. Some con-
figurations, however, might not be (fully) validated.
To increase profiling transparency in deployed instances,
software engineers may just target the components for which
the behavior is not sufficiently understood. Following our example of a system with multiple configurations,
engineers aware of the overhead associated with profiling
deployed software could aim to profile just those least un-
derstood or most risky configurations.
More formally, given program P, a class of events to pro-
file C, a list of events observed (and sufficiently understood)
in-house Cobserved, where Cobserved ⊂ C, this technique generates a program P′ with additional instrumentation to profile all events in Ctargeted = C − Cobserved. Observe that
|Ctargeted| determines the efficiency of this technique by re-
ducing the necessary instrumentation, but also bounding
what can be learned from the field instances. As the cer-
tainty about the behavior of the system or its components
diminishes, |Cobserved| → 0, this strategy's performance approaches that of the full profiling strategy.
As defined, this strategy includes the residual testing tech-
nique [23], where Cobserved corresponds to statements observed in-house, and it also includes the distribution scheme proposed by Skoll [27], where Ctargeted corresponds to target
software configurations that require field testing.
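For illustration only (the event names are invented and this is not the implementation evaluated in the paper), Ctargeted can be computed from C and Cobserved and instrumentation applied only to the targeted events:

# Hypothetical event class C and the events already observed in-house.
C = {"open_mailbox", "save_draft", "set_filter", "export_addressbook"}
C_observed = {"open_mailbox", "save_draft"}
C_targeted = C - C_observed        # only these events receive instrumentation

FIELD_LOG = []

def maybe_instrument(event_name, fn):
    # Wrap fn only when its event still needs field observation.
    if event_name not in C_targeted:
        return fn                  # no overhead for well-understood events
    def instrumented(*args, **kwargs):
        FIELD_LOG.append(event_name)
        return fn(*args, **kwargs)
    return instrumented

def set_filter(rule):
    return rule.strip()

set_filter = maybe_instrument("set_filter", set_filter)
set_filter(" subject contains report ")
print(FIELD_LOG)                   # ['set_filter']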
2.3 Profiling with Sampling
Statistical sampling is the process of selecting a suitable
part of a population for determining the characteristics of
the whole population. Profiling techniques have adapted
the sampling concept to reduce execution costs by repeat-
edly sampling across space and time. Sampling across space
consists of profiling a subset of the events following a certain
criterion (e.g., hot paths). Sampling across time consists of
obtaining a sample from the population of events at certain
time intervals. Common profiling utilities like gprof follow this approach, stopping execution at fixed intervals to determine where the cycles are being spent [8].
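The sketch below illustrates sampling across time in the same spirit, although it is not gprof itself: a profiling timer periodically interrupts execution and records the active function. It assumes a Unix-like platform where SIGPROF and setitimer are available.

import collections
import signal

SAMPLES = collections.Counter()

def take_sample(signum, frame):
    # Attribute this timer tick to the function executing when it fired.
    SAMPLES[frame.f_code.co_name] += 1

signal.signal(signal.SIGPROF, take_sample)
signal.setitimer(signal.ITIMER_PROF, 0.01, 0.01)   # sample every 10 ms of CPU time

def busy_work(n):
    return sum(i * i for i in range(n))

busy_work(2_000_000)
signal.setitimer(signal.ITIMER_PROF, 0)            # stop sampling
print(SAMPLES.most_common(3))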
When considering released software, we find an additional
sampling dimension: the instances of the program running
in the field. We could, for example, sample across the popu-
lation of deployed instances, profiling the behavior of a group
of users. This is advantageous because it would only add the
profiling overhead to a subset of the population. Still, the
overhead on this subset of instances could substantially af-
fect user’s activities, biasing the collected information.
An alternative sampling mechanism could consider multi-
ple dimensions. For example, we could stratify the popula-
tion of events to be profiled following a given criterion (e.g.,
events from the same functionality) and then sample across
the subgroups of events, generating a version of the program
with enough instrumentation to capture just those sampled
events. Then, by repeating the sampling process, different
versions of P′ can be generated for distribution. Potentially, each deployed instance could then profile a different sampled subset of the events, spreading the overhead across the user pool.
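As an illustration of this multi-dimensional scheme (the strata, event names, and functions below are hypothetical, not taken from the studied system), events can be stratified by functionality, one event sampled per stratum, and the process repeated to obtain differently instrumented versions for distribution:

import random

# Hypothetical strata: events grouped by the functionality they belong to.
STRATA = {
    "compose":  ["open_editor", "attach_file", "send_message"],
    "folders":  ["create_folder", "move_message", "empty_trash"],
    "settings": ["set_filter", "change_signature"],
}

def sample_events(rng):
    # One sampled event per stratum; only these would be instrumented.
    return {stratum: rng.choice(events) for stratum, events in STRATA.items()}

def generate_versions(n_versions, seed=0):
    # Repeating the sampling yields differently instrumented versions P'.
    rng = random.Random(seed)
    return [sample_events(rng) for _ in range(n_versions)]

# e.g., three variants to spread across the deployed instances
for i, version in enumerate(generate_versions(3)):
    print("version", i, sorted(version.values()))

The seed only makes the illustration reproducible; an actual distribution scheme would also have to decide how many deployed instances receive each generated version.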
ACKNOWLEDGMENTS
This work was supported in part by an NSF-ITR program under award 0080898 and a CAREER Award 0347518 to the University of Nebraska-Lincoln. We especially thank the users who volunteered for the study and the participants of the First Workshop on Remote Analysis and Measurement of Software Systems for their feedback on earlier stages of this work. We also thank Gregg Rothermel and Michael Ernst for providing feedback on earlier versions of this paper.
REFERENCES
[1] M. Arnold and B. Ryder. A framework for reducing the cost of instrumented code. In Conference on Programming Language Design and Implementation, pages 168–179, 2001.
[2] T. Ball and J. Larus. Optimally profiling and tracing programs. ACM Transactions on Programming Languages and Systems, 16(4):1319–1360, 1994.
[3] T. Ball and J. Larus. Optimally profiling and tracing programs. In Annual Symposium on Principles of Programming Languages, pages 59–70, Aug. 1992.
[4] B. Calder, P. Feller, and A. Eustace. Value profiling. In International Symposium on Microarchitecture, pages 259–269, Dec. 1997.
[5] S. Elbaum, S. Kanduri, and A. Andrews. Anomalies as precursors of field failures. In International Symposium on Software Reliability Engineering, 2003.
[6] S. Elbaum, A. G. Malishevsky, and G. Rothermel. Test case prioritization: A family of empirical studies. IEEE Transactions on Software Engineering, 28(2):159–182, Feb. 2002.
[7] G. Ammons, T. Ball, and J. Larus. Exploiting hardware performance counters with flow and context sensitive profiling. ACM SIGPLAN Notices, 32(5):85–96, 1997.
[8] S. Graham and M. McKusick. Gprof: a call graph execution profiler. ACM SIGPLAN Notices, 17(6):120–126, 1982.
[9] K. Gross, S. McMaster, A. Porter, A. Urmanov, and L. Votta. Proactive system maintenance using software telemetry. In Workshop on Remote Analysis and Monitoring Software Systems, pages 24–26, 2003.
[10] M. Harrold, R. Lipton, and A. Orso. Gamma: Continuous evolution of software after deployment.
[11] D. Hilbert and D. Redmiles. An approach to large-scale collection of application usage data over the Internet. In International Conference on Software Engineering, pages 136–145, 1998.
[12] D. Hilbert and D. Redmiles. Separating the wheat from the chaff in internet-mediated user feedback.
[13] InCert. Rapid failure recovery to eliminate application downtime. www.incert.com, June 2001.
[14] J. Bowring, A. Orso, and M. Harrold. Monitoring deployed software using software tomography. In Workshop on Program Analysis for Software Tools and Engineering, pages 2–9, 2002.
[15] D. Libes. Exploring Expect: A Tcl-Based Toolkit for Automating Interactive Programs. O'Reilly & Associates, Inc., Sebastopol, CA, Nov. 1996.
[16] B. Liblit, A. Aiken, Z. Zheng, and M. Jordan. Bug isolation via remote program sampling. In Conference on Programming Language Design and Implementation, pages 141–154. ACM, June 2003.
[17] J. Musa. Software Reliability Engineering. McGraw-Hill, New York, NY, 1999.
[18] Netscape. Netscape quality feedback system.
[19] Nielsen. Nielsen net ratings: Nearly 40 million Internet users connect via broadband.
[20] U. of Washington. Pine information center.
[21] A. Orso, T. Apiwattanapong, and M. J. Harrold. Leveraging field data for impact analysis and regression testing. In Foundations of Software Engineering, pages 128–137. ACM, September 2003.
[22] A. Orso, D. Liang, M. Harrold, and R. Lipton. Gamma system: Continuous evolution of software after deployment. In International Symposium on Software Testing and Analysis, pages 65–69, 2002.
[23] C. Pavlopoulou and M. Young. Residual test coverage monitoring. In International Conference on Software Engineering, pages 277–284, May 1999.
[24] S. Reiss and M. Renieris. Encoding program executions. In International Conference on Software Engineering, pages 221–230, May 2001.
[25] D. Richardson, L. Clarke, L. Osterweil, and M. Young. Perpetual testing project.
[26] A. van der Hoek, R. Hall, D. Heimbigner, and A. Wolf. Software release management. In M. Jazayeri and H. Schauer, editors, European Software Engineering Conference, pages 159–175. Springer-Verlag, 1997.
[27] C. Yilmaz, A. Porter, and A. Schmidt. Distributed continuous quality assurance: The Skoll project. In Workshop on Remote Analysis and Monitoring Software Systems, pages 16–19, 2003.