An Empirical Study of Profiling Strategies for
Released Software and their Impact on Testing Activities
Sebastian Elbaum and Madeline Hardojo
Department of Computer Science and Engineering
University of Nebraska - Lincoln
An understanding of how software is employed in the field
can yield many opportunities for quality improvements. Pro-
filing released software can provide such an understanding.
However, profiling released software is difficult due to the
potentially large number of deployed sites that must be pro-
filed, the extreme transparency expectations, and the re-
mote data collection and deployment management process.
Researchers have recently proposed various approaches to
tap into the opportunities and overcome those challenges.
Initial studies have illustrated the application of these ap-
proaches and have shown their feasibility. Still, the promis-
ing proposed approaches, and the tradeoffs between over-
head, accuracy, and potential benefits for the testing activ-
ity have been barely quantified. This paper aims to over-
come those limitations. Our analysis of 1200 user sessions
on a 155 KLOC system substantiates the ability of field data
to support test suite improvements, quantifies different ap-
proaches previously introduced in isolation, and assesses the
efficiency of profiling techniques for released software and
the effectiveness of their associated testing efforts.
Categories and Subject Descriptors: D.2.5: Testing
General Terms: Experimentation, Reliability, Verification
Keywords: Profiling, instrumentation, software deployment,
testing, empirical studies.
1. INTRODUCTION
Software test engineers cannot predict, much less exer-
cise, the overwhelming number of potential scenarios faced
by their software. Instead, they allocate their limited re-
sources based on assumptions about how the software will
be employed after release. Yet, the lack of connection be-
tween in-house activities and how the software is employed
in the field can lead to inaccurate assumptions, resulting in
decreased software quality and reliability over the system’s
lifetime. Even if estimations are initially accurate, isolation
from what happens in the field leaves engineers unaware of
future shifts in user behavior or variations due to new
environments until too late.

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
ISSTA’04, July 11–14, 2004, Boston, Massachusetts, USA.
Copyright 2004 ACM 1-58113-820-2/04/0007 ...$5.00.

Approaches integrating in-house activities with field data
appear capable of overcoming such limitations. These ap-
proaches must profile field data to continually assess and
adapt quality assurance activities, considering each deployed
software instance as a source of information. The increasing
pervasiveness of software and the connectivity levels¹ of a
constantly growing pool of users, coupled with these approaches, offer
a unique opportunity to gain a better understanding of the
software’s potential behavior.
Early commercial efforts have attempted to harness this
opportunity by including built-in reporting capabilities in
deployed applications that are activated in the presence of
certain failures (e.g., Software Quality Agent from Netscape,
Traceback, Microsoft Windows Error Reporting
API). More recent approaches, however, are designed not
only to leverage the deployed software instances through-
out their execution, but also to consider the levels of trans-
parency required when profiling users’ sites, the manage-
ment of instrumentation across the deployed instances, and
issues that arise from large-scale data collection. For example:
• The Perpetual Testing project produced the residual
testing technique to reduce the instrumentation based
on previous coverage results [23, 25].
• The EDEM prototype provided a semi-automated way
to collect user-interface feedback from remote sites when
it does not meet an expected criterion.
• The Gamma project introduced an architecture to
distribute and manipulate instrumentation across deployed
software instances.
• The Skoll project presented an architecture and a set
of tools to distribute different job configurations across
Although these efforts present reasonable conjectures, we
have barely begun to quantify their potential benefits and
costs. Most publications have illustrated the application of
isolated approaches, introduced supporting infrastructure,
or explored a technique’s feasibility under particular
scenarios (e.g., [9, 14]). (Previous empirical studies
are summarized in Section 2.5.) Given that feasibility has
been shown, we must now quantify the observed tendencies,
investigate the tradeoffs, and explore whether the previous
findings are valid at a larger scale. This paper presents a
family of three empirical studies that quantify the efficiency
and effectiveness of profiling strategies for released software.
The studies also assess several techniques that employ field
data to drive test suite improvements and identify factors
that can affect the potential gains.

¹ 35 million Americans had broadband Internet access in

In the following section, we abstract the essential attributes
of existing profiling techniques for released software to or-
ganize them in strategies and summarize the results of pre-
vious empirical studies. Section 3 describes the research
questions, object preparation, design and implementation,
metrics, and potential threats to validity. Section 4 presents
results and analysis. Section 5 provides additional discus-
sion and conclusions.
2. PROFILING STRATEGIES FOR RELEASED SOFTWARE
Researchers have enhanced the efficiency of profiling tech-
niques through several mechanisms: (1) performing up-front
analysis to optimize the amount of instrumentation required,
(2) sacrificing accuracy by monitoring entities of higher
granularity or by sampling program behavior [7, 8], (3)
encoding the information to minimize memory and storage
requirements, and (4) repeatedly targeting the entities
that need to be profiled. The enumerated mechanisms
gain efficiency by reducing the amount of instrumentation
inserted in a program through the analysis of the properties
it exhibited in a controlled in-house environment.
Profiling techniques for released software [10, 11, 23, 27],
however, must consider the efficiency challenges and the new
opportunities introduced by a remote and potentially large
user pool. This paper investigates three profiling strategies
designed to work on released software. The strategies ab-
stract the essential elements of existing techniques helping
to provide an integrated background, and also facilitating
formalization, analysis of tradeoffs, and comparison of ex-
isting techniques and their implementation.
The next section describes the full strategy which con-
stitutes our baseline, the following three sections describe
the profiling strategies for released software, and Section 2.5
summarizes the related empirical studies.
2.1 Full Profiling
Given a program P and a class of events to monitor C,
this approach generates P′ by incorporating instrumentation
code into P to enable the capture of all events in C
and a transfer mechanism T to transmit the collected data
to the organization at the end of each execution cycle (e.g.,
after each operation, when a fatal error occurs, or when the
user terminates a session).
Capturing all the events at all user sites demands exten-
sive instrumentation, increasing program size and execution
overhead (reported to range from 10% to 390% [2, 16]).
However, this approach also provides the maximum amount
of data for a given set C, serving as a baseline for the analysis
of the other three techniques, which are specific to released software.
We investigate the raw potential of field data through the
full technique in Section 4.1.
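The full strategy can be illustrated with a minimal Python sketch. It assumes that the monitored events C are function entries and that the transfer mechanism T is a plain callback; the class `FullProfiler`, the `probe` decorator, and the two application functions are hypothetical names introduced only for this illustration, not part of the instrumentation used in the study.

```python
import functools
from collections import Counter


class FullProfiler:
    """Records every monitored event (here: function entries) in a session."""

    def __init__(self, transfer):
        self.counts = Counter()
        self.transfer = transfer  # hypothetical stand-in for the mechanism T

    def probe(self, func):
        """Wrap func with a probe that records each call."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            self.counts[func.__name__] += 1
            return func(*args, **kwargs)
        return wrapper

    def end_session(self):
        """Transmit the collected data and reset for the next session."""
        payload = dict(self.counts)
        self.counts.clear()
        self.transfer(payload)
        return payload


received = []  # stands in for the organization's repository
profiler = FullProfiler(transfer=received.append)


@profiler.probe
def open_folder(name):
    return f"opened {name}"


@profiler.probe
def compose_message():
    return "draft"


open_folder("inbox")
open_folder("sent")
compose_message()
profiler.end_session()
```

Because every function carries a probe, the size and run-time cost of the instrumented program grow with the number of events in C, which is the overhead the remaining strategies try to reduce.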
2.2 Targeted Profiling
Before release, software engineers develop an understand-
ing about the program behavior. Factors such as software
complexity and schedule pressure limit the levels of under-
standing they can achieve. This situation leads to different
levels of certainty about the behavior of program compo-
nents. For example, when testers validate a program with
multiple configurations, they may be able to gain certainty
about a subset of the most used configurations. Some con-
figurations, however, might not be (fully) validated.
To increase profiling transparency in deployed instances,
software engineers may just target the components for which
the behavior is not sufficiently understood. Continuing
our example of software with multiple configurations,
engineers aware of the overhead associated with profiling
deployed software could aim to profile just the least
understood or riskiest configurations.
More formally, given a program P, a class of events to profile
C, and a list of events observed (and sufficiently understood)
in-house C_observed, where C_observed ⊂ C, this technique
generates a program P′ with additional instrumentation to
profile all events in C_targeted = C − C_observed. Observe that
|C_targeted| determines the efficiency of this technique by
reducing the necessary instrumentation, but it also bounds
what can be learned from the field instances. As the certainty
about the behavior of the system or its components
diminishes, |C_observed| → 0, and this strategy’s performance
approaches that of full profiling.
As defined, this strategy includes the residual testing
technique, where C_observed corresponds to statements
observed in-house, and it also includes the distribution scheme
proposed by Skoll, where C_targeted corresponds to target
software configurations that require field testing.
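The definition of the targeted event set reduces to a set difference, sketched below in Python. The event names and the helper `targeted_events` are hypothetical and serve only to illustrate the definition.

```python
def targeted_events(all_events, observed_in_house):
    """Return C_targeted = C - C_observed for the given event sets."""
    all_events = set(all_events)
    observed = set(observed_in_house)
    if not observed <= all_events:
        raise ValueError("C_observed must be a subset of C")
    return all_events - observed


# Hypothetical event class C and in-house observations C_observed.
C = {"read_mail", "send_mail", "filter_setup", "ssl_login", "role_switch"}
C_observed = {"read_mail", "send_mail"}

C_targeted = targeted_events(C, C_observed)

# Fraction of probes saved relative to full profiling.
probe_savings = 1 - len(C_targeted) / len(C)
```

With an empty C_observed the function returns all of C, which corresponds to the limiting case in which the strategy degenerates into full profiling.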
2.3 Profiling with Sampling
Statistical sampling is the process of selecting a suitable
part of a population for determining the characteristics of
the whole population. Profiling techniques have adapted
the sampling concept to reduce execution costs by repeat-
edly sampling across space and time. Sampling across space
consists of profiling a subset of the events following a certain
criterion (e.g., hot paths). Sampling across time consists of
obtaining a sample from the population of events at certain
time intervals. Common profiling utilities like gprof follow
this approach, stopping execution at fixed intervals to
determine where the cycles are being spent.
When considering released software, we find an additional
sampling dimension: the instances of the program running
in the field. We could, for example, sample across the popu-
lation of deployed instances, profiling the behavior of a group
of users. This is advantageous because it would only add the
profiling overhead to a subset of the population. Still, the
overhead on this subset of instances could substantially affect
users’ activities, biasing the collected information.
An alternative sampling mechanism could consider multi-
ple dimensions. For example, we could stratify the popula-
tion of events to be profiled following a given criterion (e.g.,
events from the same functionality) and then sample across
the subgroups of events, generating a version of the program
with enough instrumentation to capture just those sampled
events. Then, by repeating the sampling process, different
versions of P′ can be generated for distribution. Potentially,
each user could obtain a slightly different version that aims
to capture a particular sample of events at each site.²
More formally, given a program P and a class of events to
monitor C, the stratified sampling strategy identifies s strata
C_1, C_2, ..., C_s, selects a total of N events from C by
randomly picking n_i events per stratum, where i varies from 1 to
s and n_i is proportional to the stratum size, and generates
P′ to capture the selected events. As N gets smaller, P′
contains less instrumentation, enhancing transparency but
possibly sacrificing accuracy. By repeating the sampling and
generation process, P″, P‴, ..., P^m are generated, resulting
in versions with various suitable instrumentation patterns
available for deployment.³
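The stratified sampling step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the strata, the event names, the rounding rule for n_i, and the floor of one event per non-empty stratum are all choices made for the example and are not taken from the study.

```python
import random


def stratified_sample(strata, n_total, rng):
    """Pick about n_total events, with n_i proportional to each stratum size."""
    population = sum(len(events) for events in strata.values())
    selected = set()
    for events in strata.values():
        if not events:
            continue
        # Proportional allocation. Rounding and the floor of one event per
        # stratum are assumptions, so the total can differ slightly from N.
        n_i = min(len(events), max(1, round(n_total * len(events) / population)))
        selected.update(rng.sample(events, n_i))
    return selected


# Hypothetical strata: events grouped by functionality.
strata = {
    "mail": [f"mail_{i}" for i in range(6)],
    "folders": [f"folder_{i}" for i in range(4)],
    "config": [f"config_{i}" for i in range(2)],
}

# Repeating the process yields differently instrumented program versions.
versions = [stratified_sample(strata, 6, random.Random(seed)) for seed in range(3)]
```

Each element of `versions` is the set of events one deployed version would be instrumented to capture; distributing different versions to different sites spreads the profiling load across the user pool.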
There is an important tradeoff between the event sample
size N and the number of deployed instances. Maintaining
N constant while the number of instances increases results in
constant profiling transparency across sites. As the number
of deployed instances increases, however, the level of over-
lap in the collected data across deployed sites is also likely
to increase. We could then leverage this overlap to reduce
N, gaining transparency at each deployed site by collect-
ing less data while compensating by profiling more deployed
instances. Section 4.2 investigates this alternative.
Also note that, as defined, our sampling strategy provides
a statistical perspective for various algorithms implemented
by the Gamma research effort [14, 22], incorporating and
formalizing the procedure to jointly sample deployed sites
and events, but not considering the redistribution of versions
2.4 Trigger Data Transfers on Anomalies
Transferring data from deployed sites to the organization
can be costly for both parties. For the user, data transfer im-
plies at least additional computation cycles to marshal and
package data, bandwidth to actually perform the transfer,
and a likely decrease in transparency. For an organization
with thousands of deployed software instances, data collec-
tion might become a bottleneck. Even if this obstacle could
be overcome with additional collection devices (e.g., a clus-
ter of collection servers), processing and maintaining such a
data set could prove expensive. Triggering data transfers in
the presence of anomalies can help to reduce these costs.
Employing anomaly detection implies the existence of a
baseline behavior considered nominal or normal. When tar-
geting released software, the nominal behavior is defined by
what the engineers know or understand. For example,
engineers could define an operational profile based on the
execution probability exhibited by a set of beta testers. A
copy of this operational profile could be embedded into the
released product so that deviations from its values trigger a
data transfer. Sessions that fit within the operational profile
would only send a confirmation to the organization, increasing
the confidence in the estimated profile; sessions that fall
outside a specified operational profile range are completely
transferred to the organization for further analysis (e.g.,
determine whether the anomaly indicates a potential problem,
update the profile if the anomaly was due to an incomplete

² We could also perform the complement by stratifying the
user population and proceeding as specified. However,
finding subgroups of users might be difficult in the presence of
new or shifting user populations.
³ In this work we focused on two particular dimensions:
space and deployed instances. However, note that sampling
techniques on time could be applied on the released instances
utilizing the same mechanism.

We now define the approach more formally. Given a program
P, a class of events to monitor C, an in-house characterization
of those events C_house, and a tolerance to deviations
from the in-house characterization C_houseTolerance, this
technique generates a program P′ with additional instrumentation
to monitor events in C, and a detection algorithm to
identify when field behavior C_field deviates from [C_house ±
C_houseTolerance]. When such a deviation is detected, session
data is transferred to the organization. Note that this
definition of trigger by anomaly includes the type of behavioral
“mismatch” trigger mechanism used by EDEM, obtained by
making C_houseTolerance = 0.
There are many interesting tradeoffs in defining and
detecting deviations from C_house. For example, there is a
tradeoff between the level of investment in the in-house
software characterization and the number of false negatives
reported from the field. Also, there is a myriad of algorithms
to detect anomalies, trading detection sensitivity and
effectiveness with execution overhead. We investigate some of
these tradeoffs in Section 4.3.
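One simple way to realize the trigger defined in this section is sketched below in Python. It assumes that C_house is an operational profile of per-event execution probabilities and that a session deviates when any event's observed relative frequency falls outside [C_house ± C_houseTolerance]. The function names, the message format, and the profile values are hypothetical, and this detection rule is only one of the many possible algorithms.

```python
def deviates(house_profile, tolerance, session_counts):
    """True if any event frequency leaves [house - tol, house + tol]."""
    total = sum(session_counts.values())
    if total == 0:
        return False  # assumption: an empty session is not an anomaly
    for event in set(house_profile) | set(session_counts):
        field_p = session_counts.get(event, 0) / total
        house_p = house_profile.get(event, 0.0)
        if abs(field_p - house_p) > tolerance:
            return True
    return False


def end_of_session_message(house_profile, tolerance, session_counts):
    """Send full data on anomalies, otherwise only a confirmation."""
    if deviates(house_profile, tolerance, session_counts):
        return {"type": "full_session", "data": dict(session_counts)}
    return {"type": "confirmation"}


# Hypothetical in-house operational profile and tolerance.
C_house = {"read": 0.6, "send": 0.3, "filter": 0.1}
C_house_tolerance = 0.1

typical = end_of_session_message(
    C_house, C_house_tolerance, {"read": 6, "send": 3, "filter": 1})
unusual = end_of_session_message(
    C_house, C_house_tolerance, {"read": 2, "filter": 8})
```

Setting the tolerance to 0 makes any session that departs from the in-house profile trigger a transfer, which mirrors the mismatch-style trigger described above.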
2.5 Previous Empirical Studies
The efforts to develop profiling techniques for released
software have been supported by different mechanisms.
Hilbert and Redmiles [11, 12] utilized scenarios to demon-
strate the concept of internet mediated feedback. The sce-
narios illustrated how software engineers could improve user
interfaces through the collected information. The scenarios
reflected the authors’ experiences in a real context, but they
did not constitute empirical studies.
The Skoll group has also begun to
study the feasibility of their infrastructure on two large open
source projects (ACE and TAO). The feasibility studies
reported on the infrastructure’s capability in detecting fail-
ures in several configuration settings. Again, no empirical
evaluation has yet been made available.
Pavlopoulou and Young empirically evaluated the effi-
ciency gains of the residual testing technique on programs of
up to 4KLOC. The approach considered instrumentation
probe removal, where a probe is the snippet of code
incorporated into P to profile a single event. Executed probes
were removed after each test was executed, showing that in-
strumentation overhead to capture coverage can be greatly
reduced under certain conditions (e.g., similar coverage pat-
terns across test cases, non-linear program structure). Al-
though incorporating field data into this process was dis-
cussed, this aspect was not empirically evaluated.
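The probe removal idea behind residual testing can be illustrated with a short Python sketch. It assumes one probe per monitored entity and a known coverage set per test; the function `residual_probes` and the sample data are hypothetical and do not reproduce the evaluated implementation.

```python
def residual_probes(initial_probes, test_coverages):
    """Remove each probe once it has executed; track how many remain."""
    remaining = set(initial_probes)
    history = [len(remaining)]
    for covered in test_coverages:
        remaining -= set(covered)  # executed probes are no longer needed
        history.append(len(remaining))
    return remaining, history


# Hypothetical probes and per-test coverage.
probes = {"f1", "f2", "f3", "f4", "f5"}
coverage_per_test = [{"f1", "f2"}, {"f2", "f3"}, {"f1"}]

remaining, history = residual_probes(probes, coverage_per_test)
```

The `history` list shows why similar coverage patterns across tests help: once the shared entities are covered, later tests run with few or no probes on their paths.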
The Gamma group performed at least two empirical stud-
ies to validate the efficiency of their distributed instrumen-
tation mechanisms. Bowring et al.  studied the variation
in the number of instrumentation probes and interactions,
and the coverage accuracy for two deployment scenarios.
For these scenarios, they employed a 6KLOC program and
simulated users with synthetic profiles. A second study by
the same group of researchers, led by Orso, employed
the created infrastructure to deploy a 60KLOC system and
gathered profile information on 11 users (7 from the research
team) to collect 1100 sessions. The field data was then used
for impact analysis and regression testing improvement. The
findings indicate that field data can provide smaller impact
sets than slicing and truly reflect the system utilization,
while sacrificing precision. The study also highlighted the
potential lack of accuracy of in-house estimates, which can
also lead to a more costly regression testing process.
Overall, when revisiting the previous studies in terms of
the profiling strategies we find that: 1) the transfer on
anomaly strategy has not been validated through empiri-
cal studies, 2) the targeted profiling strategy has not been
validated with deployment data and its efficiency analysis
included just four small programs, 3) the strategy involv-
ing sampling has been validated more extensively in terms
of efficiency but the effectiveness measures were limited to
coverage, and 4) each assessment was performed in isolation.
Our studies address those weaknesses by improving on the following:
• Target object and subjects. Our empirical studies
are performed on a 155KLOC program utilized by 30
users, providing the most comprehensive setting yet to
study this topic.
• Comparison and integration of techniques within the
same context. We analyze the benefits and costs of
field data obtained with full instrumentation and compare
it against techniques utilizing targeted profiling,
sampling profiling, a combination of targeting and
sampling, and anomaly driven transfers.
• Assessment of effectiveness. We measure coverage ob-
tained by field instances but we also use field data to
generate test cases and measure their coverage.
Furthermore, we utilize the generated test suites on later
versions of the program to quantify fault detection
effectiveness.
• Assessment of efficiency. In addition to counting the
number of instrumentation probes, we measure the
number of necessary data transfers (a problem high-
lighted but not quantified in ).
3. EMPIRICAL STUDY
This section introduces the research questions that serve
to scope this investigation. The metrics, object of study, and
the design and implementation follow. Last, we identify the
threats to validity.
3.1 Research Questions
We are interested in the following research questions.
RQ1: What is the potential benefit of profiling deployed
software instances? In particular, we investigate the
coverage and fault detection effectiveness gained through
the generation of a test suite based on field data.
RQ2: How effective and efficient are profiling techniques
designed to reduce overhead at each deployed site? We
investigate the tradeoffs between efficiency gains (as
measured by the number of probes required to profile
the target software), coverage gains, and data loss.
RQ3: Can anomaly based triggers reduce the number of
data transfers? What is the impact on the potential
gains? We investigate the effects of triggering transfers
when a departure from an operational profile is detected.
3.2 Metrics
Throughout the studies we mainly employ functional cov-
erage to measure the potential benefit of field data obtained
through the implemented profiling techniques for released
software. We made a decision to capture functional level
data in the field because of its relative low overhead (see
Section 3.3). In two of the studies we also indirectly quantify
the effectiveness of these profiling techniques by measuring
the coverage (function and block level) and fault detection
capabilities of test suites generated with field data.
We utilize two measures to quantify the efficiency of
profiling techniques. First, we count the number of instru-
mentation probes per deployed instance. Second, we count
the number of transfers necessary to collect the field data.
3.3 Object of Study
We selected the popular⁴ program Pine (Program for Internet
News and Email) as the object of the experiment.
Pine is one of numerous programs that perform mail management
tasks. It has several advanced features such as
support for automatic incorporation of signatures, inter-
net newsgroups, transparent access to remote folders, mes-
sage filters, secure authentication through SSL, and multi-
ple roles per user. In addition, it supports tens of platforms
and offers flexibility for a user to personalize the program
by customizing configuration files.
Several versions of Pine source code are publicly available.
For our study we primarily use the Unix build, version 4.03,
which contains 1373 functions and 155,037 lines of code in-
Test Suite. To evaluate the potential of field data to im-
prove the in-house testing activity we required an initial test
suite on which improvements could be made. Since Pine
does not come with a test suite, two graduate students, who
were not involved in the current study, developed a suite
by deriving requirements from Pine’s man pages and user’s
manual, and then generating a set of test cases that exercised
the program’s functionality. Each test case was composed of
three sections: 1) a setup section to set folders and
configuration files, 2) a set of Expect commands that allow
testing of interactive functionality, and 3) a cleanup section to
remove all the test specific settings. The test suite consisted
of 288 automated test cases, containing an average of 34
Expect commands. The test suite took an average of 101
minutes to execute on a PC with an Athlon 1.3 processor,
512MB of memory, running Redhat version 7.2. When func-
tion and block level instrumentation was inserted, the test
suite execution required 103 and 117 minutes respectively
(2% and 14% overhead). Overall, the test suite covered 835
Faults. To quantify potential gains in fault detection effec-
tiveness we required the existence of faults. We leveraged
a parallel research effort by our team  that had resulted
in 43 seeded faults in four later versions of Pine: 4.04,
4.05, 4.10, and 4.20. (The seeding procedure follows the one
detailed in .) We then utilized these faults to quantify the
fault detection effectiveness of the test suites that leveraged
field data.

⁴ Pine had 23 million users worldwide as of March 2003.

3.4 Study Design and Implementation
The overall empirical approach was driven by the research
questions and constrained by the costs of collecting data for
multiple deployed instances. Throughout the study design
and implementation process, we strived to achieve a bal-
ance between the reliability and representativeness of the
data collected from deployed instances under a relatively
controlled environment, and the costs associated with obtaining
such a data set. As we shall see, combining a controlled
deployment and collection process with a posteriori
simulation of different scenarios and techniques helped to
reach such a balance (potential limitations of this approach
are presented under threats to validity in Section 3.5).
We performed the study in three major phases: (1) object
preparation, (2) deployment and data collection, and (3)
processing and simulation.
The first phase consisted of instrumenting Pine to enable
a broad range of data collection. The instrumentation
is meant to capture functional coverage information,
operational traces, accesses to environmental variables, and
changes in the configuration file occurring in a single session
(a session is initiated when the program starts and finishes
when the user exits). In addition, to enable further valida-
tion activities (e.g., test generation based on user’s session
data), the instrumentation also enables the capture of vari-
ous session attributes associated with user operations (e.g.,
folders, number of emails in folders, number and type of at-
tachments, errors reported on input fields). At the end of
each session, the collected data is time-stamped, marshaled,
and transferred to the central repository. For anonymization
purposes, the collected session data is packaged and labeled
with the encrypted sender’s name at the deployed site and
the process of receiving sessions is conducted automatically
at the server to reduce the likelihood of associating a data
package with its sender.
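The end-of-session packaging step described above could look roughly like the following Python sketch. It is an illustration only: the record layout, the function `package_session`, and the key are hypothetical, and a keyed hash (HMAC-SHA256) is used here as a stand-in for the encryption of the sender's name, whose actual scheme is not specified in the text.

```python
import hashlib
import hmac
import json
import time


def package_session(sender, session_data, key, now=None):
    """Time-stamp and marshal one session, labeled with an opaque sender tag."""
    # Assumption: a keyed hash replaces the unspecified encryption scheme.
    label = hmac.new(key, sender.encode("utf-8"), hashlib.sha256).hexdigest()
    record = {
        "sender_label": label,
        "timestamp": time.time() if now is None else now,
        "session": session_data,
    }
    return json.dumps(record, sort_keys=True).encode("utf-8")


# Hypothetical session data captured by the instrumentation.
session = {"functions_covered": ["open_folder", "compose"], "attachments": 2}
package = package_session("alice", session, key=b"site-secret", now=0.0)
decoded = json.loads(package)
```

The receiving server only sees the opaque label, so sessions from the same site can be grouped without exposing the sender's name in the package.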
We conducted the second phase of the study in two steps.
First, we deployed the instrumented version of Pine at five
“friendly” sites. For two weeks we used this preliminary de-
ployment to verify the correctness of the installation scripts,
data capture process and content, magnitude and frequency
of data transfer, and the transparency of the de-installation
process. After this initial refinement period, we proceeded
to expand the sample of users. The target population cor-
responded to the approximately 60 students in our Depart-
ment’s largest research lab. After promoting the study for a
period of two weeks, 30 subjects volunteered to participate
in the study (members from our group were not allowed to
participate). The study’s goal, setting, and duration (45
days) were explained to each of the subjects, and the
same fully-instrumented version of the Pine package was
made available for them to install. At the termination date,
1193 user sessions had been collected, an average of 1 ses-
sion per user per day (no sessions were received during 6
days due to data collection problems).
The last phase consisted of employing the collected data to
support different studies and simulating different scenarios
that could help us answer the research questions. The par-
ticular simulation details such as the manipulated variables,
the nuisance variables and the assumptions, vary depending
on the research question, so we address them individually
within each study.
3.5 Threats to Validity
This study, like any other, has some limitations that could
have influenced the results. Some of these limitations are un-
avoidable consequences of the decision to combine an obser-
vational study with simulation. However, given the cost of
collecting field data and the fundamental exploratory ques-
tions we are pursuing, these two approaches offered us a
good balance between data representativeness and power to
manipulate some of the independent variables.
By collecting data from 30 deployed instances of Pine during
a period of 45 days, we believe we have performed the
most comprehensive study of this type. Our program of
study is representative of many programs in the market,
limiting threats to external validity. Although our subjects are
students in our Department, we did not exercise any con-
trol during the period of study, which gives us confidence
that they behaved as any other user would under similar
circumstances. Still, more studies with other programs and
subjects are necessary to confirm the results we have ob-
tained. For example, we must include subjects that exercise
the configurable features of Pine and we need to distribute
versions of the program for various configurations.
Our simplifying assumptions about the deployment
management are another threat to external validity. Although
some of the assumptions are reasonable or could be restated
in the simulation, empirical studies specifically including
various deployment strategies are required. Furthermore,
our studies assume that the incorporation of instrumentation
probes and anomaly based triggers does not result
in additional program faults and that repeated deployments
are technically and economically feasible.
We are also aware of the potential impact of observational
studies on a subject’s behavior. Although we clearly stated
our goals and procedures, subjects could have been afraid
to send certain types of messages through our version of
Pine. This was a risk we were willing to take to increase the
chances of exploring different scenarios through simulation.
Overall, gaining users’ trust and willingness to be profiled
is a key issue for the proposed approaches to succeed and
should be the focus of future studies.
The in-house validation process helped to set a baseline for
the assessment of the potential of field data (RQ1) and the
anomaly based triggers for data transfers (RQ3). As such,
the quality of the in-house validation process is a threat to
internal validity which could have affected our assessments.
We partially studied this factor by carving weaker test suites
from an existing suite (Section 4.1) and by considering
different numbers of users to characterize the initial operational
profiles (Section 4.3). Still, the scope of our findings is
affected by the quality of the initial suite.
To address RQ2, we employ the number of probes to esti-
mate the potential performance overhead caused by different
instrumentation strategies. This metric is limited in that it
does not consider whether the probes were executed in the
field, which equates to assigning the same execution like-
lihood to all probes. Still, counting the number of probes
required by an instrumentation technique is advantageous
because it can be computed statically, providing an evalua-
tion that is independent of how the application is exercised.
Future studies could further mitigate this threat to con-
struct validity by considering complementary performance
measures (e.g., execution time, ratio of probes over instructions).