Content uploaded by Tassio Sirqueira

Author content

All content in this area was uploaded by Tassio Sirqueira on Jul 13, 2020

Content may be subject to copyright.


Sirqueira et al.

Application of Statistical Methods in Software Engineering: Theory and Practice

Tassio Sirqueira1,2*, Marcos Miguel1, Humberto Dalpra1, Marco Antônio Araújo1,2,3 and José Maria David1

*Correspondence: tassio@tassio.eti.br
1 University Center Academia (UniAcademia), R. Halfeld, 1179 - Centro, 36016-000 Juiz de Fora, Brazil
Full list of author information is available at the end of the article

Abstract

The experimental evaluation of the methods and concepts covered in Software Engineering has been increasingly valued. This appreciation reflects the constant search for new forms of assessment and validation of the results obtained in Software Engineering research. Results are validated in studies through evaluations, which in turn become increasingly stringent. As an alternative to aid in the verification of results, whether positive or negative, we suggest the use of statistical methods. This article presents some of the main statistical techniques available, as well as their use in carrying out data analysis in experimental studies in Software Engineering. The paper presents a practical approach, demonstrating the statistical techniques through a decision tree created to facilitate the choice of the appropriate statistical method for each data-analysis situation. Actual data from software projects were employed to demonstrate the use of these statistical methods. Although it is not the main aim of this work, basic concepts of experimentation and statistics are also presented, together with a concrete indication of the applicability of these techniques.

Keywords: Statistical Methods; Experimental Studies; Software Engineering; Experimental Evaluation

1 Introduction

Software Engineering (SE) deals with the development, maintenance and management of high-quality software systems in a cost-effective and predictable manner. Research in SE studies real-world phenomena, addressing the development of new systems, the modification of existing systems, and the technology (process models, methods, techniques, tools or languages) that supports SE activities. It also provides evaluations and comparisons of the effect of using technology to support complex interaction. Sciences that study real-world phenomena, i.e. the empirical sciences, collect information based on observation and experimentation instead of relying on logical or mathematical deduction [12].

An empirical approach to measuring SE technology, including industrial collaboration, started on a large scale in the 1970s with the work of [24] and [25]. Currently, there is a growing appreciation of the experimental evaluation of methods and concepts covered in Software Engineering. This increase indicates the search for new forms of assessment and validation of results obtained in SE studies [18]. The production of significant empirical evaluations faces many uncertainties and difficulties. Any researcher can forget or overlook a seemingly innocuous factor which may turn out to invalidate an entire work. Some essential experimental design guidelines can be ignored during the process, resulting in distrust regarding the validity of much of the work done [7].


Studies in the computational area generally converge on building software, algorithms or models [13]. Results obtained from these studies are validated through evaluations, which are increasingly stringent. As an alternative to aid in the validation of the positive or negative results obtained from a study, we propose the use of statistical methods. Statistics is applied in various areas of knowledge, providing methods for the collection, organization, analysis and interpretation of data [2]. Statistical power is an inherent part of experimental studies that employ significance tests, and it is essential for the planning, construction and validity of the findings of a study [4].

Organizations usually question whether the expected results are being achieved [10]. Answering this question is not a trivial task, given that their actual performance is not always known [14]. Statistical methods aim to provide numerical information, and data study and analysis can be divided into three phases [5]: (i) data acquisition; (ii) data description, classification and presentation; and (iii) data conclusions. The author adds that the second phase is commonly known as Descriptive Statistics. The third phase is called Inferential Statistics, and it is one of the most important stages, since it draws findings from the collected and organized data.

This article consists of three further sections. Section 2 presents the theoretical foundations. Section 3 discusses a practical approach to the use of the most common statistical methods. Finally, Section 4 describes the relevance of the application of statistical methods in Experimental Software Engineering studies.

2 Theoretical Foundations

The discussion on the role of statistical analysis in Experimental Software Engineering (ESE) indicates the underutilization of statistical power when discussing the obtained results [7]. This fact leads to the appearance of failed research projects, as well as to results of questionable validity [4].

Software Engineering commonly applies the scientific method in evaluating the benefits obtained from a new software-related technique, theory or method [26]. This method has traditionally been applied successfully in other sciences, especially the social sciences. The social sciences resemble Software Engineering because of the increasing importance of the human factor in software, since it is rarely possible to establish laws of nature as is done in physics or mathematics [1].

An experiment is an empirical examination that involves at least one treatment, a measurement of results, allocation units and some comparison, from which change can be inferred and attributed to the treatment [12]. According to [1], the experimental process is composed of four parts: definition, planning, execution and analysis. Statistical inference techniques can be applied to both the planning and the analysis of the experiment. Planning formulates the research hypothesis, identifies the dependent (response) and independent (factor) variables, and selects the participants and the methods of analysis. Furthermore, the study is designed, the instruments are defined and, finally, the validity threats are analyzed. The data analysis examines the resulting graphs and descriptive statistics, eliminates outliers (if applicable), analyzes the data distributions and applies statistical methods in order to reach the results.


The purpose of an experimental study is to collect data in a controlled environment in order to confirm or refute a hypothesis. A hypothesis refers to a theory which seeks to explain a certain behavior of interest to the research, and it leads to the definition of independent and dependent variables. Independent variables represent the causes that affect the outcome of the experimental process; the effect of the combination of the values of these variables is reflected in the dependent variables [1]. These variables can be quantitative, expressed in numerical values that can be divided into interval and ratio scales, or qualitative, when they are not numeric and can be divided into nominal and ordinal scales [2].

According to [6], nominal measurements only indicate the type of the data, and the only possible operation is to check whether the data has one value or another, for example, the specification of the gender of individuals. Ordinal measures also assign classes to the data, but these classes can be sorted from highest to lowest; an example is "Level 2", which is lower than "Level 3" in the CMMI model. Interval measurement assigns a real number to the data, with the zero value being arbitrary on the scale; an example is the temperature in degrees Celsius. Ratio measurements also assign a real number to the data, but zero is absolute, e.g., the distance in meters between two objects. A comparison of the characteristics of these scales can be seen in Table 1, as presented by [1].

Operation                        Nominal  Ordinal  Interval  Ratio
Value counting                      X        X        X        X
Value ordination                             X        X        X
Equidistant ranges                                    X        X
Value addition and subtraction                        X        X
Value division                                                 X

Table 1: Characteristics of the scales.

After collecting data from an experimental study, descriptive statistics are used to characterize the data, for example by indicating the middle of the set of observed data through measures of central tendency: the average, the median and the mode. The average is calculated as the sum of the collected values divided by their number. The median is obtained by arranging the values in ascending (or descending) order and selecting the midpoint. The mode is obtained by counting the number of occurrences of each value and selecting the most common one. Other relevant measures are the minimum value, which is the lowest value among the data collected; the maximum value, which is the highest; and the percentiles, which divide the sorted sample into proportions, notably the 25% percentile (or first quartile), the median (second quartile) and the 75% percentile (third quartile) [1].

To measure the extent to which the values are dispersed or concentrated in relation to their midpoint, dispersion measures are used, including the range, the variance and the standard deviation. The range is the difference between the highest and the lowest collected value. The variance is the sum of the squared differences between each value and the average of the collected values, divided by the number of collected values minus 1. The standard deviation, one of the most commonly used measures, is the square root of the variance [1].
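As an illustration, the central tendency and dispersion measures just described can be computed with the Python standard library over the "Difference (Expected - Held)" column of the data set analyzed later in Section 3 (Table 2); the rounded results agree with the trend measurements reported in Table 3. This is a sketch for the reader, not part of the paper's original Minitab-based analysis.

```python
from statistics import mean, median, stdev, variance

# "Difference (Expected - Held)" values from Table 2 (Before and After moments)
diff = [-159.878, -169.272, -90.343, -55.014, -221.262, -205.988, -199.633,
        223.505, 242.573, -55.478, 291.750, 189.691, 2.158, 188.935,
        244.925, 310.457]

avg = mean(diff)             # sum of the values / number of values
med = median(diff)           # midpoint of the sorted values
rng = max(diff) - min(diff)  # range: highest minus lowest value
var = variance(diff)         # sum of squared deviations / (n - 1)
sd = stdev(diff)             # square root of the variance

print(round(avg, 1), round(med, 1), round(rng, 1), round(var, 1), round(sd, 1))
# → 33.6 -26.4 531.7 40244.0 200.6
```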

A statistical hypothesis is a conjecture about unknown aspects of a data sample observed in a study, which can be supported or refuted through a hypothesis test [15]. Hypothesis testing requires the specification of an acceptable level of statistical error, i.e. the risks to which the study is exposed by decision-making [9][8]. To carry out a hypothesis test, a null hypothesis is defined, identified as H0, which states that there is no difference between the parameter and the statistic being compared, and an alternative hypothesis, identified as H1, which contradicts H0 [3]. In general, one seeks to reject the null hypothesis in order to demonstrate that the variations in the sample obtained with some intervention, or treatment, are not accidental.

There are two possible types of error: (i) the Type I error, which occurs when the statistical test indicates a cause-and-effect relationship that does not exist, also called a False Positive; and (ii) the Type II error, in which the statistical test fails to indicate an existing cause-and-effect relationship [3].

The probability of committing a Type I error (alpha) is related to the significance level of the hypothesis test. The lower the significance level, the greater the assurance that a nonexistent relationship will not be identified [11]. The statistical significance of a result is an estimated measure of its degree of reliability, i.e., the p-value is a decreasing index of the reliability of a result [19]. In many research areas, the 0.05 p-level is customarily treated as an "acceptable margin" of error. In particular research areas, such as medicine, the p-level can reach 0.001 and is often called "highly" significant; however, such stricter levels make the test more susceptible to Type II error [6].

Figure 1 shows a graphical representation of the acceptance region of the null hypothesis at a 5% significance level.

Figure 1: Graphical representation of accepting the null hypothesis.

In order to reduce Type I and Type II errors, experimental designs specify how the treatments, or factor levels, are assigned to the experimental units, or plots [20]. Some of the main experimental designs are the completely randomized design, the randomized block design and the Latin square design [21].

The completely randomized design is used when the variability between plots is small or nonexistent. In a randomized block design, the experimental material is divided into homogeneous groups; the objective at all stages of the experiment is to keep the error within each block as small as is practical. In the Latin square design, treatments are grouped into repetitions in two distinct ways; this systematization of the blocks in two directions, generically called "rows" and "columns", allows variation effects to be removed from the experimental error.

Hypothesis tests can be parametric, using the parameters of the distribution, or an estimate of them, or nonparametric [16]. Nonparametric tests use ranks assigned to the sorted data to calculate their statistics and are free of assumptions about the probability distribution of the data studied [22]. In parametric tests, the distribution is assumed to be known. Although this is a more powerful approach, it requires data normality and homoscedasticity, i.e., homogeneity of the variances of the samples. In nonparametric tests the sample distribution is not known, resulting in a less precise approach [3][17].

When planning an experiment, one must choose the variable whose effects one wants to observe, which is called the "factor". The categories of the factor under study are called "treatments" [21]. In general, the purpose of an experimental study is to compare treatments and to verify whether they have the same effect on a measured characteristic, or whether at least one of them has a different effect when compared to the others [21].

Figure 2 shows a decision tree that facilitates the selection of an appropriate statistical method for hypothesis testing, according to the characteristics of the sample and of the study performed. The concepts of the tests used in the flow chart of Figure 2 are presented next, according to [1].

• The Kolmogorov-Smirnov (KS) test evaluates the similarity between the distributions of two samples. It may also indicate the similarity of the distribution of a sample to a classic distribution, such as the normal distribution.
• The Shapiro-Wilk test calculates the W value, which evaluates a sample Xj with regard to the normal distribution. This test is generally used for small data sets.
• The Levene test considers the variances homogeneous if the W value is smaller than the corresponding value of the normal distribution.
• The T or Student's t test is used to compare the means of two independent samples. Different variants of the test are applied depending on the differences detected in the sample variances.
• The ANOVA method is a statistical technique that tests the equality of the means of two or more groups.
• The Tukey test, commonly used along with ANOVA, helps to identify the samples whose means diverge.
• The Mann-Whitney test is a nonparametric alternative to the T test. To perform the Mann-Whitney test, the samples must be independent, with ordinal, interval or ratio scales.
• The Kruskal-Wallis test is a nonparametric alternative to the analysis of variance (ANOVA) which, like most nonparametric tests, is based on the replacement of the values by their ranks in the set of all values.
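One possible reading of the decision tree in Figure 2, inferred from the way the tests are applied in Section 3, can be sketched as a small function using scipy. This is a hypothetical helper, not part of the paper's Minitab workflow: normality is checked first (Shapiro-Wilk for small samples), then homoscedasticity (Levene), and the number of treatments selects between the t test, ANOVA, Mann-Whitney and Kruskal-Wallis.

```python
from scipy import stats

def choose_and_run(groups, alpha=0.05):
    """Pick a hypothesis test following the decision tree:
    normality (Shapiro-Wilk) -> homoscedasticity (Levene) ->
    parametric (t / ANOVA) or nonparametric (Mann-Whitney / Kruskal-Wallis)."""
    normal = all(stats.shapiro(g).pvalue > alpha for g in groups)
    homoscedastic = stats.levene(*groups, center='median').pvalue > alpha
    if normal and homoscedastic:
        if len(groups) == 2:
            return 't-test', stats.ttest_ind(*groups).pvalue
        return 'ANOVA', stats.f_oneway(*groups).pvalue
    if len(groups) == 2:
        return 'Mann-Whitney', stats.mannwhitneyu(*groups).pvalue
    return 'Kruskal-Wallis', stats.kruskal(*groups).pvalue
```

Note that scipy's Shapiro-Wilk and Levene implementations may return p-values that differ from Minitab's, so for borderline data the branch chosen can differ from the one shown in the paper's figures.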


Figure 2: Decision tree.

3 The decision tree in action

This section presents an example whose main objective is to demonstrate the application of the aforementioned statistical techniques, showing step-by-step how to use the methods laid out in the decision tree of Figure 2.

For the demonstration, data extracted from a real database belonging to a software development company will be used, shown in Table 2. The goal is to verify the gains of calculating the development time automatically, through a plugin, as opposed to calculating it manually based on the experience of the developers involved in the project. The first 8 months (12/2013 to 07/2014) presented in the table reflect the period in which the development team planned the continual maintenance of the list of change orders manually. The values shown in the columns "Expected Hours" and "Held Hours" represent the time, in hours, that the staff estimated and actually spent on the progress or completion of change requests; in this period the estimates were made empirically, based on the experience of the developers involved. The following 8 months (08/2014 to 03/2015) comprise the period of the same maintenance planning in which the development time was estimated through an automated calculation plugin, based on the history of maintenance performed by the team. The column "Difference (Expected - Held)" displays the difference between the expected and the held time throughout the whole process.

The Minitab tool [27], version 17, was used to generate the analyses presented below.

Table 3 shows the trend measurements of the data that will be analyzed and discussed in the example presented in this article. These values were calculated from Table 2. The information presented in Table 3 can be generated in Minitab through the menu "Stat –> Basic Statistics –> Display Descriptive Statistics" by selecting the variable "Difference (Expected - Held)".

This information helps in analyzing the data shown in Figure 3: the "Minimum" is the smallest existing value of the variable "Difference (Expected - Held)" and the "Maximum" is the highest; the "Range" is the difference between the lowest and the highest value; the 1st and 3rd quartiles delimit the lower and upper 25% of the sorted values; and the standard deviation measures the dispersion of the values around the mean.

The boxplot is a graph used to evaluate the distribution of empirical data, formed by the first and third quartiles and the median. Analyzing the boxplot, we can observe the figures associated with the moments before and after the implementation of the plugin in relation to the drawn median. This graph can be generated in Minitab through the menu "Graph –> Boxplot", option "One Y / With Groups", by selecting "Difference (Expected - Held)" as the variable and "Moment" as the category.

Another possible check on the presented data is the outlier analysis. Outliers are observations in the samples which are either very far from the others or inconsistent when compared to them; they are also called abnormal, contaminant, strange, extreme or discrepant observations [23].

It is important to know the reasons that lead to the appearance of outliers in order to determine the correct way to treat them. Among the possible causes are measurement errors, data-entry errors and the inherent variability of the elements of the population [23]. An outlier resulting from collection or measurement errors should be discarded. However, if the observed value is possible, the outlier should not necessarily be discarded.

In order to identify outliers, it is necessary to calculate the median, the lower quartile (Q1) and the upper quartile (Q3) of the samples. After that, the lower quartile is subtracted from the upper quartile and the result is stored in L (the interquartile range). The values falling in the intervals between Q3+1.5L and Q3+3L, and between Q1-1.5L and Q1-3L, will be


Year/Month  Held Hours  Expected Hours  Number of Cases  Cases Size  Difference (Expected - Held)  Moment
2013/12     259.878     100.000         36               M           -159.878                      Before
2014/01     749.272     580.000         84               L           -169.272                      Before
2014/02     570.343     480.000         74               L            -90.343                      Before
2014/03     535.014     480.000         74               L            -55.014                      Before
2014/04     311.262      90.000         33               S           -221.262                      Before
2014/05     285.988      80.000         28               S           -205.988                      Before
2014/06     279.633      80.000         28               S           -199.633                      Before
2014/07     256.495     480.000         52               M            223.505                      Before
2014/08     437.427     680.000         52               M            242.573                      After
2014/09     450.845     395.367         58               M            -55.478                      After
2014/10     225.472     517.222         75               L            291.750                      After
2014/11     602.305     791.996         95               L            189.691                      After
2014/12     450.147     452.305         62               M              2.158                      After
2015/01     327.089     516.024         70               L            188.935                      After
2015/02     258.536     503.461         65               L            244.925                      After
2015/03     310.315     620.772         80               L            310.457                      After

Table 2: Planning Data (Expected and Held Values).


Trend measurement     Value - Difference (Expected - Held)
Average                  33.6
Median                  -26.4
Mode                     * (number of modes: 0)
Range                   531.7
Minimum                -221.3
Maximum                 310.5
1st quartile           -166.9
3rd quartile            237.8
Variance              40244.0
Standard deviation      200.6

Table 3: Analysis of measurements.

Figure 3: Boxplot for the variables Difference (Expected - Held) x Moment.

considered mild outliers, which can be accepted in the population. Values greater than Q3+3L or smaller than Q1-3L should be considered extreme outliers. In this case, the origin of the dispersion must be investigated, because these are the most extreme points analyzed [23].

Figure 4 shows the outlier analysis for the variable "Difference (Expected - Held)". As can be seen, the samples did not contain values that can be considered outliers, assuming a 5% significance level.
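The fence computation described above can be sketched in Python over the "Difference (Expected - Held)" column of Table 2. Note that quartile conventions vary between tools: numpy's method='weibull' (available in numpy 1.22+) uses the (n+1)p interpolation that, for these data, reproduces the quartiles reported in Table 3; this correspondence is an observation about this sample, not a general guarantee.

```python
import numpy as np

# "Difference (Expected - Held)" values from Table 2
diff = np.array([-159.878, -169.272, -90.343, -55.014, -221.262, -205.988,
                 -199.633, 223.505, 242.573, -55.478, 291.750, 189.691,
                 2.158, 188.935, 244.925, 310.457])

# Quartiles; (n+1)p interpolation matches Table 3 (Q1 = -166.9, Q3 = 237.8)
q1, q3 = np.percentile(diff, [25, 75], method='weibull')
L = q3 - q1  # interquartile range

# Fences: beyond 1.5L = mild outlier candidates, beyond 3L = extreme outliers
mild = (diff > q3 + 1.5 * L) | (diff < q1 - 1.5 * L)
extreme = (diff > q3 + 3 * L) | (diff < q1 - 3 * L)
print(mild.any(), extreme.any())  # no outliers in this sample
```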

3.1 Analysis with 1 Factor and 2 Treatments using the parametric statistical method

The analysis begins by checking the normality of the data; the columns considered are "Difference (Expected - Held)" and "Moment". Since the table has fewer than 30 samples, the Shapiro-Wilk test is used, as shown in the decision tree of Figure 2. The following hypotheses are considered:

• H0 (null hypothesis): The samples have normal distribution.


Figure 4: Outlier analysis for the variables "Difference (Expected - Held)" and "Moment".

• H1 (alternative hypothesis): The samples do not have normal distribution.

To carry out this test in Minitab, the menu option "Stat –> Basic Statistics –> Normality Test ..." is used and the variable "Difference (Expected - Held)" must be selected. The normality test should be applied to each variable individually. Since the variable "Moment" is nominal and presents only two options (before and after), it does not require a normality test. In Figures 5a and 5b it can be noted that, at a significance level of 5%, the samples are normal, since they have a p-value of 0.067, higher than the 0.05 significance level. This indicates the acceptance of H0: the samples show normal distribution.

Following the decision tree, we should next check the homoscedasticity of the samples, considering the following hypotheses:

• H0 (null hypothesis): Samples are homoscedastic.
• H1 (alternative hypothesis): Samples are not homoscedastic.

The homoscedasticity test must be applied to the two variables involved in the hypothesis test, checking whether the two samples are homoscedastic with respect to each other. In Minitab, this can be done via the menu "Stat –> Basic Statistics –> 2 Variances ..." and by selecting the two variables to be compared.

As shown in Figure 6, which displays the Levene test for variance equality at a 5% significance level, we obtain a p-value of 0.913. The samples are therefore homoscedastic, given that the p-value is greater than the 5% significance level, which indicates that a parametric test should be used.

For the statistical analysis, the parametric T test will be used, according to the decision tree, due to the number of treatments (2 treatments: "Before" and "After"


(a) Moment: Before. (b) Moment: After.
Figure 5: Normality analysis for the variable Difference (Expected - Held).

the adoption of the time estimation plugin). This test is applied considering the following hypotheses:

• H0 (null hypothesis): There is no difference between the means.
• H1 (alternative hypothesis): There is a difference between the means.


Figure 6: Homoscedasticity analysis for the variables "Difference (Expected - Held)" and "Moment".

To run the T test in Minitab, we access the menu "Stat –> Basic Statistics –> 2-Sample t ...". Upon application of the test at a 5% significance level, it is observed in Figure 7 that, with a p-value equal to 0.001, we can reject the null hypothesis that the means are statistically equivalent. Thus, there are significant differences, from a statistical point of view, between the samples.

Figure 7: T test analysis for the variables "Difference (Expected - Held)" x "Moment".

With the result of the analysis, we conclude that there is a significant difference between the means at a 5% significance level with regard to the time difference before and after the adoption of the plugin. Since the means are significantly different, and the mean value is greater when using the plugin, we can conclude that the use of the plugin was positive with regard to the time difference.
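The homoscedasticity check and the t test of this subsection can also be reproduced outside Minitab; the sketch below uses scipy on the Table 2 data. The exact p-values may differ slightly from the Minitab output shown in Figures 6 and 7, but the decisions at the 5% level are the same.

```python
from scipy import stats

# "Difference (Expected - Held)" split by Moment (Table 2)
before = [-159.878, -169.272, -90.343, -55.014, -221.262, -205.988, -199.633, 223.505]
after = [242.573, -55.478, 291.750, 189.691, 2.158, 188.935, 244.925, 310.457]

# Homoscedasticity: Levene test (Minitab reported p = 0.913)
lev = stats.levene(before, after, center='median')

# Two-sample t test on the means (Minitab reported p = 0.001)
tt = stats.ttest_ind(before, after)

print(f"Levene p = {lev.pvalue:.3f}, t-test p = {tt.pvalue:.3f}")
```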

3.2 Analysis with 1 Factor and 2 Treatments using the non-parametric statistical method

Continuing the demonstration, we now exemplify the non-parametric tests. We start with the normality check, for which the columns "Expected Hours" and "Moment" are taken into consideration. Since the table has fewer than 30 samples, the Shapiro-Wilk test is used, as shown in Figure 2.

The following hypotheses are considered:

• H0 (null hypothesis): The samples have normal distribution.
• H1 (alternative hypothesis): The samples do not have normal distribution.

As shown in Figures 8a and 8b, the samples exhibit a normal distribution, with a p-value of 0.066 for the variable "Expected Hours". Therefore, their homoscedasticity must be checked, considering the following hypotheses:

• H0 (null hypothesis): Samples are homoscedastic.
• H1 (alternative hypothesis): Samples are not homoscedastic.

As shown in Figure 9, which displays the Levene test comparing the variables "Expected Hours" and "Moment", a p-value of 0.006 is obtained. The samples are therefore not homoscedastic, given that the p-value is lower than the 5% significance level, thus indicating the need for a non-parametric test.

To illustrate the use of a non-parametric method for one factor and two treatments, the Mann-Whitney test (the non-parametric alternative to the T test) is used. To conduct this test, the variables "Difference (Expected - Held)" and "Moment" were compared, since the latter has two treatments ("Before" and "After" the adoption of the time estimation plugin). This test is applied considering the following hypotheses:

• H0 (null hypothesis): There is no difference between the means.
• H1 (alternative hypothesis): There is a difference between the means.

The non-parametric Mann-Whitney test can be performed through the Minitab menu "Stat –> Nonparametrics –> Mann-Whitney ...".

Figure 10 shows the Mann-Whitney test for the comparison of the central values. The samples differ from a statistical standpoint, since the p-value obtained is lower than the 5% significance level. Thus, the null hypothesis is rejected, which indicates that there is a significant difference between the groups: the use of the plugin has brought significant benefits with regard to the difference between the expected hours and the held hours in the project.

After analyzing the time difference between the values before and after the adoption of the time estimation plugin, it can be said that there is a significant difference, from a statistical point of view, between the observed values, considering a 5% significance level.
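As with the previous subsection, the Levene check on "Expected Hours" and the Mann-Whitney test on "Difference (Expected - Held)" can be reproduced with scipy over the Table 2 data; the p-values may differ slightly from the Minitab output, but the decisions at the 5% level are the same.

```python
from scipy import stats

# "Expected Hours" by Moment (Table 2) -- fails the homoscedasticity check
exp_before = [100.000, 580.000, 480.000, 480.000, 90.000, 80.000, 80.000, 480.000]
exp_after = [680.000, 395.367, 517.222, 791.996, 452.305, 516.024, 503.461, 620.772]
lev = stats.levene(exp_before, exp_after, center='median')  # Minitab reported p = 0.006

# Nonparametric Mann-Whitney test on "Difference (Expected - Held)" by Moment
before = [-159.878, -169.272, -90.343, -55.014, -221.262, -205.988, -199.633, 223.505]
after = [242.573, -55.478, 291.750, 189.691, 2.158, 188.935, 244.925, 310.457]
mw = stats.mannwhitneyu(before, after)

print(f"Levene p = {lev.pvalue:.3f}, Mann-Whitney p = {mw.pvalue:.3f}")
```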


(a) Moment: Before. (b) Moment: After.
Figure 8: Normality analysis of the variable "Expected Hours".

Figure 9: Homoscedasticity analysis for the variables "Expected Hours" and "Moment".

Figure 10: Mann-Whitney test for the variables "Difference (Expected - Held)" and "Moment".

3.3 Analysis with 1 Factor and more than 2 Treatments using the parametric statistical method

To illustrate the use of ANOVA, we consider the normality of the variable "Difference (Expected - Held)", already verified in Figure 5, and the separation of the variable "Number of Cases" into three groups: "Small" (S), with up to 33 cases; "Medium" (M), with between 34 and 66 cases; and "Large" (L), with more than 66 cases. This separation takes into account the fact that each version of the software developed by the company does not exceed 100 cases, and it is represented in the "Cases Size" column of Table 2.

For this purpose, the following hypotheses need to be considered:

• H0 (null hypothesis): Samples are homoscedastic.
• H1 (alternative hypothesis): Samples are not homoscedastic.

As shown in Figure 11, which presents the Levene test for variance equality at a 5% significance level, we obtain a p-value of 0.220. The samples are therefore homoscedastic, given that the p-value is greater than the 5% significance level, thus indicating that a parametric test should be used.

Figure 11: Homoscedasticity test for the variable Difference (Expected - Held), in hours.

For the statistical analysis, the parametric ANOVA test (whose non-parametric alternative is the Kruskal-Wallis test) will be used, according to the decision tree, to compare the "Difference (Expected - Held)" across the "Cases Size" groups of the samples. This test is applied considering the following hypotheses:

• H0 (null hypothesis): There is no difference between the means.
• H1 (alternative hypothesis): There is a difference between the means.

The ANOVA test is available in Minitab through the menu "Stat –> ANOVA –> One-Way ..." by selecting "Difference (Expected - Held)" as the variable and "Cases Size" as the treatment.

Upon application of the test at a 5% significance level, and with a p-value equal to 0.045, Figure 12 shows that we can reject the null hypothesis that the mean values are statistically equivalent. Upon completion of this test, we can also obtain a graphical analysis of the ANOVA, with a comparison of the means according to the treatment, as shown in Figure 13.

By applying the Tukey test, available through the "Comparisons ..." button of the ANOVA analysis, to the same variables discussed above, we obtain the statistical comparison by treatment, which is used when the p-value of the ANOVA is lower than the significance level, as shown both in Figure 14 and in


Figure 12: Analysis of the ANOVA test for the variables "Difference (Expected - Held)" and "Cases Size".

the graphical analysis of Figure 15.

The ANOVA method shows that there is a difference between the expected and held times in relation to the number of cases. This analysis is confirmed by the Tukey method, since the difference in mean values between a small and a large number of cases is significant at the 5% level.
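The Tukey comparison can likewise be sketched with `scipy.stats.tukey_hsd` (available in SciPy 1.8 and later); the three groups below are synthetic stand-ins for the "Cases Size" treatments, not the paper's data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical groups for the three "Cases Size" treatments.
small = rng.normal(loc=1.0, scale=1.0, size=15)
medium = rng.normal(loc=1.4, scale=1.0, size=15)
large = rng.normal(loc=3.0, scale=1.0, size=15)

# Tukey's HSD compares every pair of means while controlling the
# family-wise error rate; run it only after ANOVA rejects H0.
res = stats.tukey_hsd(small, medium, large)

names = ["small", "medium", "large"]
for i in range(3):
    for j in range(i + 1, 3):
        print(f"{names[i]} vs {names[j]}: p = {res.pvalue[i, j]:.4f}")
```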

Figure 13: Graphical analysis of the ANOVA for the variables "Difference (Expected - Held)" and "Cases Size".

Figure 14: Analysis of the ANOVA test using the Tukey test for the variables "Difference (Expected - Held)" and "Cases Size".

3.4 Analysis with 1 Factor and 2 or more Treatments using the non-parametric statistical method

Considering the normality of the data presented in Figure 8, and still following the decision tree, the next step refers to the comparison of mean values with regard to their homoscedasticity. This test is applied to the variables "Expected Hours", whose normality has already been checked (Figure 8), and "Cases Size", due to the number of treatments (3). For this purpose, the following hypotheses need to be considered:

•H0 (null hypothesis): Samples are homoscedastic.

•H1 (alternative hypothesis): Samples are not homoscedastic.

Figure 16 shows the Levene test used to check the equality of variances. At a 5% significance level, a p-value of 0.001 is obtained. The samples are therefore not homoscedastic, given that the p-value is lower than the 5% significance level, thus indicating the need for a non-parametric test.


Figure 15: Graphical analysis of the ANOVA test using the Tukey test for the

variables “Diﬀerence (Expected - Held)" and “Cases Size".

Figure 16: Graphical analysis of the Levene test for the variables “Expected Hours"

and “Cases Size".

We used the Kruskal-Wallis test (the non-parametric alternative to ANOVA). This test is applied considering the following hypotheses:


•H0 (null hypothesis): There is no diﬀerence between the means.

•H1 (alternative hypothesis): There are diﬀerences between the means.

By applying this test at a 5% signiﬁcance level, a p-value equal to 0.011 can be seen

in Figure 17, thus indicating the rejection of the null hypothesis that the means are

statistically equivalent.

Figure 17: Result of the Kruskal-Wallis test.
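A hedged SciPy sketch of the Kruskal-Wallis step, with synthetic heteroscedastic groups in place of the real "Expected Hours" data from Figure 16:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

# Hypothetical "Expected Hours" per "Cases Size" treatment; note the
# unequal spreads, mimicking the heteroscedasticity found by Levene.
small = rng.normal(loc=10.0, scale=1.0, size=15)
medium = rng.normal(loc=12.0, scale=3.0, size=15)
large = rng.normal(loc=16.0, scale=6.0, size=15)

# Kruskal-Wallis: rank-based H0 that all groups come from the same
# distribution; it assumes neither normality nor equal variances.
h_stat, p_value = stats.kruskal(small, medium, large)
print(f"H = {h_stat:.3f}, p-value = {p_value:.4f}")
```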

As the result obtained from the Kruskal-Wallis test was the rejection of the null hypothesis (H0), the Mann-Whitney test for comparison between groups must be applied. This second analysis demonstrates that there is a significant difference between the mean values, considering a 5% significance level. The comparison between groups (M-L, L-S, M-S) can be seen in Figures 18, 19 and 20, respectively.

Figure 18: Result of the Mann-Whitney test between “Expected Hours" and Cases

M-L.

As a result, there is a significant difference between the mean values at a 5% significance level, which indicates the acceptance of the alternative hypothesis (H1).
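The pairwise follow-up can be sketched with `scipy.stats.mannwhitneyu` on synthetic data; the Bonferroni correction shown is an addition of ours for the three comparisons, not part of the paper's Minitab procedure:

```python
import itertools

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Hypothetical "Expected Hours" samples for the M, L and S treatments.
groups = {
    "M": rng.normal(loc=12.0, scale=3.0, size=15),
    "L": rng.normal(loc=16.0, scale=6.0, size=15),
    "S": rng.normal(loc=10.0, scale=1.0, size=15),
}

# After Kruskal-Wallis rejects H0, Mann-Whitney compares each pair of
# groups (M-L, L-S, M-S), as in Figures 18-20.  Dividing alpha by the
# number of comparisons (Bonferroni) guards the family-wise error rate.
alpha = 0.05 / 3
for a, b in itertools.combinations(groups, 2):
    u_stat, p_value = stats.mannwhitneyu(groups[a], groups[b])
    verdict = "differ" if p_value < alpha else "do not differ"
    print(f"{a}-{b}: U = {u_stat:.1f}, p = {p_value:.4f} -> groups {verdict}")
```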

Figure 19: Result of the Mann-Whitney test between "Expected Hours" and Cases L-S.

Figure 20: Result of the Mann-Whitney test between "Expected Hours" and Cases M-S.

4 Conclusion

Based on the results obtained from the statistical analysis of the samples, we can state that the use of the plugin showed improvements, from a statistical point of view, over the empirical methodology initially adopted.

The use of statistical methods has gained increasing attention within research in Experimental Software Engineering, which shows their level of importance. It is also important to highlight that statistical methods are used for planning and conducting a study, describing data and supporting decision-making, as in the case of hypothesis tests and the risks associated with them. The formulation of hypotheses arising from statistical methods has become widespread. In order to decide whether a particular hypothesis is supported by a set of data, one must have an objective procedure by which to accept or reject it. When formulating a decision on the null hypothesis (H0), two different errors can occur. The first, called a Type I error, consists of rejecting H0 when it is true. The second, called a Type II error, happens when one accepts H0 when it is false.
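The Type I error rate can be made concrete with a small simulation: when H0 is in fact true, a test at the 5% significance level should wrongly reject it in roughly 5% of repetitions. A sketch using a t-test on synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Draw many pairs of samples for which H0 ("equal means") is TRUE,
# and count how often a t-test at the 5% level wrongly rejects it.
n_trials = 2000
false_rejections = 0
for _ in range(n_trials):
    a = rng.normal(loc=0.0, scale=1.0, size=30)
    b = rng.normal(loc=0.0, scale=1.0, size=30)  # same mean: H0 is true
    _, p_value = stats.ttest_ind(a, b)
    if p_value < 0.05:
        false_rejections += 1  # a Type I error

rate = false_rejections / n_trials
print(f"Empirical Type I error rate: {rate:.3f}")  # should hover near 0.05
```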

In order to facilitate the choice of an appropriate statistical method depending on

the sampling characteristics, this paper presents a decision tree, which shows the

tests to be applied at each stage of data analysis.


Statistical analyses of samples may help researchers, software engineers and team leaders in making important decisions about the risks associated with a project. This approach can produce greater maturity on the part of software engineers, considering that they would no longer develop software based on assumptions, but on facts drawn from the statistical analyses of data from the software development processes.

As future work, we intend to carry out additional experiments that consider the use of our proposal in other real-world contexts.

5 Acknowledgements

The authors thank IF Sudeste MG, CAPES and UFJF for the support and encouragement of research.

Author details

1University Center Academia (UniAcademia), R. Halfeld, 1179 - Centro, 36016-000 Juiz de Fora, Brazil.

2Postgraduate Program in Computer Science - Federal University of Juiz de Fora (UFJF), Juiz de Fora, Brazil.

3Federal Institute of Education, Science and Technology of Southeast Minas Gerais (IF Sudeste MG), Rua Bernardo Mascarenhas, 1283 - Bairro Fábrica, 36080-001 Juiz de Fora, Brazil.

References

1. ARAÚJO, M. A P. et al. Métodos Estatísticos aplicados em Engenharia de Software Experimental. XXI

SBBD - XX SBES, 2006. (in portuguese).

2. BATTISTI, I. D. E.; BATTISTI, G. Métodos estatísticos. 2008. (in portuguese).

3. COOPER, D. R.; SCHINDLER, P. S. Métodos de pesquisa em administração. 7. ed. 2011 (in portuguese).

4. DYBÂ, T. et al. A systematic review of statistical power in software engineering experiments. Information

and Software Technology, v. 48, n. 8, p. 745-755, 2006.

5. FERNANDES, E. M. da G. P. Estatística Aplicada. Braga: American Mathematical Society, 1999. (in

portuguese).

6. HAIR, J. F. et al. Análise multivariada de dados. Bookman, 2009. (in portuguese).

7. MILLER, J. et al. Statistical power and its subcomponents—missing and misunderstood concepts in

empirical software engineering research.Information and Software Technology, v. 39, n. 4, p. 285-295, 1997.

8. NEYMAN, J.; PEARSON, E. S. On the problem of the most eﬃcient tests of statistical hypotheses,

Transactions of the Royal Society of London Series A 231. P. 289-337, 1933.

9. NEYMAN, J.; PEARSON, E. S. On the use and interpretation of certain test criteria for purposes of

statistical inference: Part I. Biometrika, p. 175-240, 1928.

10. ROCHA, A. R. C. et al. Medição de Software e Controle Estatístico de Processos. Ministério da Ciência,

Tecnologia e Inovação, Secretária de Política de Informática, n. 8, 2012.(in portuguese).

11. SHIMAKURA, S. E. Bioestatística A. Departamento de Estatística, UFPR. Available

in:http://leg.ufpr.br/ shimakur/CE055/node74.html. Access: 25/sep/2014.(in portuguese).

12. SJOBERG, D. IK et al. The future of empirical methods in software engineering research. In: Future of

Software Engineering, FOSE’07. IEEE, p. 358-378, 2007. .

13. WAINER, J. Métodos de pesquisa quantitativa e qualitativa para a Ciência da Computação. Atualização

em informática, p. 221-262, 2007. (in portuguese).

14. FLORAC, W. A.; CARLETON, A. D. Measuring the software process: statistical process control for

software process improvement. Addison-Wesley Professional, 1999.

15. MONTGOMERY, D. C.; RUNGER, G. C. Applied statistics and probability for engineers. John Wiley &

Sons, 2010.

16. SIEGEL, S. Nonparametric statistics for the behavioral sciences. 1956.

17. CÂMARA, F. G.; SILVA, O. Estatística Não Paramétrica - Testes de Hipóteses e Medidas de Associação.

Departamento de Matemática - Universidade dos Açores, Ponta Delgada, 2001. (in portuguese).

18. JURISTO, N.; MORENO, A. (2002) ”Reliable Knowledge for Software Development", IEEE Software, pp.

98-99, sep-oct, 2002.

19. MAHADEVAN, S.; HALDAR, A. Probability, reliability and statistical method in engineering design. John

Wiley & Sons, 2000.

20. MARKOWITZ, H. M. Mean-Variance Analysis. Palgrave Macmillan UK, 1989.

21. NATRELLA, M. G. Experimental statistics. Courier Corporation, 2013.

22. HOLLANDER, M. et al. Nonparametric statistical methods. John Wiley & Sons, 2013.

23. BARNETT, V.;TOBY, L. Outliers in statistical data. Vol. 3. New York: Wiley, 1994.

24. BASILI, V.R. et al. Lessons learned from 25 years of process improvement: the rise and fall of the NASA

Software Engineering laboratory. ICSE-24, pp. 69-79, 2002.

25. BOEHM, B. et al. Foundations of Empirical Software Engineering, The Legacy of Victor R. Basili,

Springer-Verlag, 2005.

26. WOHLIN, C. et al. Experimentation in software engineering. Springer Science & Business Media, 2012.

27. MINITAB. Available in: http://www.minitab.com. Access: 15/feb/2016.