Useful Statistical Methods for Human Factors Research in
Software Engineering: A Discussion on Validation with
Quantitative Data
Lucas Gren
Chalmers and University of Gothenburg
Gothenburg, Sweden 412–92 and
University of São Paulo
São Paulo, Brazil 05508–090
lucas.gren@cse.gu.se
Alfredo Goldman
University of São Paulo
São Paulo, Brazil 05508–090
gold@ime.usp.br
ABSTRACT
In this paper we describe the usefulness of statistical validation techniques for human factors survey research. We need to investigate a diversity of validity aspects when creating metrics in human factors research, and we argue that the statistical tests used in other fields to support reliability and construct validity of surveys should also be applied to human factors research in software engineering more often. We also briefly show how such methods can be applied (Test-Retest, Cronbach's α, and Exploratory Factor Analysis).
CCS Concepts
• Applied computing → Mathematics and statistics; Law, social and behavioral sciences;
Keywords
Human factors; Psychology; Quantitative data; Statistical
tests; Validation
1. INTRODUCTION
Science has contributed enormously to the development of mankind. We have successfully observed the world and created models that help us understand and predict a diversity of events, such as the behavior of waves or the photoelectric effect. However, it is important
to note that our predictive models are only models. As the
famous statistician George E.P. Box said:
“Essentially, all models are wrong, but some are
useful.” p. 424 [3].
The problem is that in more complex systems the deter-
ministic models are no longer useful in the same way. This is
when the mathematical models can be extended with probabilities. Stochastic models that express how likely an event is to occur then make much more sense than setting out to describe all variables deterministically (which is often not feasible) [1].
The human mind is excellent at seeing patterns in a huge
number of variables [15]. Therefore, when investigating hu-
man factors, it often makes sense to collect qualitative data
and let the researchers (preferably, independently of each
other) systematically look for patterns in the data set (e.g.
a grounded theory approach) [9]. However, since it is always good to look at a phenomenon (or construct) from different perspectives, triangulation is preferable. It therefore makes sense to collect quantitative data in addition to qualitative data and to use statistical methods to analyze the former, i.e. to use both words and numbers in the analysis [18].
Empirical software engineering (ESE) is a relatively new research field compared to, for example, psychology. ESE has come a long way and made great advances, and some researchers have stressed the importance of evidence-based software engineering (see e.g. [14]). They highlight an example where the same data was used for validation as for factor extraction, which of course should never be allowed. However, we believe the software engineering field is ready for more scientific and precise guidelines for quantitative survey data in human factors research. There seems to be a gap in the usage of statistical validation between the more technical aspects of software engineering and the human factors aspects.
When building prediction models for software, for exam-
ple, a factor analysis seems to be recommended [11], but in
softer software engineering aspects it is not standard prac-
tice and many authors get survey studies published without
such validation of scales (see e.g. [4, 2, 7]). When it comes to human factors research in software engineering, whether such tests are used seems to depend more on the statistical knowledge of the authors than on common practice for publication, i.e. many human factors survey studies in software engineering include such tests (see e.g. [27]), but far from all.
Such methods are also widely used in high-impact management journals (see e.g. [21, 8]). The problem is that if one skips this part and directly runs statistical tests on, for example, the correlation between one measurement and another, the results make little sense, since we do not know whether we managed to measure the intended construct. Skipping such validation would make editors in more mature fields reject the manuscript [23]. A validation process is of course not only about statistical tests, but they should be conducted as an important step in that process [13].
This paper presents some of these methods, which have been used frequently in e.g. psychology for almost a century and are directly applicable to human factors research in software engineering. The paper is organized as follows: Section 2 describes the similarities between human factors research in software engineering and other, more mature fields, Section 3 presents statistical tests often used in such fields, and Section 4 discusses the previous sections and their implications for software engineering research.
2. SOCIAL SCIENCE RESEARCH
Many subareas within software engineering have social sci-
ence aspects. Many researchers have stated, for example,
that agility is undeniably a soft as well as a hard issue [19,
25]. However, we prefer calling these types of issues simple (hard) and complex (soft), since these terms describe them better. A hard issue is a simple one with few and clear variables, whereas a soft issue is a complex adaptive social system.
In order to clearly explain how we believe this is appli-
cable to software engineering we will use agile development
processes as an example. Studies have been conducted that
set out to investigate the social or cultural aspects of ag-
ile development. Whitworth and Biddle [26], for example, show that social-psychological aspects need to be considered to fully understand how agile teams function. There is also a set of studies connecting agile methods to organizational culture. These connect the agile adoption process to culture to see if there are cultural factors that could jeopardize the agile implementation, which there are [12, 24]. One study divides culture into layers of different visibility, following Schein [20], and shows that an understanding of these culture layers increases the understanding of how an agile culture could be established [25].
If we want to investigate these types of issues we can look
at other fields that have been dealing with social science
for more than half a century. If we use humans and their
opinions in research we only investigate their perception of
what is happening in the organization. Even if this is the
case, we still need to check that the items used to measure a construct actually capture it, i.e. that the items are different but still correlated in relation to the construct under investigation. In order to do this we need two things: first, a reasonably large sample representative of the population (and large enough to remove individual and cohort bias); and second, items that are correlated and that pinpoint our construct of interest. The latter is often forgotten in software engineering research.
To simplify this explanation, let us look at a simple example. If a test (e.g. a survey) gives the correlation matrix in Table 1, the corresponding factor loadings would be the ones shown in Table 2.
How to obtain the factors in a factor analysis is an advanced mathematical procedure and we will not go into detail on how the calculations are conducted. However, the main reasoning behind the technique is as follows: let $p$ be the number of variables ($X_1, X_2, \ldots, X_p$) and $m$ the number of underlying factors $F_1, F_2, \ldots, F_m$ (the model presumes that such underlying factors exist).
Table 1: A correlation matrix.

                      A     B     C     D     E     F
  A (reading)        1.00   .60   .50   .15   .20   .10
  B (vocabulary)           1.00   .50   .15   .10   .10
  C (spelling)                   1.00   .10   .20   .15
  D (addition)                         1.00   .60   .60
  E (subtraction)                            1.00   .60
  F (multiplication)                               1.00
Table 2: The corresponding factor matrix with factor loadings.

                      Factor 1   Factor 2
  A (reading)           .70        .15
  B (vocabulary)        .60        .15
  C (spelling)          .60        .10
  D (addition)          .10        .70
  E (subtraction)       .10        .65
  F (multiplication)    .05        .60
Each measured (or observed) variable is then a linear combination of the latent factors that reproduces the maximum correlations: $X_j = a_{j1}F_1 + a_{j2}F_2 + \ldots + a_{jm}F_m + e_j$, where $j = 1, 2, \ldots, p$ and $a_{j1}$ is the factor loading of the $j$th variable on the first factor, and so on. The factor loadings can be seen as weights (or contrasts) of a linear regression model and tell us how much the variable has contributed to the factor. There are different extraction techniques for factor analysis: if the error variance is included, the method is called Principal Component Analysis, since all variance is used to find the factors; if the error variance is excluded, it is often called Principal Axis Factoring, but the output pattern matrices are very similar [6].
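To make this concrete, the following is a minimal sketch (in Python, using only NumPy) of a principal-component-style extraction from the correlation matrix in Table 1. The unrotated loadings will not match the rotated solution in Table 2 exactly, and the signs of the loadings are arbitrary, but the same two-factor structure emerges.

import numpy as np

# Correlation matrix from Table 1 (the upper triangle made symmetric).
R = np.array([
    [1.00, 0.60, 0.50, 0.15, 0.20, 0.10],  # A (reading)
    [0.60, 1.00, 0.50, 0.15, 0.10, 0.10],  # B (vocabulary)
    [0.50, 0.50, 1.00, 0.10, 0.20, 0.15],  # C (spelling)
    [0.15, 0.15, 0.10, 1.00, 0.60, 0.60],  # D (addition)
    [0.20, 0.10, 0.20, 0.60, 1.00, 0.60],  # E (subtraction)
    [0.10, 0.10, 0.15, 0.60, 0.60, 1.00],  # F (multiplication)
])

# Eigendecomposition of R; including all variance corresponds to a
# principal-component-style extraction.
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]          # sort factors by explained variance
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Keep factors with eigenvalue > 1 (the usual extraction rule of thumb);
# for this matrix the rule retains two factors.
keep = eigenvalues > 1.0
loadings = eigenvectors[:, keep] * np.sqrt(eigenvalues[keep])

print("Eigenvalues:", np.round(eigenvalues, 2))
print("Unrotated loadings (rows = items A-F, columns = factors):")
print(np.round(loadings, 2))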
In this case we probably had reason to believe that there
were two factors (or constructs) based on the variables in the
study. If, for example, addition and subtraction were uncorrelated in our result, we would have reason to doubt our measurement of the construct of mathematical skill. Even if a construct like "mathematical skill" is ambiguous, we still need to make sure our subset (like the mathematical operations addition, subtraction, and multiplication) is valid. In
this simple example we would have empirical support that
items A, B, and C describe one construct and D, E, and F
another, which also makes sense (i.e. high face validity). In
Section 3 we explain this method in more detail.
We can extend the same reasoning to that of “agility”.
Even if we do not have a clear definition of this term we can
still research agile practices or behavior. For example, if we
want to research Integration Testing or Retrospectives, we
must use items that are different but correlate in a satis-
factory way. The only scientifically validated scale for agile practices that we found was the one in the article "Perceptive Agile Measurement: New Instruments for Quantitative Studies in the Pursuit of the Social-Psychological Effect of Agile Practices" by So and Scholl [22], where they use the method presented in this paper. As an example, their scale for "Retrospectives" (assessed on a Likert scale from 1 = Never to 7 = Always) contains the items: (1) How often did you apply
retrospectives? (2) All team members actively participated
in gathering lessons learned in the retrospectives. (3) The
retrospectives helped us become aware of what we did well
in the past iteration/s. (4) The retrospectives helped us
become aware of what we should improve in the upcoming
iteration/s. (5) In the retrospectives we systematically as-
signed all important points for improvement to responsible
individuals. (6) Our team followed up intensively on the
progress of each improvement point elaborated in a retro-
spective.
These items were developed using two pretests and a val-
idation study with a sample of N=227. When running such
an analysis on other agile measurement tools, they often show problems with validity, since the quantitative validation step was disregarded in their development [10].
The whole discussion of "if we measure what we think we measure" is dealt with in most papers in the Validity Threats section. But what is validity? When it comes to tests in human systems we cannot just look at the measurement tool itself, but must also consider the context and the interpretation of test scores, as in the definition of validity by Messick [17]:
“Validity is not a property of the test or assess-
ment as such, but rather of the meaning of the
test scores. These scores are a function not only
of the items or stimulus conditions, but also of
the persons responding as well as the context
of the assessment. In particular, what needs to
be valid is the meaning or interpretation of the
score; as well as any implications for action that
this meaning entails.”
This means we always validate the usage of a test, and never
the test itself.
When soft issues are investigated in psychology they are, of course, analyzed using both quantitative and qualitative data. However, after a more exploratory investigation (usually through qualitative case studies) we need to proceed and collect empirical evidence of the phenomenon (or construct) we found, and see whether the numbers support our ideas. Of course we need a holistic view of validity, and studies using quantitative data sometimes get undeserved credibility
ing quantitative data sometimes get undeserved credibility
since the mathematical methods alone can seem advanced
and serious. However, this is also what we see as a danger
in software engineering survey research. If we create a sur-
vey and skip the statistical validation procedures we do not
know what we measure. The statistical validation is only
one aspect of the validation needed, but we should at least
make sure that part supports our hypotheses.
3. USEFUL STATISTICAL METHODS FOR
VALIDATION
There are many ways to categorize and list validity threats
in different fields. In this paper we will only present a few aspects of validity where statistical tests can help us. These
are: (1) Reliability: Stability. (2) Reliability: Internal Con-
sistency. (3) Some aspects of Construct Validity.
The first in the list can quite easily be assessed by conducting a test and a retest in the same context, and then calculating Pearson's correlation coefficient for the repeated measurements (a measure of statistical dependence between two random variables or two data sets). If the first test is strongly correlated with the second (a coefficient close to 1), we have some evidence that our test is stable.
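As an illustration, the following minimal sketch computes such a test-retest correlation in Python with SciPy; the two score vectors are hypothetical totals for the same ten respondents at the two measurement occasions.

import numpy as np
from scipy.stats import pearsonr

# Hypothetical total scores for the same respondents at test and retest.
test   = np.array([23, 31, 28, 35, 19, 27, 30, 25, 33, 22])
retest = np.array([25, 30, 27, 36, 20, 26, 31, 24, 32, 23])

# Pearson's correlation coefficient between the two measurement occasions;
# a value close to 1 indicates a stable test.
r, p_value = pearsonr(test, retest)
print(f"Test-retest reliability (Pearson's r): {r:.2f}, p = {p_value:.4f}")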
For testing the internal consistency of a scale, the most common method used is Cronbach's α [5]. Without providing a mathematical definition, α can be seen as an overall correlation coefficient for a set of items. The α is a function of the number of items in a test, the mean covariance between item pairs, and the variance of the overall total score. The idea is that if a set of items is meant to measure a certain construct, the included items must be correlated. Cronbach's α can be used as a step in an Exploratory Factor Analysis (EFA), which is a way to test aspects of Construct Validity.
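For completeness, the sketch below computes Cronbach's α directly from its standard definition in terms of the number of items, the item variances, and the variance of the total score; the response matrix is hypothetical Likert data, with one row per respondent and one column per item.

import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a (respondents x items) response matrix."""
    k = items.shape[1]                               # number of items in the scale
    item_variances = items.var(axis=0, ddof=1)       # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of the total score
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical Likert responses (rows = respondents, columns = items of one scale).
responses = np.array([
    [4, 5, 4, 3],
    [2, 3, 2, 2],
    [5, 5, 4, 5],
    [3, 4, 3, 3],
    [1, 2, 2, 1],
    [4, 4, 5, 4],
])
print(f"Cronbach's alpha: {cronbach_alpha(responses):.2f}")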
The first step of an EFA is meant to investigate underlying variables in the data (i.e. what factors explain most of the variance orthogonally). The next step is to rotate the factors and group them if they correlate and explain much of the same variance (i.e. the factors in a scale should not correlate too much or too little if they are considered to explain and measure a construct). A factor analysis is thus a statistical aid for finding groups of variables that explain a certain construct in a data set. For more details, see e.g. [6].
The first thing to do when conducting a factor analysis is to make sure the items fulfill the assumptions of such a method, i.e. they need to be correlated with each other in a way that allows them to measure the same concept. Testing the Kaiser-Meyer-Olkin Measure of Sampling Adequacy and Bartlett's Test of Sphericity is a way to do this. Sphericity concerns how much the items, as an overall test, are dependent on each other; the null hypothesis is that the correlation matrix is an identity matrix (no correlations between items, and therefore not suitable for a factor analysis). If this test is significant we can proceed with our analysis. The Kaiser-Meyer-Olkin Measure of Sampling Adequacy tests how correlated the items are; as a rule of thumb, values below .5 are unacceptable, and items with low correlations to the others need to be removed. Suitable items to remove can be selected based on such correlations in an Anti-Image table. If such items are removed and an acceptable value is obtained, we can create a Factor Matrix with our new suggested factors (again, for more details see e.g. [6]). The extraction is often based on eigenvalues > 1, and the first factor will usually explain more variance than the following ones. This is because the algorithm tries to find a set of items that explains as much of the variance as possible, and then the second factor does the same thing for the variance not explained by the first factor, and so on.
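As a sketch of how these checks can be run in practice, the following Python code assumes the third-party factor_analyzer package and a hypothetical file responses.csv with one row per respondent and one column per item; the function and attribute names follow that package's documented interface, but should be verified against the installed version.

import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

# Hypothetical survey data: one row per respondent, one column per item.
items = pd.read_csv("responses.csv")

# Assumption checks: Bartlett's test should be significant, and the overall
# KMO value should be at least .5, before proceeding with the factor analysis.
chi_square, p_value = calculate_bartlett_sphericity(items)
kmo_per_item, kmo_overall = calculate_kmo(items)
print(f"Bartlett's test of sphericity: chi2 = {chi_square:.1f}, p = {p_value:.4f}")
print(f"KMO measure of sampling adequacy (overall): {kmo_overall:.2f}")
print("KMO per item (low values are candidates for removal):")
print(kmo_per_item.round(2))

# Unrotated extraction to inspect the eigenvalues; keep factors with eigenvalue > 1.
fa = FactorAnalyzer(rotation=None)
fa.fit(items)
eigenvalues, _ = fa.get_eigenvalues()
n_factors = int((eigenvalues > 1).sum())
print("Eigenvalues:", eigenvalues.round(2))
print("Factors retained (eigenvalue > 1):", n_factors)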
We would also like to note that the factor rotation can be done under the assumption that the factors are not dependent on each other (an orthogonal rotation) or under the assumption that they could be dependent (an oblique rotation). In an oblique rotation we also get information about how correlated the resulting factors (i.e. the groups of items) are. If the results are unsatisfactory, they can of course be used to reorganize the factors and to improve or rethink the included items.
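Continuing the sketch above, an oblique rotation could be requested as follows (again assuming the factor_analyzer package; the phi_ attribute holding the factor correlations is our assumption about recent versions of that package and should be checked).

from factor_analyzer import FactorAnalyzer

# Refit with the retained number of factors and an oblique (oblimin) rotation,
# which allows the factors to correlate; rotation="varimax" would instead give
# an orthogonal rotation that assumes independent factors.
fa_oblique = FactorAnalyzer(n_factors=n_factors, rotation="oblimin")
fa_oblique.fit(items)

print("Pattern matrix (loadings after oblique rotation):")
print(fa_oblique.loadings_.round(2))

# With an oblique rotation the correlations between the factors are also
# available (assumed attribute name: phi_).
print("Factor correlation matrix:")
print(fa_oblique.phi_.round(2))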
Getting “enough” data is always a tricky aspect of statis-
tical tests, and maybe more so in software engineering than
management or psychology. When it comes to factor analysis, the sample size needed depends on e.g. the communalities between, and the over-determination of, the factors. Communality (or internal consistency) is the joint ability of the variables to explain the variance in a factor, which can be assessed with the Cronbach's α mentioned above. Over-determination of factors refers to how many variables represent each factor [16].
4. DISCUSSION
To summarize, the method presented above is one way
of checking if the numbers support the idea that a test re-
ally measures what we hope it does. This will not replace other aspects of validity, but we do not see any disadvantages in collecting empirical evidence for surveys used in research. In the field of e.g. psychology, researchers need to be very careful about stating that surveys that have not been scientifically validated give any kind of evidence for a certain research hypothesis. We believe the field of software engineering should be just as careful when using poorly validated tools, both in research and in practice.
Of course we realize that getting a lot of data can be cumbersome. Sometimes we need to do as well as we can
given small samples and scarce information. However, the
research community should be aware of the drawbacks and,
at least, use statistical validation techniques when applicable
to a data set.
The main hindrance to using such statistical methods is that knowledge of statistics is generally lower than what is needed to understand the methods presented here. This is a dilemma in all social science research and we understand its implications, but we believe that some more training in applied statistics and methods (i.e. psychometrics) in software engineering education could somewhat reduce this knowledge gap. In addition, all the tests we have presented here are implemented in most statistical software (including open source alternatives), and what remains is to interpret the output.
5. REFERENCES
[1] D. J. Bartholomew. Stochastic models for social
processes. Wiley, London, 1967.
[2] A. Bosu, J. Carver, R. Guadagno, B. Bassett,
D. McCallum, and L. Hochstein. Peer impressions in
open source organizations: A survey. Journal of
Systems and Software, 94:4–15, 2014.
[3] G. E. P. Box and N. R. Draper. Empirical Model-Building and Response Surfaces. John Wiley & Sons, Inc., New York, NY, USA, 1986.
[4] J. Chen, J. Xiao, Q. Wang, L. J. Osterweil, and M. Li.
Perspectives on refactoring planning and practice: an
empirical study. Empirical Software Engineering,
pages 1–40, 2015.
[5] L. Cronbach. Coefficient alpha and the internal
structure of tests. Psychometrika, 16(3):297–334, 1951.
[6] L. Fabrigar and D. Wegener. Exploratory Factor
Analysis. Series in understanding statistics. OUP
USA, 2012.
[7] D. M. Fernández and S. Wagner. Naming the pain in requirements engineering: A design for a global family of surveys and first results from Germany. Information and Software Technology, 57:616–643, 2015.
[8] M. T. Frohlich and R. Westbrook. Arcs of integration: an international study of supply chain strategies. Journal of Operations Management, 19(2):185–200, 2001.
[9] B. Glaser and A. Strauss. The discovery of grounded
theory: Strategies for qualitative research. Aldine
Transaction (a division of Transaction Publishers),
New Brunswick, N.J., 2006.
[10] L. Gren, R. Torkar, and R. Feldt. The prospects of a
quantitative measurement of agility: A validation
study on an agile maturity model. Journal of Systems
and Software, 107:38–49, 2015.
[11] N. Hanebutte, C. S. Taylor, and R. R. Dumke.
Techniques of successful application of factor analysis
in software measurement. Empirical Software
Engineering, 8(1):43–57, 2003.
[12] J. Iivari and N. Iivari. The relationship between
organizational culture and the deployment of agile
methods. Information and Software Technology,
53(5):509–520, 2011.
[13] I. Izquierdo, J. Olea, and F. J. Abad. Exploratory
factor analysis in validation studies: Uses and
recommendations. Psicothema, 26(3):395–400, 2014.
[14] B. A. Kitchenham, S. L. Pfleeger, L. M. Pickard,
P. W. Jones, D. C. Hoaglin, K. El Emam, and
J. Rosenberg. Preliminary guidelines for empirical
research in software engineering. IEEE Transactions
on Software Engineering, 28(8):721–734, 2002.
[15] R. Kurzweil. How to create a mind: the secret of
human thought revealed. Penguin Books, New York,
N.Y., 2013.
[16] R. C. MacCallum, K. F. Widaman, S. Zhang, and
S. Hong. Sample size in factor analysis. Psychological
methods, 4:84–99, 1999.
[17] S. Messick. Validity of psychological assessment: validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9):741, 1995.
[18] M. B. Miles and A. M. Huberman. Qualitative data
analysis: a sourcebook of new methods. Sage, Beverly
Hills, 1984.
[19] P. Ranganath. Elevating teams from ’doing’ agile to
’being’ and ’living’ agile. In Agile Conference
(AGILE), 2011, pages 187–194, Aug 2011.
[20] E. Schein. Organizational culture and leadership.
Jossey-Bass, San Francisco, 4 edition, 2010.
[21] P. Serrador and J. K. Pinto. Does Agile work? – A
quantitative analysis of agile project success.
International Journal of Project Management,
33(5):1040–1051, July 2015.
[22] C. So and W. Scholl. Perceptive agile measurement:
New instruments for quantitative studies in the
pursuit of the social-psychological effect of agile
practices. In Agile Processes in Software Engineering
and Extreme Programming, pages 83–93. Springer,
2009.
[23] B. Thompson and L. G. Daniel. Factor analytic evidence for the construct validity of scores: A historical overview and some guidelines. Educational and Psychological Measurement, 56(2):197–208, 1996.
[24] C. Tolfo and R. Wazlawick. The influence of organizational culture on the adoption of extreme programming. Journal of Systems and Software, 81(11):1955–1967, 2008.
[25] C. Tolfo, R. Wazlawick, M. Ferreira, and F. Forcellini.
Agile methods and organizational culture: Reflections
about cultural levels. Journal of Software Maintenance
and Evolution: Research and Practice, 23(6):423–441,
2011.
[26] E. Whitworth and R. Biddle. The social nature of
agile teams. In Agile Conference (AGILE), 2007,
pages 26–36. IEEE, 2007.
[27] C. Wohlin. Are individual differences in software
development performance possible to capture using a
quantitative survey? Empirical Software Engineering,
9(3):211–228, 2004.