Comparing Computational Thinking Development
Assessment Scores with Software Complexity
Metrics
Jesús Moreno-León
Programamos.es
Sevilla, Spain
jesus.moreno@programamos.es
Gregorio Robles
Universidad Rey Juan Carlos
Madrid, Spain
grex@gsyc.urjc.es
Marcos Román-González
Universidad Nacional de Educación
a Distancia
Madrid, Spain
mroman@edu.uned.es
Abstract—The development of computational thinking skills
through computer programming is a major topic in education, as
governments around the world are introducing these skills in the
school curriculum. In consequence, educators and students are
facing this discipline for the first time. Although there are many
technologies that assist teachers and learners in the learning of
this competence, there is a lack of tools that support them in the
assessment tasks. This paper compares the computational
thinking score provided by Dr. Scratch, a free/libre/open source
software assessment tool for Scratch, with McCabe’s Cyclomatic
Complexity and Halstead’s metrics, two classic software
engineering metrics that are globally recognized as a valid
measurement for the complexity of a software system. The
findings, which show positive, significant, moderate to strong correlations between the measures, can therefore be considered a validation of the complexity assessment process of Dr. Scratch.
Keywords—computational thinking, Scratch, programming,
assessment tools, Dr. Scratch, software metrics, complexity
I. INTRODUCTION
In recent times, the development of computational thinking
skills is a major topic of interest among educational
institutions, policy makers and academia [1]. All over the world,
initiatives are spreading to promote this skill through
computer programming from an early age, both in formal and
informal environments [2]. Many technologies have been created to assist learners in developing these skills, such as the historic Logo [3], or the more recent
Scratch [4] or Alice [5] programming languages, the
introduction of robots in the classroom [6], or the availability
of affordable hardware devices such as Raspberry Pi or
Arduino [7]. However, learning assessment tools that support learners and educators in the development of computational thinking skills are still scarce. With our contribution we want to shed some light on how Dr. Scratch [8], a free/libre/open source software assessment tool for Scratch, correlates with classic software engineering complexity metrics, and to provide a perspective on how such tools should be designed to best support teachers and learners.
The paper is structured as follows. In Section II we justify
the need for tools that support learners and educators in the
assessment and development of computational thinking skills,
present the inner workings of Dr. Scratch and of classic software engineering complexity metrics, and review the literature
comparing several software metrics. Section III explains the
procedure we have followed in this investigation, presenting
the implementation of classic metrics for Scratch, specifically
McCabe’s cyclomatic complexity and Halstead’s measures,
and justifying the selection of the Scratch projects that have
been analyzed to compare the complexity score by Dr. Scratch
with those obtained by using the traditional complexity
metrics. Section IV presents the results of the study, discussing the significance of the correlations detected between metrics. Finally, Section V includes the conclusions of our study and
ideas for future work.
II. BACKGROUND
In recent years, governments around the world have
introduced computer programming in their national or regional
educational curriculum to develop computational thinking
skills of students [9, 10]. In K-12, the most common
instrument to develop these skills is the use of visual
programming languages based on blocks [11]. Scratch is
undoubtedly the most used language in this educational
environment. At the time of writing this paper, the Scratch
statistics1 indicate that there are more than 8 million registered
users, and over 11 million projects are publicly shared in the
Scratch web repository.
Despite the obvious success of these tools in bringing
programming and computational thinking to young people,
there are studies showing that students learning to program with these platforms develop programming habits that are contrary to accepted practice in computer science [12, 13]. This situation is due to the lack of tools that support both learners and educators in the assessment of their learning.
In order to amend this situation, the authors created Dr.
Scratch, a web tool that analyses Scratch projects to offer
feedback to educators and learners and assigns a
computational thinking score to the projects. Its effectiveness in fostering computational thinking skills has been shown [8], as
learners use the feedback provided by the tool to improve their
programs. Learners also realize with the help of Dr. Scratch how to improve their programming skills. Dr. Scratch is being used both by students and teachers from different educational levels, but also by organizations with programming initiatives that need data to evaluate their projects' effectiveness [14].

1 https://scratch.mit.edu/statistics/
The computational thinking score assigned by Dr. Scratch,
which ranges from 0 to 21 points, is based on the degree of
development of different dimensions of the computational
thinking competence, such as abstraction and problem
decomposition, logical thinking, synchronization, parallelism,
algorithmic notions of flow control, user interactivity and data
representation, which are statically evaluated by inspecting the source code of the analyzed project [15]. It is therefore a
complexity value that could be compared with other, classic
software engineering metrics used to measure the complexity
of a program.
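As a rough illustration of how this total is obtained (the concrete per-dimension rules applied by Dr. Scratch are described in [15] and are not reproduced here; the values below are invented for the example), the 0 to 21 score is simply the sum of seven partial scores, each ranging from 0 to 3 points:

# Illustrative sketch only: the seven CT dimensions evaluated by Dr. Scratch,
# each scored from 0 (absent) to 3 (proficient). The concrete values below
# are invented for this example; the real scoring rules are described in [15].
dimension_scores = {
    "abstraction and problem decomposition": 2,
    "logical thinking": 3,
    "synchronization": 1,
    "parallelism": 2,
    "algorithmic notions of flow control": 3,
    "user interactivity": 2,
    "data representation": 2,
}

# Total CT score: the sum of the partial scores, hence the 0-21 range.
ct_score = sum(dimension_scores.values())
print(ct_score)  # 15 for these example values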
The cyclomatic complexity (CC) is a graph-theoretic
complexity measure that can be used to manage and control
program complexity [16]. This metric is based on the number
of linear independent paths in a program, and can be used to
establish the number of test cases in the basis path testing
methodology [17]. Halstead’s metrics identify certain
properties of a program that can be measured and the
relationships between them to assess software complexity
[18]. These metrics have been widely used in software
engineering to estimate maintenance efforts and guide
software testing by identifying complex, hard to maintain
modules [19].
These software complexity metrics have been extensively
investigated and compared in the last decades. Henry, Kafura
and Harris [20] reported the results of correlation studies made
among three complexity metrics applied to 165 procedures in
the UNIX operating system, detecting high correlations
between CC and Halstead’s metrics (correlation coefficient
CC-H’s length = 0.9145, correlation coefficient CC-H’s effort
= 0.8411). Li and Cheung [21] compared 31 metrics by
analyzing 255 FORTRAN programs and detected strong
correlations between them (correlation coefficient CC-H’s
vocabulary = 0.886). More recently, Zhang and Baddoo [22]
conducted an investigation to compare the performance of
different complexity metrics using data from the Eclipse JDT open source project, detecting a significant correlation between
CC and Halstead’s effort (correlation coefficient = 0.6965).
These results confirm a certain consistency between these
metrics. In consequence, since the Dr. Scratch score is a value that indicates some kind of complexity in Scratch code, this investigation aims to find correlations between CC, Halstead's metrics and the Dr. Scratch score by analyzing 100 projects from the Scratch repository.
III. METHODOLOGY
A. Implementation of the metrics analyzer for Scratch
projects
To implement the CC and Halstead’s analyzer for Scratch
projects, the authors have developed a Hairball plug-in called
metrics2. Hairball [23] is a static code analyzer for Scratch
projects that is programmed in Python and allows adding plug-
ins to perform new types of analysis.
2 https://github.com/jemole/hairball/blob/master/hairball/plugins/metrics.py
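The plug-in itself is the one linked in the footnote above. Purely as an illustration of the kind of counting it performs, and without reproducing Hairball's actual API, a minimal standalone sketch could tally decision blocks, operators and operands from a script that has already been flattened into (token, kind) pairs; the block names and the operator/operand split used here are assumptions for the example:

# Illustrative sketch only; the real analyzer is the Hairball "metrics"
# plug-in referenced in the footnote. A script is assumed to be a flat
# list of (token, kind) pairs, where kind is "operator" for blocks and
# "operand" for literals and variables.
from collections import Counter

# Blocks that introduce a decision point (an assumed, non-exhaustive set).
DECISION_BLOCKS = {"if %s then%s", "if %s then%s else%s", "repeat until %s%s"}

def count_tokens(script):
    """Return the raw counts from which the complexity metrics are derived."""
    operators = Counter(tok for tok, kind in script if kind == "operator")
    operands = Counter(tok for tok, kind in script if kind == "operand")
    decisions = sum(count for tok, count in operators.items()
                    if tok in DECISION_BLOCKS)
    return {
        "decision_points": decisions,
        "n1": len(operators), "N1": sum(operators.values()),  # distinct / total operators
        "n2": len(operands),  "N2": sum(operands.values()),   # distinct / total operands
    }

The counts produced by such a routine are the inputs to the CC and Halstead computations described next.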
The implementation of the CC is based on the number of
decision points, as McCabe showed that this number plus 1 is
equal to the number of basic paths through a structured
program [16]. The calculation of Halstead's metrics is
based on the number of distinct operators, the number of
distinct operands, the total number of operators and the total
number of operands of the program, denoted as n1, n2, N1 and
N2, respectively [18]:
• Vocabulary: the total number of distinct operators and operands. n = n1 + n2
• Length: the sum of all tokens needed for the computation of the program. N = N1 + N2
• Volume: the number of bits necessary to represent the program. V = N * log2 n
• Difficulty: used to compare different implementations of the same algorithm; the longer the implementation, the higher the difficulty. D = (n1 / 2) * (N2 / n2)
• Effort: the effort required to understand or create a program. E = D * V
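A direct transcription of these formulas (together with McCabe's rule of decision points plus one) might look as follows; this is a sketch under the assumption that the token counts have already been extracted, for instance by a routine like the count_tokens sketch above, and it is not the code of the actual plug-in:

import math

def cyclomatic_complexity(decision_points):
    # McCabe: the number of decision points plus one [16].
    return decision_points + 1

def halstead(n1, N1, n2, N2):
    # Halstead's measures from distinct/total operator and operand counts [18].
    n = n1 + n2                   # vocabulary
    N = N1 + N2                   # length
    V = N * math.log2(n)          # volume
    D = (n1 / 2) * (N2 / n2)      # difficulty
    E = D * V                     # effort
    return {"vocabulary": n, "length": N, "volume": V,
            "difficulty": D, "effort": E}

# Hypothetical counts, for illustration only.
print(cyclomatic_complexity(2))            # two decision points -> CC = 3
print(halstead(n1=5, N1=7, n2=4, N2=6))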
Table I presents the results of calculating both metrics for the Scratch program shown in Fig. 1. CC is 3, as there are two decision points ('repeat until %s%s', 'if %s then%s'). The values for Halstead's metrics derive from the operator and operand lists, with n1=9, N1=9, n2=6, N2=8:
• Operators: 'touching %s?', 'set %s to %s', 'say %s', 'change %s by %s', 'if %s then%s', '%s > %s', 'repeat until %s%s', '%s', 'when @greenFlag clicked'
• Operands: 'x pos' (*3), 0, 'Enemy', 10, 200, 'I win!!'
Fig. 1: Scratch program

TABLE I. CYCLOMATIC COMPLEXITY AND HALSTEAD METRICS FOR THE SCRATCH PROJECT IN FIGURE 1
Metric Value
Cyclomatic complexity 3
Vocabulary 14
Length 14
Volume 66.42
Difficulty 53.30
Effort 293.86

B. Study sample
In a previous investigation [24], the authors detected
differences in the score of the computational thinking
dimensions measured by Dr. Scratch among several kinds of
Scratch projects. For instance, storytelling projects tended to
score very low in terms of logical thinking, as stories have a linear structure with fewer branches and decision points. This is
congruent with an investigation that compared the
programming skills fostered when coding different types of
projects [25].
In consequence, aiming to select projects with a broad
range of Dr. Scratch scores, we downloaded different types of
Scratch projects. As projects in the Scratch repository are
tagged by categories, we randomly selected 25 projects of
each of the following categories: stories, animations, games
and art creations. Table II shows the characteristics of the
study sample regarding the computational thinking score
assigned by Dr. Scratch. Although 100 projects were
downloaded, 5 of them produced an error while being
analyzed, which limits the sample size to 95 projects. The
mean score was 13.75, while both median and mode were 15.
Fig. 2 shows the frequency histogram of the computational
thinking scores assigned by Dr. Scratch. As can be seen, there
is a broad range of scores, with projects from 5 to 20 points,
although there is a majority of projects rated with 14 or more
points.
TABLE II. DR. SCRATCH COMPUTATIONAL THINKING SCORE OF
THE ANALYZED PROJECTS
Fig. 2: Frequency histogram of Dr. Scratch CT Score of the analyzed
projects
IV. FINDINGS
Table III shows the descriptive statistics of the analyzed
projects for the different metrics. As can be seen, for this study
the only Halstead measurements that have been taken into
account are Vocabulary and Length. Other values, such as
Volume or Effort, are nonlinear, and would thus require a
different statistical analysis. Nonetheless, given the goals of
this investigation, Vocabulary and Length are considered
sufficient to find correlations between Dr. Scratch and
Halstead’s metrics.
Table IV shows the correlations between each of the
metrics. As shown, Dr. Scratch computational thinking score
has a positive, significant, moderate to strong correlation with
both CC and Vocabulary, and a positive, significant, moderate correlation with Length. This indicates that the
complexity measurement of Dr. Scratch is in line with other,
classic software engineering complexity metrics, which could
be considered as a validation of the complexity assessment
process of the tool.
In line with the reviewed literature, Halstead's metrics and McCabe's CC also have a positive, significant, strong correlation for the sample of analyzed Scratch projects.
As expected, the correlation between Halstead’s Vocabulary
and Length is also positive, significant and strong.
TABLE III. MEAN AND STANDARD DEVIATION FOR EACH METRIC
TABLE IV. CORRELATION BETWEEN METRICS
** Significant correlation 0.01 level (bilateral)
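The correlations reported in Table IV can be reproduced with any standard statistics package. As a minimal sketch using SciPy (assuming the per-project values are available as plain lists, and assuming Pearson's coefficient, which the paper does not state explicitly):

from scipy.stats import pearsonr

# Hypothetical per-project values; the actual study uses 95 projects.
ct_scores = [13, 15, 8, 20, 15, 11, 17, 9]
cc_values = [4, 6, 2, 11, 5, 3, 9, 2]

r, p_value = pearsonr(ct_scores, cc_values)
print("r = %.3f, p = %.4f" % (r, p_value))  # significant at the 0.01 level if p < 0.01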
Fig. 3 shows the scatter plot of Dr. Scratch computational
thinking score and CC with its best fitting line. The coefficient
of determination is r2 = .385, which indicates that 38.5% of the
variance of Dr. Scratch CT score of a project can be predicted
from its CC value. As can be seen, the relationship between
these two variables is better represented by the linear model
for values under 16 Dr. Scratch CT points. However, for those
projects with a higher score, the model is not as accurate.
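The best fitting lines and coefficients of determination shown in Figs. 3 to 5 can be obtained with an ordinary least-squares fit; a minimal NumPy sketch follows (the arrays are hypothetical placeholders for the per-project values):

import numpy as np

# Hypothetical data: cyclomatic complexity and Dr. Scratch CT score per project.
cc = np.array([4, 6, 2, 11, 5, 3, 9, 2], dtype=float)
ct = np.array([13, 15, 8, 20, 15, 11, 17, 9], dtype=float)

slope, intercept = np.polyfit(cc, ct, 1)   # best fitting line
predicted = slope * cc + intercept

# Coefficient of determination: share of the CT-score variance explained by CC.
ss_res = np.sum((ct - predicted) ** 2)
ss_tot = np.sum((ct - ct.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print("CT = %.2f * CC + %.2f, r2 = %.3f" % (slope, intercept, r2))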
A similar situation is depicted in Fig. 4, where the scatter
plot of Dr. Scratch computational thinking score and
Halstead’s Vocabulary with its best fitting line is shown. In
this case r2 = .461, which indicates a better fit of the linear model than the previous one. However, it is clear that for those
projects with a Dr. Scratch CT score over 16 points, the model
does not represent the relationship between the metrics with
the same accuracy.
Fig. 5 presents the scatter plot of Dr. Scratch
computational thinking score and Halstead’s Length with its
best fitting line, where r2 = .319. As shown in the picture, in
general, the projects with a Dr. Scratch CT score below 16
points are placed much closer to the best fitting line than those
with a higher score.
Fig. 3: Scatter plot of Dr. Scratch CT Score and Cyclomatic Complexity with
its best fitting line. r2 = .385.
Fig. 4. Scatter plot of Dr. Scratch CT Score and Halstead’s Vocabulary with its
best fitting line. r2 = .461.
The fact that the same phenomenon has been detected in the three scatter plots seems to indicate that the range of Dr. Scratch CT scores, from 0 to 21 points, may not be granular enough to represent the differences in complexity among the most complex projects; consequently, there seems to be room to extend the range of scores of this assessment tool.
Dr. Scratch CT score is calculated by summing the partial
scores assigned to several CT dimensions: abstraction and
problem decomposition, logical thinking, algorithmic notions
of flow control, synchronization, parallelism, data
representation and user interactivity. Each of these dimensions is assigned a score between 0 and 3 points based on the degree of development found in the source code. In light of the results obtained, it seems plausible that widening the scoring range of these dimensions could enhance the correlation with classic complexity metrics, as differences between complex projects would then be better represented.

Fig. 5. Scatter plot of Dr. Scratch CT Score and Halstead's Length with its best fitting line. r2 = .319
In this regard, Table V shows the correlations between each of the CT dimensions taken into account in the Dr. Scratch CT score, and CC and Halstead's metrics. The results indicate that the dimensions that seem to have a higher impact on software complexity are data representation, synchronization and logical thinking, as their correlation coefficients are the highest
in the three cases. On the contrary, user interactivity and flow
control are the dimensions that present lower correlations and,
therefore, seem to have a reduced influence on the complexity
of the Scratch programs. An explanation for these results
would require a thorough study that is out of the scope of this
investigation. Future research will contribute to understanding these relationships.
V. CONCLUSIONS AND FUTURE WORK
As the introduction of computational thinking in the school
curricula is becoming mainstream worldwide, a vast majority
of educators and learners face this discipline for the first time.
In consequence, there is a need for tools that support both
students and teachers in the assessment of the learning of this
competence.
Dr. Scratch allows learners to evaluate their projects to
receive a computational thinking score, so they can discover
their own degree of development of this ability. In addition,
the tool offers a gamified feedback report with ideas and tips
to improve the code, aiming to encourage students’ desire to
keep on developing their programming skills. On the other
hand, Dr. Scratch also supports teachers in the assessment
tasks, saving time and offering a complete report that could be
used to identify specific issues in an automatic way.
Tools like Dr. Scratch are warmly welcomed by the educational community, but their assessment process must be externally validated in order to provide useful feedback. Thus, this investigation studies the correlation of the Dr. Scratch CT score with McCabe's Cyclomatic Complexity and Halstead's metrics. These classic software engineering complexity metrics are globally recognized as a valid measurement of the complexity of a software system. The positive, significant, moderate to strong correlations, as well as the linear models that represent the relationships between the metrics described in this paper, can therefore be considered a validation of the complexity assessment process of Dr. Scratch.

TABLE V. CORRELATION BETWEEN DR. SCRATCH CT DIMENSIONS AND MCCABE'S CC AND HALSTEAD'S METRICS
Nevertheless, the comparison between metrics performed
in this investigation is a first step. Future research could
extend the scope of the study by thoroughly analyzing
correlations and linear models between these classic software
engineering metrics and each of the dimensions involved in
the Dr. Scratch CT score. In addition, since CC and Halstead's metrics are used in industry to estimate maintenance effort, in terms of potential bugs or problematic software modules, these measurements could also be used to try to infer the number of issues in Scratch projects, such as dead code or duplicated scripts.
The work presented in this paper should be seen as a first
step in the validation process of the tool. Thus, in the context of a Scratch contest for primary and secondary students, we are currently comparing the Dr. Scratch CT score with the assessments provided by a panel of experts with several years of experience in the evaluation of Scratch projects, who formed the jury of the contest.
Consequently, future research could contribute to the design and enhancement of assessment tools to support
educators and pupils in the development of computational
thinking in schools in a standardized and externally validated
manner.
ACKNOWLEDGMENT
The work of all authors has been funded in part by the
Region of Madrid under project “eMadrid - Investigación y
Desarrollo de tecnologías para el e-learning en la Comunidad
de Madrid” (S2013/ICE-2715). The authors are very thankful
to Eva Hu Garres and Mari Luz Aguado for their technical
support with Dr. Scratch.
REFERENCES
[1] S. Y. Lye and J. H. L. Koh. "Review on teaching and learning of computational thinking through programming: What is next for K-12?". Computers in Human Behavior, 41, 51-61, 2014.
[2] J. Moreno-León and G. Robles. "The Europe Code Week (CodeEU) initiative: shaping the skills of future engineers". In Global Engineering Education Conference (EDUCON), pages 561-566. IEEE, 2015.
[3] S. Papert (1980). Mindstorms: Children, computers, and powerful ideas.
Basic Books, Inc.
[4] M. Resnick, J. Maloney, A. Monroy-Hernández, N. Rusk, E. Eastmond,
K. Brennan, A. Millner, E. Rosenbaum, J. Silver, B. Silverman, et al.
“Scratch: Programming for all”. Communications of the ACM,
52(11):60–67, 2009.
[5] S. Cooper, W. Dann, and R. Pausch. "Alice: a 3-D tool for introductory programming concepts." Journal of Computing Sciences in Colleges. Vol. 15. No. 5. Consortium for Computing Sciences in Colleges, 2000.
[6] F. B. V. Benitti. Exploring the educational potential of robotics in
schools: A systematic review. Computers & Education, 58(3):978–988,
2012.
[7] J. Sobota, P. Balda, and M. Schlegel. "Raspberry Pi and Arduino boards in control education." Advances in Control Education. Vol. 10. No. 1. 2013.
[8] J. Moreno-León and G. Robles. "Dr. Scratch: Automatic Analysis of
Scratch Projects to Assess and Foster Computational Thinking." RED.
Revista de Educación a Distancia (46), 2015.
[9] European Schoolnet. "Computing our future. Computer programming and coding - Priorities, school curricula and initiatives across Europe". Technical Report, European Schoolnet, 2015. URL: http://www.eun.org/publications/detail?publicationID=661
[10] S. Grover and R. Pea. Computational thinking in K–12. A review of the
state of the field. Educational Researcher, 42(1):38–43, 2013.
[11] D. Weintrop and U. Wilensky. “To block or not to block, that is the
question: students’ perceptions of blocks-based programming”. In Proc.
of the 14th Annual IDC Conference (Boston, MA), 2015.
[12] O. Meerbaum-Salant, M. Armoni, and M. Ben-Ari. "Habits of programming in Scratch." Proceedings of the 16th Annual Joint Conference on Innovation and Technology in Computer Science Education. ACM, 2011.
[13] J. Moreno and G. Robles. "Automatic detection of bad programming habits in Scratch: A preliminary study." Frontiers in Education Conference (FIE), 2014 IEEE (pp. 1-4). IEEE, 2014.
[14] J. Moreno-León and G. Robles. "Dr. Scratch: a Web Tool to Automatically Evaluate Scratch Projects." Proceedings of the Workshop in Primary and Secondary Computing Education (pp. 132-133). ACM, 2015.
[15] J. Moreno-León and G. Robles. "Analyze your Scratch projects with Dr. Scratch and assess your computational thinking skills." Scratch Conference (pp. 12-15). 2015.
[16] T.J. McCabe. "A complexity measure." Software Engineering, IEEE
Transactions on 4: 308-320. 1976.
[17] A.H. Watson, T.J. McCabe, and D. R. Wallace. "Structured testing: A
testing methodology using the cyclomatic complexity metric." NIST
special Publication 500.235: 1-114., 1996.
[18] M. H. Halstead. Elements of Software Science (Operating and Programming Systems Series). Elsevier Science Inc., 1977.
[19] D. Kafura and G. R. Reddy. "The use of software complexity metrics in software maintenance." IEEE Transactions on Software Engineering, (3), 335-343, 1987.
[20] S. Henry, D. Kafura, and K. Harris. "On the relationships among three software metrics." ACM SIGMETRICS Performance Evaluation Review. Vol. 10. No. 1. ACM, 1981.
[21] H. F. Li and W. K. Cheung. "An empirical study of software metrics." Software Engineering, IEEE Transactions on 6: 697-708, 1987.
[22] M. Zhang and N. Baddoo. "Performance comparison of software complexity metrics in an open source project." Software Process Improvement. Springer Berlin Heidelberg, pp. 160-174, 2007.
[23] B. Boe, et al. "Hairball: Lint-inspired static analysis of Scratch projects." Proceedings of the 44th ACM Technical Symposium on Computer Science Education. ACM, 2013.
[24] J. Moreno-León and G. Robles. “Computer programming as an
educational tool in the English classroom: A preliminary study”. In
Global Engineering Education Conference (EDUCON), 2015 IEEE,
pages 961-966. IEEE, 2015.
[25] J. C. Adams and A. R. Webster. “What do students learn about
programming from game, music video, and storytelling projects?” In
Proceedings of the 43rd ACM technical symposium on Computer
Science Education, pages 643–648. ACM, 2012.