Explaining Spreadsheets with Spreadsheets
(Short Paper)
Jácome Cunha
University of Minho & NOVA LINCS,
Portugal
jacome@di.uminho.pt
Mihai Dan
Oregon State University, USA
danm@oregonstate.edu
Martin Erwig
Oregon State University, USA
erwig@oregonstate.edu
Danila Fedorin
Oregon State University, USA
fedorind@oregonstate.edu
Alex Grejuc
Oregon State University, USA
grejuca@oregonstate.edu
Abstract
Based on the concept of explanation sheets, we present an
approach to make spreadsheets easier to understand and
thus easier to use and maintain. We identify the notion of
explanation soundness and show that explanation sheets
which conform to simple rules of formula coverage provide
sound explanations. We also present a practical evaluation
of explanation sheets based on samples drawn from widely
used spreadsheet corpora and based on a small user study.
In addition to supporting spreadsheet understanding and maintenance, our work on explanation sheets has also uncovered several general principles of explanation languages that can help guide the design of explanations for other programming and domain-specific languages.
CCS Concepts • Software and its engineering → Domain-specific languages; Software maintenance tools;
Keywords
Software Understanding, Explanation Principles
ACM Reference Format:
Jácome Cunha, Mihai Dan, Martin Erwig, Danila Fedorin, and Alex Grejuc. 2018. Explaining Spreadsheets with Spreadsheets (Short Paper). In Proceedings of the 17th ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences (GPCE '18), November 5–6, 2018, Boston, MA, USA. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3278122.3278136

This work is partially supported by the National Science Foundation under the grant CCF-1717300. Work financed by the European Regional Development Fund through the Operational Programme for Competitiveness and Internationalization COMPETE 2020 Programme and by National Funds through the Portuguese funding agency FCT, project POCI-01-0145-FEDER-016718 and NOVA LINCS UID/CEC/04516/2013, and by FLAD project 233/2014.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

GPCE '18, November 5–6, 2018, Boston, MA, USA
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-6045-6/18/11...$15.00
https://doi.org/10.1145/3278122.3278136
1 Introduction
Studies estimate that software maintenance costs make up 60% [32] to 80% [33] of the total cost of software over its life cycle. Changing a program or even just using it requires a user to understand it [30], and that takes time and effort. Research has shown that software developers spend most of their time understanding source code [34].
Program understanding gets even more complicated when a piece of software changes its owner frequently. Studies of spreadsheet usage found that 85% of the study participants did not create the spreadsheets they had to work on themselves but received them from their colleagues [25]. The same paper reports that 70% of those users have difficulties understanding them and spend hours browsing them.
To address some of these challenges, Kankuzi extensively studied the mental models of spreadsheet users [27]. He advocates techniques and tools that reflect these mental models, such as abstracting implementation details by replacing cell references in formulas with user-defined names. Other approaches in this regard have tried to support the maintenance of spreadsheets by systematizing their evolution [16-18]. However, these techniques face an uphill battle, since they depend on the widespread adoption of new languages and tools to be effective. Moreover, it is difficult to support legacy spreadsheets in this way.
Therefore, we propose to augment spreadsheets with ex-
planations, which can help users better understand and thus
use and maintain spreadsheets. This approach allows users
to continue to work with their current spreadsheet system,
and it can be retroactively applied to existing spreadsheets.
In the following, we illustrate our approach with an example.
When people try to understand a spreadsheet, they com-
monly face the time-consuming and error-prone task of re-
solving cell references to make sense of what formulas in
a spreadsheet do. This task is often exacerbated in spread-
sheets which contain a distracting and overwhelming volume
of data and by the fact that referenced cells often contain
161
GPCE ’18, November 5–6, 2018, Boston, MA, USA J. Cunha, M. Dan, M. Erwig, D. Fedorin, and A. Grejuc
Figure 1. Payroll Spreadsheet
formulas that reference other cells in dierent parts of the
spreadsheet.
This process of "reference chasing" in spreadsheets is necessary, since references are specified using row and column indexes, which do not provide any information about the value or meaning of the referenced cell. Therefore, to make sense of what a reference represents, the user has to scan the spreadsheet for the indicated row and column. This may require a user to scroll a long distance and in many cases jump across multiple cells to make sense of the initial reference.
We address this problem through the use of label abstraction in our spreadsheet explanation model, where available labeling information is used to replace raw index references with the values of the cells that label them. In doing so, formulas that use references become clearer.
Consider Figure 1, which shows the formula view of a spreadsheet with payroll information for employees within a company. The spreadsheet was adapted from a study on spreadsheet error detection and correction [8] and is a simplified version of a real-world scenario.
Applying label abstraction to the payroll spreadsheet re-
sults in the spreadsheet shown in Figure 2. Label abstraction
increases readability by giving context to computation, es-
pecially as formula complexity increases.
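The label-abstraction step can be sketched in a few lines of code. The cell contents and layout below are hypothetical stand-ins (the actual payroll sheet is shown in Figure 1), and the sketch assumes the simplest labeling scheme: the label of a data cell is the header value at the top of its column.

```python
import re

# Hypothetical payroll fragment: row 1 holds column labels, rows 2-3 hold data.
sheet = {
    ("A", 1): "Hours", ("B", 1): "Rate", ("C", 1): "Pay",
    ("A", 2): "40",    ("B", 2): "15",   ("C", 2): "=A2*B2",
    ("A", 3): "35",    ("B", 3): "20",   ("C", 3): "=A3*B3",
}

def abstract_labels(formula):
    """Replace each raw reference (e.g. A2) with the label of its column."""
    def to_label(m):
        col = m.group(1)
        label = sheet.get((col, 1))      # header value in row 1 of that column
        return label if label else m.group(0)
    return re.sub(r"([A-Z])(\d+)", to_label, formula)

print(abstract_labels(sheet[("C", 2)]))  # =Hours*Rate
```

Note that after abstraction the formulas in rows 2 and 3 become textually identical, which is exactly what the zoom compression described next exploits.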
We can observe that while the values in rows 3-6 differ, the formulas with labels are all identical. Such a pattern occurs quite frequently. The redundancy that results from the repetition is rather distracting. On the other hand, the different values in the different rows do not contribute much to the understanding of the represented computation.
Therefore, we apply another transformation to obtain an explanation of the original spreadsheet that represents repeated (groups of) rows by a single (group of) row(s). The result of this compression, which we call a zoom, can be seen in Figure 2, where rows 3-6 from the original spreadsheet have been compressed into one row. Such compressed rows contain two kinds of information. First, columns (such as E-G) that contain a single repeated formula in the uncompressed sheet contain just that single formula. Second, columns (such as A-D) with different values in different rows contain a range that captures all values found in the respective column. The same zoom compression technique can, of course, be applied to repeated columns.
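The row-compression step can be sketched as follows, assuming label-abstracted rows represented as dictionaries (the column names are hypothetical): columns whose entries coincide across all rows keep that single entry, and columns with differing values are summarized by a (min, max) range.

```python
def zoom_rows(rows):
    """Compress repeated rows into one: identical entries (such as a shared
    label-abstracted formula) are kept, differing values become a range."""
    zoomed = {}
    for col in rows[0]:
        entries = [row[col] for row in rows]
        if all(e == entries[0] for e in entries):
            zoomed[col] = entries[0]                    # single repeated entry
        else:
            zoomed[col] = (min(entries), max(entries))  # value range
    return zoomed

rows = [
    {"Hours": 40, "Rate": 15, "Pay": "=Hours*Rate"},
    {"Hours": 35, "Rate": 20, "Pay": "=Hours*Rate"},
]
print(zoom_rows(rows))
# {'Hours': (35, 40), 'Rate': (15, 20), 'Pay': '=Hours*Rate'}
```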
Note that both transformations preserve the essential
structure and the key computing elements of the original
spreadsheet. We call a spreadsheet that is the result of label
abstraction and zoom compression an explanation sheet.
In the remainder of this paper, we first present some general design principles for explanations in Section 2. Based on these principles, we have developed our spreadsheet explanation model, which is formalized in Section 3. In Section 4 we present a preliminary artifact evaluation, followed by a small user study in Section 5. We discuss related work in Section 6 and present some conclusions in Section 7.
2 Explanation Principles
Informed in part by previous work on explanations [20, 35], but also by the experience during the creation of spreadsheet explanations (see Section 4), we have identified a number of general principles that we believe should guide the development of explanations. We present these principles here in a separate section, so that they can also inform the design of explanations for other languages.
In general, explanations can take on many different forms. Taking a programming language perspective, we have found it useful to conceptualize an explanation system as consisting of two languages: (A) the language whose programs are to be explained and (B) the language in which explanations are expressed. We call the former the subject language and the latter the explanation language. Correspondingly, we call programs of the subject language subject programs (or programs for short), and we call programs of the explanation language explanation programs (or explanations for short). In the context of spreadsheet explanations, this means that we sometimes refer to spreadsheets as subject (spread)sheets and that we call their explanations explanation (spread)sheets.
The following four principles have emerged as guidelines
for the design of spreadsheet (and other) explanations.
(1) Structure Preservation. An explanation language should retain key subject language structures. Subject language structures can provide easy access to an explanation, since users are already familiar with these structures. Moreover, reused structures facilitate the alignment of explanations with subject programs.

(2) Abstraction. An explanation language should aim at high-level descriptions that abstract from details of the subject language. Abstraction makes explanations faster to absorb. It also allows explanations to provide summaries of subject programs.

(3) Partiality. An explanation language should support partial explanations. In other words, an explanation should not be required to cover all of a subject program. Partiality supports a gentle slope approach to explanations, since it allows the incremental construction of more and more complete explanations. Moreover, partiality allows one to ignore parts that cannot be explained (because they are not understood) or are trivial or unimportant.
Figure 2. Top: Payroll Spreadsheet with Label Abstraction; Bottom: Explanation Sheet for the Payroll Spreadsheet
(4) Compositionality. An explanation language should support constructing bigger explanations from smaller ones. This requires composition operators for explanations. Compositionality supports the systematic construction of explanations and the reuse of explanations. Together with partiality, compositionality supports the distributed creation of explanations by different people who understand different parts of the subject program.
Note that the rst two principles are sometimes in conict
with each other because abstraction calls for ignoring struc-
tures in the subject language. Moreover, there is a trade-o
between the benets that can be gained from abstraction
and the explicitness and simplicity oered by a detailed and
concrete description of a computation. As illustrated in [
20
],
this problem can be addressed by providing for one subject
program a set of explanations that are related and can be
explored in a systematic way.
3 An Explanation Language for Spreadsheets
In Sections 3.1 and 3.2 we define spreadsheets and explanation sheets, respectively. In Section 3.3 we introduce a relationship between the two languages that captures the notion of explainability.
3.1 Spreadsheets
A spreadsheet is a rectangular grid of cells that contain formulas and values. We can represent spreadsheets s ∈ S as partial mappings from addresses A = ℕ × ℕ to formulas. Formulas are either plain values (v ∈ Val), applications of operations (ω) to other formulas, or references to cells (a ∈ A). The set of values includes an empty value ε, which allows us to distinguish undefined cells that are part of the spreadsheet from undefined cells on the outside.

f ∈ Fml ::= v | ω(f, ..., f) | a

Abstracting from the contents of cells, we use the type constructor ⟨α⟩ = A ⇀ α to represent sheets indexed by addresses and storing values of type α. A spreadsheet ⟨Fml⟩ is then simply a sheet of formulas. Formulas are evaluated to values Val, and we call the result of the evaluation of a spreadsheet a value sheet, which is a sheet of values ⟨Val⟩. The semantics of a spreadsheet language is given by a function ⟦·⟧ : ⟨Fml⟩ → ⟨Val⟩ that maps spreadsheets to value sheets.
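This semantics can be sketched directly in code. A sheet is a dict from addresses to formulas; the tuple encoding of formulas below is our own illustrative choice, not the paper's.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}

# Formulas: a plain value v, a reference ("ref", a), or ("op", ω, f1, ..., fn).
def evaluate(sheet):
    """Map a sheet of formulas to its value sheet (the semantics function)."""
    def ev(f):
        if isinstance(f, tuple) and f[0] == "ref":
            return ev(sheet[f[1]])                       # resolve the reference a
        if isinstance(f, tuple) and f[0] == "op":
            return OPS[f[1]](*(ev(x) for x in f[2:]))    # apply ω to sub-results
        return f                                         # plain value v
    return {a: ev(f) for a, f in sheet.items()}

payroll = {(2, 1): 40, (2, 2): 15,
           (2, 3): ("op", "*", ("ref", (2, 1)), ("ref", (2, 2)))}
print(evaluate(payroll)[(2, 3)])  # 600
```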
3.2 Explanation Sheets
Following the structure preservation principle from Section 2, we design an explanation of a spreadsheet to be itself a kind of spreadsheet, a so-called explanation sheet that stores formula explanations in cells. Following the abstraction principle, an explanation sheet should abstract from some of the details of the spreadsheet and should thus be smaller in size. We therefore need a definition that allows one cell in an explanation sheet to explain many cells in a spreadsheet.
To explain formulas, we need explanations for values, references, and expressions built by operations applied to other formulas. Since the values in a spreadsheet are either numbers or strings, which are both ordered domains, we can summarize a set of values from different cells by a value range (v̄ ∈ V̄al = Val × Val). Similarly, a set of references can be summarized by an address range (ā ∈ Ā = A × A). The two addresses of a range represent opposing corners of a rectangular area, and the region denoted by a range is given by the function ρ : Ā → ℘(A), which is defined as follows (⊓/⊔ compute the minimum/maximum of two numbers).

ρ((x₁, y₁), (x₂, y₂)) = {(x, y) | x₁ ⊓ x₂ ≤ x ≤ x₁ ⊔ x₂ ∧ y₁ ⊓ y₂ ≤ y ≤ y₁ ⊔ y₂}
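The region function ρ has a direct reading as code; a minimal sketch for numeric (row, column) addresses:

```python
def region(a1, a2):
    """All addresses in the rectangle with opposing corners a1 and a2 (the function ρ)."""
    (x1, y1), (x2, y2) = a1, a2
    return {(x, y)
            for x in range(min(x1, x2), max(x1, x2) + 1)
            for y in range(min(y1, y2), max(y1, y2) + 1)}

print(len(region((3, 1), (1, 2))))  # 6 addresses, regardless of corner order
```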
Since labels have been successfully employed in the past for annotating and explaining cells [4, 19, 26, 29], we use labels to explain a set of references by one or two (row and/or column) labels (ℓ ∈ Lab = Val ∪ Val × Val). More precisely, based on a relationship L ⊆ A × A where (a, a′) ∈ L whenever the
Value: v ▷ v
Value Range: if v₁ ≤ v ≤ v₂ then (v₁, v₂) ▷ v
Address Range: if a₁ ≤ a ≤ a₂ then (a₁, a₂) ▷ a
Formula: if x₁ ▷ f₁, ..., xₙ ▷ fₙ then ω(x₁, ..., xₙ) ▷ ω(f₁, ..., fₙ)
Label: if L(a) = ℓ then ℓ ▷ a
Empty Value: (v₁, v₂) ▷ ε
Empty Formula: ω(x₁, ..., xₙ) ▷ ε
Unexplained: ⊥ ▷ f

Figure 3. Formula Explanations
value S(a) in cell a is considered to be a label for cell a′, we can define a partial labeling function L : A ⇀ Lab, which identifies values as labels for cells.

L(a) = S(a′) if L⁻¹(a) = {a′}
L(a) = (S(a₁), S(a₂)) if L⁻¹(a) = {a₁, a₂}

L(a) is undefined whenever L⁻¹(a) = ∅.
We explain sets of formulas that share a common structure and differ only in their references by a formula with labels abstracting the references. Finally, we represent unexplained areas using the special value ⊥ ("unexplained"), which allows us to reduce potentially large chunks of a spreadsheet to a single row, column, or cell. Thus we obtain the following definition of explanation formulas and the derived notion of explanation sheets ⟨Xpl⟩.

x ∈ Xpl ::= v | v̄ | a | ā | ℓ | ω(x, ..., x) | ⊥
The structure preservation embraced by Xpl aligns the structure and composition of an explanation sheet with that of the explained spreadsheet.
3.3 Explaining Spreadsheets with Explanation Sheets
A spreadsheet explanation is captured by a so-called zoom X ▷η S, which consists of an explanation sheet X, a spreadsheet S, and a total function η that embeds the spreadsheet into the explanation, that is, dom(η) = dom(S) ∧ rng(η) ⊆ dom(X), and whose explanation formulas explain the formulas of the spreadsheet. The totality of η ensures that every cell in S is covered by a cell in X. We do not require zooms to be surjective, in order to allow for "filler cells" in the explanation sheet that serve no purpose other than to make explanation sheets rectangular.
The purpose of zooms is to explain a number of similar cells by one cell. Specifically, when η⁻¹(a) = {a₁, ..., aₖ}, we use cell a to summarize, or explain, all the cells a₁, ..., aₖ. We can formalize this idea through the notion of formula explanation, which is defined as a binary relationship x ▷ f that says that an explanation formula x explains a spreadsheet formula f; see Figure 3.
⟦v⟧X = (v, v)
⟦v̄⟧X = v̄
⟦a⟧X = ⟦X(a)⟧X
⟦ā⟧X = ↕{⟦X(a)⟧X | a ∈ ρ(ā)}
⟦ℓ⟧X = ⟦L⁻¹(ℓ)⟧X
⟦ω(x₁, ..., xₙ)⟧X = ↕{⟦ω(v₁, ..., vₙ)⟧ | ⟦xᵢ⟧X = (vᵢ¹, vᵢ²) ∧ vᵢ¹ ≤ vᵢ ≤ vᵢ²}
⟦⊥⟧X = ⊥

Figure 4. Explanation Semantics
The cases for plain value, value range, and address range should be obvious. The rule Formula requires that the explanation and explained formulas have the same structure, and the premise in the rule Label ensures that a label exists. The rules Empty Value and Empty Formula allow empty values to be explained by ranges and formulas, respectively, and the rule Unexplained allows any formula to be left unexplained.
For a zoom X ▷η S we require that every formula in X explain all formulas in S that are mapped to it, that is:

∀a ∈ dom(X), (a′, a) ∈ η : X(a) ▷ S(a′)
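This coverage condition is mechanically checkable. A sketch, restricted to the plain-value, value-range, and Unexplained rules of Figure 3 (a full checker would also handle references, labels, and compound formulas):

```python
def explains(x, f):
    """The explains relation for plain values, value ranges, and ⊥."""
    if x == "BOT":              # the unexplained marker ⊥ explains anything
        return True
    if isinstance(x, tuple):    # a value range (v1, v2)
        return x[0] <= f <= x[1]
    return x == f               # a plain value explains only itself

def covers(X, S, eta):
    """Check that X(eta(a)) explains S(a) for every cell a of the subject sheet."""
    return all(explains(X[eta[a]], S[a]) for a in S)

S = {(3, 1): 40, (4, 1): 35}            # two rows of hours
X = {(3, 1): (30, 45)}                  # one explanation cell holding a range
eta = {(3, 1): (3, 1), (4, 1): (3, 1)}  # both subject cells map to the same cell
print(covers(X, S, eta))                # True
```

This is essentially the shape of the explanation checker described in Section 4.2, although the paper gives no implementation details.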
Based on the semantics of spreadsheets, we can define the semantics for explanation sheets as follows. Since explanation formulas include ranges of values and addresses, they will generally evaluate to ranges of values.¹ To resolve references the semantics needs access to the explanation sheet. Since we also have to account for ⊥ formulas, the semantics of explanation formulas is of type ⟦·⟧ : Xpl → ⟨Xpl⟩ → V̄al ∪ {⊥}. The definition is shown in Figure 4. We use the function ↕V = (⊓V, ⊔V) to compute the minimally enclosing range for a set of values V. (We also use it for addresses.)
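The hull function ↕ and the Formula case of Figure 4 can be sketched as follows. Evaluating over only the corners of the argument ranges is exact for operations that are monotone in each argument (which covers the arithmetic used in the payroll example), so this is an approximation of the rule, not a general implementation:

```python
from itertools import product

def hull(values):
    """The minimally enclosing range for a collection of values (the function ↕)."""
    return (min(values), max(values))

def eval_range_op(op, ranges):
    """Apply op to every corner combination of the argument ranges and
    enclose the results (sound only for componentwise-monotone operations)."""
    return hull([op(*corner) for corner in product(*ranges)])

# Pay = Hours * Rate over the compressed ranges of the payroll example.
print(eval_range_op(lambda a, b: a * b, [(35, 40), (15, 20)]))  # (525, 800)
```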
The semantics of explanation sheets is then given by the following function ⟦·⟧ : ⟨Xpl⟩ → ⟨V̄al ∪ {⊥}⟩.

⟦X⟧ = {(a, v̄) | (a, x) ∈ X ∧ ⟦x⟧X = v̄}

Note that the semantics also depends on the underlying subject sheet S and a labeling relationship L to resolve labels (ℓ) in explanation formulas.
Next we introduce the notion of zoom soundness. This is essentially the ▷ relationship for value ranges and values applied to whole sheets that are connected via a function η. We say that an explanation X is sound for a spreadsheet S under η if ⟦X⟧ ▷η ⟦S⟧. This relationship captures the notion that an explanation sheet X covers all cases of the explained spreadsheet S and that the evaluation of S holds no surprises. Now we can present our main result, which says that zooms are sound.

Theorem 3.1 (Soundness). X ▷η S ⟹ ⟦X⟧ ▷η ⟦S⟧
Note that for any spreadsheet S we can always find a trivial explanation through the zoom S ▷id S,² which means that any

¹ A single value v can always be represented by a trivial range (v, v).
² Here id denotes the identity function.
spreadsheet trivially explains itself. However, such a zoom
is not really useful, since it does not achieve any abstraction.
Employing a straightforward ordering on zooms based on the size of the explanation sheet, we can define that a zoom X₁ ▷η₁ S achieves a higher explanatory reduction than a zoom X₂ ▷η₂ S if |dom(X₁)| < |dom(X₂)|. Note that this relationship defines a partial order, and there isn't necessarily a single smallest explanation.
We can identify a number of interesting relationships for explanations, including notions such as explanatory coverage and, based on that, an explanation refinement relationship. We leave that for future work.
4 Artifact Evaluation
We employed real-world spreadsheets from two spreadsheet
corpora in the design and evaluation of our approach.
4.1 Guiding the Design of Explanations
The design of our spreadsheet explanations was guided in part by real-world example spreadsheets. With a preliminary definition of explanation formulas and zooms we set out to explain existing spreadsheets from two repositories.
Specically, we analyzed 20 randomly selected spread-
sheets from [
31
], which are generally well-designed spread-
sheets created by experts, plus 20 randomly selected spread-
sheets from the Enron spreadsheets corpus [
22
], which in-
cludes more than 15,000 spreadsheets.
We manually created an explanation spreadsheet for each of the 40 spreadsheets. The main purpose of this exercise was to see whether explanation formulas are general enough, or perhaps unnecessarily general, and whether our definition of zooms worked as anticipated.

During this testing phase, the explanation model was revised several times. Specifically, we removed a number of explanation formulas that we originally thought to be useful, because the anticipated situations occurred either not at all or only once or twice and thus did not justify a more elaborate notion of explanation formulas. We also simplified the definition of zooms, which originally were defined recursively, allowing nested zooms to explain nested loop structures in spreadsheets. But since such a nested loop structure occurred in only one of the selected examples, we traded the more general definition for a simpler one.
4.2 Applicability and Impact
We randomly selected a new set of spreadsheets from the two sources and then used 41 worksheets from 36 different spreadsheets to analyze the applicability and effect of spreadsheet explanations.

We observed that 78% (32/41) of the worksheets contained areas that could be compressed and explained by zooms (20/10/2 worksheets contained row/column/row-and-column zooms). Ignoring three huge spreadsheets that were basically used as databases and that would lead to a misleadingly high average, the average (min/max) size compression achieved by zooms was 64% (25%/99%).
To verify that the generated explanations were correct, we developed an explanation checker that implements the definitions from Section 3.3, specifically Figure 3. This explanation checker was applied to all explanations and helped to correct at least one error in 24 of the 41 explanation sheets. Most of the errors were due to simple typos, but the checker also found several incorrect range mappings in zooms and other reference errors. A summary of the kinds of errors that were detected is shown in Figure 5.
5 User Evaluation
In this section we describe a preliminary user study we performed to assess the usefulness of explanation spreadsheets.

Spreadsheets. For this study, we semi-randomly selected 4 spreadsheets (labeled A through D) from 3 different sources. The selection process was semi-random in the sense that we had to randomly re-select a new spreadsheet if the previous one was not suited (for example, when it lacked formulas). We selected two spreadsheets from the EUSES corpus [21], one from [31], and one from the Enron corpus [22].
Participants. We recruited ten participants from two universities who were either computer science graduate students or had already finished their graduate degree. Most of the participants are quite experienced spreadsheet users, some of them having created over 100 spreadsheets. Two of the participants were female and eight were male, with ages ranging from 23 to 45.
Procedure. For each spreadsheet we created the corresponding explanation. We then created two sets of artifacts, each consisting of two spreadsheets and two explanation sheets. The first set contained the spreadsheets for A and C and the explanation sheets for B and D, and the second set contained the spreadsheets for B and D and the explanation sheets for A and C. We then randomly assigned each participant one of the sets. This way each participant had to work with spreadsheets as well as explanation sheets, in different orders, thus reducing learning effects. We had the participants go over a one-page tutorial about spreadsheet explanations before answering the actual questions. For each of the four spreadsheets we asked two questions:
N  Error Type
4  Value (value in S is mapped to a different value in X)
7  Range (value in S is not covered by the range in X)
4  Reference (undefined references in mappings)
1  Label A (value of the label does not match the value in S)
8  Label B (labeling does not correctly abstract the reference)

Figure 5. Errors found by the Explanation Checker
Table 1. Average times and scores in the empirical study

              average time         average score
         subject  explanation  subject  explanation
A  Q1      1.3       2.1         2.2       2.4
   Q2      1.1       2.2         3.0       2.8
B  Q1      3.1       2.9         2.0       2.6
   Q2      2.5       3.7         2.0       1.8
C  Q1      2.1       1.8         3.0       1.0
   Q2      1.0       2.9         2.4       1.4
D  Q1      3.6       5.4         1.2       1.4
   Q2      6.8       3.3         1.8       2.0

Q1 What is being calculated in row/column/cell X?
Q2 How are the values in row/column/cell X calculated?
Results.
We present the results of the study in Table 1. For
each spreadsheet and its explanation we show the average
time (in minutes) participants took to answer each question
and the average score of the answers. We scored each answer
with a value from 0 (wrong answer) to 3 (entirely correct).
Discussion. The user evaluation produced mixed results. Explanation sheets led to higher scores in 3 out of the 4 scenarios, the exception being C, which produced significantly lower results. Here we note that the explanation sheet for C employed a column header (S) as a label where none was provided by the subject spreadsheet. As the participants had no prior knowledge of explanations, this could have made it hard to infer the meaning of the column reference, thus impacting understandability.

There is no significant difference between the average times it took participants to answer the questions. With the exception of D, participants were able to determine how a computation was performed faster using the explanation. However, explanations were faster at conveying what a computation calculated only in cases B and C.
Interestingly, participants took longer to answer questions for simple spreadsheets using explanations. This can possibly be attributed to the fact that the participants had extensive experience with spreadsheets but none with explanations. It also seems to indicate that explanation sheets are probably more useful for complex spreadsheets.

These results indicate the potential of explanation sheets for providing a better understanding of spreadsheets. In their answers to a post-study survey, eight out of ten participants said that they found explanation sheets somewhat or very helpful. (For the other two participants, they made no difference.) Also, eight of the participants would want to use explanation sheets in the future.
6 Related Work
Amaltano et al. have developed a tool to help end users to
comprehend VBA-based Excel applications [
6
]. This tool is
designed to work with Excel programs that heavily depend
on VBA macros, explaining the relation between the macros
and the cells, and cannot be used to understand the data or
cell computations in a spreadsheet as we propose.
Kankuzi and Sajaniemi have investigated spreadsheet authors' mental models [27]. Based on that, they proposed a tool that translates cell references into domain/real-world terms using the labels in the spreadsheet. However, their focus is on error detection, while ours is on understanding the content of a spreadsheet.
Several authors have proposed techniques to identify structural information in spreadsheets [2, 5, 7, 11-15, 23]. Although identified structures may help in understanding spreadsheets, there is no direct evidence for this, and none of the mentioned approaches was meant to help users understand spreadsheets, but instead to give some kind of support for manipulating spreadsheet content.
The use of spreadsheet labels is not new [19]. In fact, several approaches have been proposed for the inference of cell labels [1], for reasoning about spreadsheets using their labels [10], and for using labels for error checking [3]. And while labels are sometimes used to explain spreadsheet errors [4], none of this work was intended to explain a spreadsheet's purpose. Some tools provide named ranges for defining a name for a range of cells, which can then be used instead of regular references. Although this feature has been shown to make debugging less effective [28], there is no evidence of its impact on understanding spreadsheets.
Hermans has proposed techniques to understand the dependencies between different worksheets [24, 25]. In contrast, our work improves the understandability of each worksheet. Thus, these two approaches are complementary.
Some researchers have proposed tools to support users in documenting their spreadsheets [9]. However, such tools require users to write documentation, which they usually do not do. While our approach also relies on an additional artifact, this artifact has a familiar structure and can be added incrementally. Most importantly, the formal structure of explanation sheets and their relationship to spreadsheets offers opportunities for the automatic inference of explanation sheets, which is something we plan to work on in the future.
7 Conclusions
We have presented the concept of explanation sheets to support the understanding of spreadsheets. The design of explanation sheets has some appealing properties: First, they facilitate the gradual and incremental construction of spreadsheet explanations. Second, their formalization supports the definition of tools for checking, for example, the correctness of explanations, which we have already exploited. There are many other support tools that can be envisioned and that we will investigate in future work, such as checking the coverage of explanations, checking the compatibility of alternative explanations, and the inference of explanation sheets.
References
[1] R. Abraham and M. Erwig. 2004. Header and Unit Inference for Spreadsheets Through Spatial Analyses. In IEEE Int. Symp. on Visual Languages and Human-Centric Computing. 165–172.
[2] R. Abraham and M. Erwig. 2006. Inferring Templates from Spreadsheets. In 28th IEEE Int. Conf. on Software Engineering. 182–191.
[3] R. Abraham and M. Erwig. 2007. UCheck: A Spreadsheet Unit Checker for End Users. Journal of Visual Languages and Computing 18, 1 (2007), 71–95.
[4] R. Abraham, M. Erwig, and S. Andrew. 2007. A Type System Based on End-User Vocabulary. In IEEE Int. Symp. on Visual Languages and Human-Centric Computing. 215–222.
[5] Sorin Adam and Ulrik Pagh Schultz. 2015. Towards Tool Support for Spreadsheet-based Domain-specific Languages. SIGPLAN Not. 51, 3 (Oct. 2015), 95–98.
[6] Domenico Amalfitano, Vincenzo De Simone, Anna Rita Fasolino, and Porfirio Tramontana. 2016. EXACT: A tool for comprehending VBA-based Excel spreadsheet applications. Journal of Software: Evolution and Process 28, 6 (2016), 483–505.
[7] Domenico Amalfitano, Anna Rita Fasolino, Porfirio Tramontana, Vincenzo De Simone, Giancarlo Di Mare, and Stefano Scala. 2015. A Reverse Engineering Process for Inferring Data Models from Spreadsheet-based Information Systems: An Automotive Industrial Experience. In Data Management Technologies and Applications. 136–153.
[8] Brian Bishop and Kevin McDaid. 2008. An Empirical Study of End-User Behaviour in Spreadsheet Error Detection & Correction. CoRR abs/0802.3479 (2008).
[9] Diogo Canteiro and Jácome Cunha. 2015. SpreadsheetDoc: An Excel Add-in for Documenting Spreadsheets. In Proceedings of the 6th National Symposium of Informatics.
[10] C. Chambers and M. Erwig. 2010. Reasoning about Spreadsheets with Labels and Dimensions. Journal of Visual Languages and Computing 21, 5 (2010), 249–262.
[11] Zhe Chen and Michael Cafarella. 2013. Automatic Web Spreadsheet Data Extraction. In 3rd International Workshop on Semantic Search Over the Web. Article 1, 8 pages.
[12] J. Cunha, M. Erwig, and J. Saraiva. 2010. Automatically Inferring ClassSheet Models from Spreadsheets. In IEEE Int. Symp. on Visual Languages and Human-Centric Computing. 93–100.
[13] Jácome Cunha, Martin Erwig, Jorge Mendes, and João Saraiva. 2016. Model inference for spreadsheets. Automated Software Engineering 23, 3 (2016), 361–392.
[14] Jácome Cunha, João Saraiva, and Joost Visser. 2009. From Spreadsheets to Relational Databases and Back. In 2009 ACM SIGPLAN Symposium on Partial Evaluation and Semantics-based Program Manipulation. 179–188.
[15] Wensheng Dou, Shi Han, Liang Xu, Dongmei Zhang, and Jun Wei. 2018. Expandable Group Identification in Spreadsheets. In 33rd ACM/IEEE Int. Conf. on Automated Software Engineering. 498–508.
[16] G. Engels and M. Erwig. 2005. ClassSheets: Automatic Generation of Spreadsheet Applications from Object-Oriented Specifications. In 20th IEEE/ACM Int. Conf. on Automated Software Engineering. 124–133.
[17] M. Erwig, R. Abraham, I. Cooperstein, and S. Kollmansberger. 2005. Automatic Generation and Maintenance of Correct Spreadsheets. In 27th IEEE Int. Conf. on Software Engineering. 136–145.
[18] M. Erwig, R. Abraham, S. Kollmansberger, and I. Cooperstein. 2006. Gencel – A Program Generator for Correct Spreadsheets. Journal of Functional Programming 16, 3 (2006), 293–325.
[19] M. Erwig and M. M. Burnett. 2002. Adding Apples and Oranges. In 4th Int. Symp. on Practical Aspects of Declarative Languages (LNCS 2257). 173–191.
[20] M. Erwig and E. Walkingshaw. 2013. A Visual Language for Explaining Probabilistic Reasoning. Journal of Visual Languages and Computing 24, 2 (2013), 88–109.
[21] Marc Fisher and Gregg Rothermel. 2005. The EUSES Spreadsheet Corpus: A Shared Resource for Supporting Experimentation with Spreadsheet Dependability Mechanisms. SIGSOFT Softw. Eng. Notes 30, 4 (May 2005), 1–5.
[22] Felienne Hermans and Emerson Murphy-Hill. 2015. Enron's Spreadsheets and Related Emails: A Dataset and Analysis. In 37th Int. Conf. on Software Engineering. 7–16.
[23] Felienne Hermans, Martin Pinzger, and Arie van Deursen. 2010. Automatically Extracting Class Diagrams from Spreadsheets. In European Conference on Object-Oriented Programming 2010. 52–75.
[24] Felienne Hermans, Martin Pinzger, and Arie van Deursen. 2011. Breviz: Visualizing Spreadsheets using Dataflow Diagrams. CoRR abs/1111.6895 (2011).
[25] Felienne Hermans, Martin Pinzger, and Arie van Deursen. 2011. Supporting Professional Spreadsheet Users by Generating Leveled Dataflow Diagrams. In 33rd Int. Conf. on Software Engineering. 451–460.
[26] B. Kankuzi and J. Sajaniemi. 2014. Visualizing the problem domain for spreadsheet users: A mental model perspective. In IEEE Symp. on Visual Languages and Human-Centric Computing. 157–160.
[27] Bennett Kankuzi and Jorma Sajaniemi. 2016. A mental model perspective for tool development and paradigm shift in spreadsheets. International Journal of Human-Computer Studies 86 (2016), 149–163.
[28] Ruth McKeever, Kevin McDaid, and Brian Bishop. 2009. An Exploratory Analysis of the Impact of Named Ranges on the Debugging Performance of Novice Users. CoRR abs/0908.0935 (2009).
[29] B. A. Nardi and J. R. Miller. 1991. Twinkling Lights and Nested Loops: Distributed Problem Solving and Spreadsheet Development. Int. Journal of Man-Machine Studies 34, 2 (1991), 161–184.
[30] Santanu Paul, Atul Prakash, Erich Buss, and John Henshaw. 1991. Theories and Techniques of Program Understanding. In Conf. of the Centre for Advanced Studies on Collaborative Research. 37–53.
[31] Stephen G. Powell and Kenneth R. Baker. 2003. The Art of Modeling with Spreadsheets. John Wiley & Sons, Inc., New York, NY, USA.
[32] R. Pressman. 2001. Software Engineering: A Practitioner's Approach (5th ed.). McGraw-Hill, New York, NY.
[33] C. Verhoef. 2000. How to Implement the Future. In 26th Euromicro Conference. 32–47.
[34] A. von Mayrhauser, M. Vans, and A. Howe. 1997. Understanding Behaviour During Enhancement of Large-scale Software. Journal on Software Maintenance: Research and Practice 9, 5 (1997), 299–327.
[35] E. Walkingshaw and M. Erwig. 2011. A DSEL for Studying and Explaining Causation. In IFIP Working Conference on Domain-Specific Languages. 143–167.