21
Automated Methods for Webpage Usability &
Accessibility Evaluations
Hidehiko Okada1 and Ryosuke Fujioka2
1 Kyoto Sangyo University, 2 Kobe Sogo Sokki Co. Ltd.
Japan
1. Introduction
Because of the rapid growth of the web and its users, web usability and accessibility have
become more and more important. This chapter describes two automated methods for evaluating
the usability/accessibility of webpages. The first method, for usability evaluation, is based on
interaction logging and analyses, and the second method, for accessibility evaluation, is
based on machine learning. The two methods are described in Sections 2 and 3,
respectively.
Several methods have been proposed and developed for usability evaluation based on user
interaction logs. Interaction logs can be recorded by computer logging programs (automated
logging) or by human evaluators (observational logging). Analysis methods and tools for
the former type of logs are well summarized by Ivory (Ivory & Hearst, 2001; Ivory, 2003),
and those for the latter type of logs have also been developed (for example, by Qiang (Qiang
et al., 2005)). Our method analyzes the former type of logs. In the survey by Ivory,
analysis methods for automatically captured log files for WIMP (Window, Icon, Menu and
Pointing device) applications and web applications are categorized in terms of their
approaches including metric-based, pattern-based, task-based and inferential ones. Some of
the methods with the task-based approach compare user interaction logs for a test task with
desired (expected) interaction sequences for the task and detect inconsistencies between the
user logs and the desired sequences (Kishi, 1995; Uehling & Wolf, 1995; Okada & Asahi,
1996; Al-Qaimari & Mcrostie, 1999; Helfrich & Landay, 1999; Zettlemoyer et al., 1999). The
inconsistencies are useful cues for finding usability problems: for example, an evaluator can
find that users selected some unexpected link on a webpage when another link on the page
was expected for the test task and that the link selected by the users may have some
usability problem in its design (labeling, layout, etc.).
The existing methods require widget-level logs for the comparisons. For example, the
method proposed by Okada (Okada & Asahi, 1996) requires interaction logs to include data
of widget properties such as widget label, widget type, title of parent window, etc. This
requirement degrades the independence and completeness of the methods in logging user
interactions with systems under evaluation. Section 2 of this chapter describes our method
that detects inconsistencies between user logs and desired sequences based on logs of clicked
points ((x, y) coordinate values). Coordinate values of clicked points can be easily and fully
logged independently of what widgets are clicked on. Several existing methods have also
utilized the logs of mouse clicked points (for example, “Mousemap” that visualizes mouse
moves (Gellner & Forbrig, 2003)), but the methods do not achieve the detection of
inconsistencies. The authors have developed a computer tool for logging and analyzing user
interactions and desired sequences by our method. The tool is applied to experimental
usability evaluations of websites. Effectiveness of the method in usability testing of
webpages is evaluated based on the application result.
In addition, to make webpages more accessible to people with disabilities, <table> tags should
not be used as a means to lay out document content visually: laying out content with <table>
tags may present problems when rendering to non-visual media (W3C, 1999a; 1999b). It is
reported that the number of tables on pages doubled from 7 in 2000 to 14 in 2003, with most
tables being used to control page layouts (Ivory & Megraw, 2005). Therefore, to evaluate the
accessibility of webpages, it should be checked whether the pages include layout-purpose
<table> tags. Several methods and tools have been proposed and developed for web
accessibility (Cooper, 1999; Scapin et al., 2000; Ivory, 2003; Ivory et al., 2003; Abascal et al.,
2004; Brajnik, 2004; Beirekdar et al., 2005; Vanderdonckt & Beirekdar, 2005). Accessibility
tools are listed in (W3C, 2006; Perlman, 2008) and compared in (Brinck et al., 2002). Some
tools detect deeply nested <table> tags that are likely to be layout-purpose ones. Still, a
method for automated detection of layout-purpose <table> tags in HTML sources is a
challenge: it requires more than simply checking whether specific tags and/or attributes of
the tags are included in the sources. Section 3 describes our method for the detection that is
based on machine learning. The proposed method derives a <table> tag classifier that
classifies the purpose of the tag: the classifier deduces whether a <table> tag is a layout-
purpose one or a table-purpose one. The section also describes our system that implements the
proposed method and reports the results of an experiment evaluating classification accuracy.
2. Automated Method for Webpage Usability Evaluation
2.1 Method for analyzing mouse click logs
2.1.1 User logs and desired logs
A user log can be collected by logging mouse clicks while a user (who does not know the
desired sequence of a test task) performs the test task in user testing. In our research, a log
file is collected for a test user and a test task: if the number of users is N and the number of
tasks is M, then the number of user log files is N*M (assuming all the N users complete all the
M tasks). A “desired” log is collected by logging mouse clicks while a user (who knows well
the desired sequence of a test task) performs the test task. For a test task, one desired log file
is usually collected. If two or more different interaction sequences are acceptable as desired
ones for a test task, two or more desired log files can be collected (and used in the
comparison described later).
2.1.2 Method for detecting inconsistencies in user/desired logs
Our method models two successive clicks as a vector and thus a sequence of operations in a
user/desired log as a sequence of vectors. A vector is from the ith clicked point to the (i+1)th
clicked point in the screen. To detect inconsistencies in a user log and a desired log, each
vector from the user log is compared with each vector from the desired log. If the distance of
the two vectors (vu from the user log and vd from the desired log) is smaller than a
threshold, vu and vd are judged as being matched: the user operation modeled by vu is
supposed to be the same operation modeled by vd. The method defines the distance D(vu, vd)
as a weighted sum of Dp and Dv (Fig. 1).

Dp = ( wx(xu1 - xd1)^2 + wy(yu1 - yd1)^2 )^0.5                                   (1)

Dv = ( wx(xu2 - (xu1 - xd1) - xd2)^2 + wy(yu2 - (yu1 - yd1) - yd2)^2 )^0.5       (2)

D(vu, vd) = wp Dp + wv Dv                                                        (3)
Figure 1. Two vectors vu and vd and their distance: vu runs from (xu1, yu1) to (xu2, yu2) in the user log, and vd runs from (xd1, yd1) to (xd2, yd2) in the desired log
The role of the weight factors wx and wy used in the calculations of Dp and Dv is as follows.
Users click on links in the webpages under evaluation to perform their tasks. The width of a link
is usually larger than its height, especially for a text link. Therefore, the
differences between clicked points for clicks on the same link are likely to be larger along the
horizontal axis (the x coordinate values) than along the vertical axis (the y coordinate values).
To deal with this, the weights wx and wy are chosen so that horizontal differences are
weighted less than vertical differences.
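As a concrete illustration of Eqs. (1)-(3), the following minimal Python sketch computes the distance and the match judgement for two click vectors. It is our own illustration rather than the authors' tool; the default weight and threshold values are taken from the "Original" row of Table 1.

    def vector_distance(vu, vd, wx=0.4, wy=1.0, wp=0.5, wv=1.0):
        """Distance D(vu, vd) of Eqs. (1)-(3).

        Each vector is given by its start and end click points,
        e.g. vu = ((xu1, yu1), (xu2, yu2))."""
        (xu1, yu1), (xu2, yu2) = vu
        (xd1, yd1), (xd2, yd2) = vd
        # Eq. (1): weighted distance between the start points.
        dp = (wx * (xu1 - xd1) ** 2 + wy * (yu1 - yd1) ** 2) ** 0.5
        # Eq. (2): weighted size of the difference between the two vectors.
        dv = (wx * (xu2 - (xu1 - xd1) - xd2) ** 2
              + wy * (yu2 - (yu1 - yd1) - yd2) ** 2) ** 0.5
        # Eq. (3): weighted sum of Dp and Dv.
        return wp * dp + wv * dv

    def is_same_operation(vu, vd, threshold=100.0):
        # Two operations are judged as the same (matched) if their distance
        # is below the threshold (100 pixels in the "Original" setting).
        return vector_distance(vu, vd) < threshold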
User operations that scroll webpages with the mouse wheel should also be taken into
account: wheel scrolls change widget (e.g., link) positions on the screen, so the clicked
positions may not be the same even for the same widget. Our method records the
amount of wheel scrolls while logging interactions. By using the logs of wheel scrolls,
coordinate values of clicked points are adjusted. Fig. 2 shows this adjustment. Suppose a
user clicked on the point (xi, yi) in the screen (Fig. 2(a)) and then clicked on the point (xi+1,
yi+1) (Fig. 2(b)). In this case, the vector derived from the two clicks is the one shown in Fig.
2(c). As another case, suppose a user scrolled down a webpage along the y axis by using the
mouse wheel between the two clicks and the amount of the scroll was S pixels. In this case,
the vector derived from the two clicks is the one in Fig. 2(d).
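Under our own assumptions about the log format (each entry is either a click with screen coordinates or a wheel-scroll amount in pixels; the authors' actual file format is not given in the chapter), the conversion from a click/scroll log to operation vectors could be sketched as follows.

    def clicks_to_vectors(log_entries):
        """Convert a click/scroll log into a list of operation vectors.

        log_entries is a list of ("click", x, y) or ("scroll", s) tuples,
        where s is the wheel-scroll amount in pixels (positive = down).
        Clicked y coordinates are adjusted by the accumulated scroll so that
        two clicks on the same widget map to the same point even if the page
        was scrolled between them (cf. Fig. 2)."""
        points = []
        scroll = 0
        for entry in log_entries:
            if entry[0] == "scroll":
                scroll += entry[1]
            else:  # a click
                _, x, y = entry
                points.append((x, y + scroll))
        # A vector runs from the i-th adjusted click to the (i+1)-th one.
        return [(points[i], points[i + 1]) for i in range(len(points) - 1)]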
2.1.3 Two types of inconsistencies as cues of usability problems
As cues of usability problems, the proposed method detects two types of inconsistencies
between user interactions and desired sequences. The authors refer to them as “unnecessary”
operations and “missed” operations. Fig. 3 illustrates these kinds of operations.
Unnecessary operations are user operations judged as not included in the desired
sequences, i.e., operations in a user log for which no operation in the desired logs is judged
as the same one in the comparison of the user/desired logs. Our method regards such user
operations as unnecessary because they may not be necessary for completing the test task.
Unnecessary operations can be cues for human evaluators to find usability problems in which
users clicked on a confusing link when another link was expected to be clicked on for the task.
Figure 2. Adjustment of a clicked point for a mouse wheel scroll: (a) first click at (xi, yi); (b) second click at (xi+1, yi+1); (c) resulting vector vu = (xi+1 - xi, yi+1 - yi) when no scroll occurred; (d) resulting vector vu = (xi+1 - xi, yi+1 + S - yi) when the page was scrolled down by S pixels between the clicks, i.e., the second click is treated as (xi+1, yi+1 + S)
Figure 3. Unnecessary operations and missed operations (u: unnecessary operations in the user log; m: missed operations in the desired log; connecting lines indicate two operations judged as the same)
Missed operations are desired operations judged as not included in the user interaction
sequences, i.e., operations in the desired logs for which no operation in a user log is judged
as the same one. Our method regards such desired operations as missed because they may be
necessary for completing the test task, yet the user finished the task without performing them.
Missed operations can be cues for human evaluators to find usability problems in which a link
is not clear enough or not easy enough for users to find. Our method models an operation in a
user/desired log by a vector derived from clicked point logs, so the method detects
unnecessary/missed operations as unnecessary/missed vectors.
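To make the comparison step concrete, the sketch below (again our illustration, reusing vector_distance from the earlier sketch) marks user vectors with no matching desired vector as unnecessary and desired vectors with no matching user vector as missed.

    def compare_logs(user_vectors, desired_vectors, threshold=100.0):
        """Return (unnecessary, missed) as lists of operation indices."""
        matched_user, matched_desired = set(), set()
        for i, vu in enumerate(user_vectors):
            for j, vd in enumerate(desired_vectors):
                if vector_distance(vu, vd) < threshold:
                    matched_user.add(i)
                    matched_desired.add(j)
        # User operations not judged as the same as any desired operation.
        unnecessary = [i for i in range(len(user_vectors)) if i not in matched_user]
        # Desired operations not judged as the same as any user operation.
        missed = [j for j in range(len(desired_vectors)) if j not in matched_desired]
        return unnecessary, missed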
Suppose two or more successive operations in a user log are unnecessary ones. In this case,
the first operation is likely to be the best cue among the successive unnecessary operations.
This is because the user probably deviated from the desired sequence at the first operation
(i.e., the expected operation was not clear enough for the user) and then performed additional
operations irrelevant to the test task until returning to the desired sequence. Our method
can extract the first operation in each run of successive unnecessary operations and show it
to human evaluators so that the evaluators can efficiently analyze usability problem cues
(unnecessary operations in this case), as sketched below.
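A small sketch of this filtering step, assuming the indices of unnecessary operations in a user log are already available (e.g., from compare_logs above):

    def first_of_each_run(unnecessary_indices):
        """Keep only the first index of each run of consecutive indices.

        E.g. [2, 3, 4, 7, 8] -> [2, 7]: operations 2 and 7 are the points
        where the user first deviated from the desired sequence."""
        firsts, previous = [], None
        for i in sorted(unnecessary_indices):
            if previous is None or i != previous + 1:
                firsts.append(i)
            previous = i
        return firsts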
The idea of analyzing unnecessary/missed operations in user interaction logs
(inconsistencies in user/desired logs) is not novel (e.g., Okada & Asahi, 1996), but our
method described in this section is unique in that it extracts those inconsistent operations
from logs of clicked points (logs of (x, y) coordinate values), not widget-level logs.
2.1.4 Unnecessary/missed operations common to users
Unnecessary/missed operations common to many of the test users are useful cues for finding
problems that do not depend on individual differences among the users. Our method
analyzes how many users performed the same unnecessary/missed operation.
The analysis of the user ratio for the same missed operation is simple: for each missed
operation, the number of user logs that do not include the desired operation is counted. To
analyze the user ratio for the same unnecessary operation, the method compares unnecessary
operations extracted from all user logs of the test task. This comparison is done in the
same way as operations (vectors) in user/desired logs are compared. By this comparison,
unnecessary operations common among multiple users can be extracted (Fig. 4), as sketched below.
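The following sketch (our illustration) groups the unnecessary operations extracted from all user logs with the same vector comparison and counts how many users share each one.

    def common_unnecessary(unnecessary_per_user, threshold=100.0):
        """Count users sharing each unnecessary operation.

        unnecessary_per_user is a list with one entry per user, each entry
        being the list of that user's unnecessary operation vectors.
        Returns (representative_vector, number_of_users) pairs."""
        groups = []
        for user, vectors in enumerate(unnecessary_per_user):
            for vu in vectors:
                for group in groups:
                    if vector_distance(vu, group["vector"]) < threshold:
                        group["users"].add(user)
                        break
                else:
                    groups.append({"vector": vu, "users": {user}})
        return [(g["vector"], len(g["users"])) for g in groups]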
Figure 4. Extraction of unnecessary operations common to users: unnecessary operations are first extracted from each user's log by vector comparison with the desired log, and the extracted operations are then compared with each other across users
2.2 Evaluating effectiveness based on case study
2.2.1 Design of experiment
Ten websites of business/public organizations were selected. For each site, a test task was
designed. The average number of clicks in the designed sequences for the ten test tasks was
3.9. Five university students participated in this experiment as test users. Each test user was
asked to perform the task on each site. They had sufficient knowledge of and experience with
using webpages in a PC web browser but were using the test websites for the first time. The desired
sequences of the test tasks were not told to the test users. Thus, if the desired sequences are
not clear enough for the test users, the users are likely to deviate from the desired sequences
and unnecessary and/or missed operations are observed. The interaction of each user for
each test task was logged into a user log file. To avoid fatigue affecting the results, the total
experiment time for each user was limited to about 60 minutes: each test user was asked to perform
a test task within five or ten minutes, depending on the task size.
Fifty user logs (five users * ten tasks) and ten desired logs (a log per test task) were collected.
For each task, our tool implementing the proposed method analyzed the logs and extracted
possible cues of usability problems (i.e., unnecessary/missed operations). An evaluator tried
to find usability problems based on the extracted cues.
2.2.2 Weight factors and thresholds for vector distance
Our method requires us to determine the values of weight factors wx, wy, wp and wv and the
threshold value of vector distance (see 2.1.2). To determine these values, a pre-experiment
was conducted with another test user. Based on the analysis of the log files collected in the
pre-experiment, values were determined that led to accurate results in detecting
unnecessary/missed operations. The row labeled "Original" in Table 1 shows the
obtained values.
              wx    wy    wp    wv    Threshold (pixel)
Original      0.4   1.0   0.5   1.0   100
Variation A   0.4   1.0   0.0   1.0   67
Variation B   0.4   1.0   1.0   0.0   34
Table 1. Values of weight factors and threshold for vector distance
In our method, distance of two operations (vectors) is defined by Eqs. (1)-(3). In the case
where wv = 0 and wp = 1, D(vu, vd) = Dp so that two operations are not compared on the
basis of two successive clicks (i.e., on the basis of vectors) but compared on the basis of a
single click. In this case, a click in a user log and a click in a desired log are judged as the
same operation if the clicked points are near. Similarly, in the case where wp = 0 and wv = 1,
D(vu, vd) = Dv so that two operations are compared by the size of vector difference only (i.e.,
the absolute position on which the click is performed in the screen is not considered). These
two variations of the method were also evaluated in addition to the original one. Variations
A and B in Table 1 denote these settings, with (wp, wv) = (0.0, 1.0) and (1.0, 0.0), respectively.
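To illustrate the difference between the settings, the hypothetical example below (our own numbers, reusing vector_distance from the sketch in Section 2.1.2) compares two operations that represent the same relative mouse move performed at different absolute positions on the screen.

    # Hypothetical vectors: the same relative move, but 400 pixels apart vertically.
    vu = ((120, 300), (400, 300))   # user-log vector
    vd = ((100, 700), (380, 700))   # desired-log vector

    d_original    = vector_distance(vu, vd, wp=0.5, wv=1.0)  # ~200 > 100: judged different
    d_variation_a = vector_distance(vu, vd, wp=0.0, wv=1.0)  # 0 < 67:     judged the same
    d_variation_b = vector_distance(vu, vd, wp=1.0, wv=0.0)  # ~400 > 34:  judged different

Variation A thus ignores where on the screen the clicks occurred and matches purely on the shape of the move, while variation B matches purely on the start position.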
2.2.3 Number of problems found
To evaluate the effectiveness of our method in finding usability problems, the number of
problems found by the method was compared with the number found by a method based on
manual observation of user interactions. In addition to recording click logs during the user
interaction sessions, the PC screen was captured to movie files (a screen recorder program was
run on the PC). A human evaluator observed user interactions by replaying the captured screen
movies and tried to find usability problems. This manual method is expected to require much
time for interaction observation but to find problems thoroughly. In this experiment, the
evaluator who tried to find problems by the proposed method and the evaluator who tried to
find problems by the manual method were different, so that the result with one method did
not bias the result with the other.
Table 2 shows the number of problems found by each of the methods. The values in the
table are the sums over the ten test tasks (sites). Eleven problems were shared among the four
sets of problems, i.e., the proposed (original) method contributed to finding 61% (= 11/18) of the
problems found by the manual method. Although the number of problems found by the
proposed method was smaller than that found by the manual method, the time for a human
evaluator to find the problems with the proposed method was much less than with the manual
method. In the case of the manual method, an evaluator had to investigate all clicks by the
users because user clicks that would be possible problem cues were not automatically
extracted. In the case of the proposed method, an evaluator only needed to investigate the smaller
number of clicks extracted as possible problem cues by the method. In this experiment, the
number of clicks to be investigated in the case of the proposed method was 10-20% of the
number in the case of the manual method.
This result of the case study indicates that the proposed method will
- contribute to finding usability problems to a certain extent in terms of the number of problems found, and
- be much more efficient in terms of the time required.
Method #Problems
Manual 18
Original 15
Variation A 14
Variation B 13
Table 2. Number of problems found by each method
2.2.4 Unnecessary/missed operations contributing to finding problems
Not every unnecessary/missed operation extracted by the proposed method contributes
to finding usability problems. The larger the proportion of extracted operations that
contribute to finding problems, the more efficiently the problems can be found. This
contribution ratio was investigated for the proposed method and its variations (Table 3).
Values in the “Counts (first)” column are the counts of unnecessary operations that were the
first in two or more successive unnecessary operations (see 2.1.3). For example, the original
method extracted four missed operations in total from log files of the ten test tasks, and
25.0% (one) of the four operations contributed to finding a problem. Similarly, the original
method extracted 375 unnecessary operations in total, and 7.5% (28) of the 375 operations
contributed to finding problems.
              Missed Operations      Unnecessary Operations
Method        Counts   Ratio         Counts (all)   Ratio    Counts (first)   Ratio
Original      4        25.0%         375            7.5%     51               49.0%
Variation A   8        37.5%         422            5.7%     58               39.7%
Variation B   1        0.0%          299            9.4%     72               37.5%
Table 3. Number of unnecessary and missed operations found by each method and the ratio of contribution in finding problems
Findings from the results in Table 3 are as follows.
- In all three methods, the ratios were larger for unnecessary operations (first) than for unnecessary operations (all). This result supports our idea that an evaluator can find usability problems more efficiently by analyzing only the first operations in runs of successive unnecessary operations.
- For unnecessary operations (first), the ratio for the original method was larger than that for either of the two variations. For missed operations, the ratio for the original method was larger than that for variation B but smaller than that for variation A. This indicates that both the original method and its variation A are promising.
See Section 4 for the conclusion of this research on the log-based usability evaluation
method.
3. Automated Method for Webpage Accessibility Evaluation
3.1 Basic idea
<Table> tags are used for (a) expressing data tables, as the tags were originally designed for, or
(b) adjusting the layout of document contents. In the case where a <table> tag is used for the
table-purpose, the data in the same row or column are semantically related with each other
(e.g., values of the same property for several data items, or values of several properties for
the same data item) and the relation can be expressed by row/column headers. By contrast,
in the case where a <table> tag is used for the layout purpose, the data in the same
row or column may not be semantically related. Thus, a fundamental approach to deducing the
purpose of a <table> tag would be to analyze whether or not the data in the same
row/column of the table are semantically related. Automating this analysis with a computer
program requires a method for semantic analysis of table contents, but automated semantic
analysis with enough precision and generality for arbitrary webpages is hard to achieve.
Our basic idea focuses on machine-readable design attributes of a table instead of analyzing
semantics of data in a table. If some common design pattern is found in some attribute
values of layout-purpose <table> tags and another common design pattern is found in the
attribute values of table-purpose <table> tags, the two purposes can be discriminated by
writing classification rules based on the design patterns. However, it is unknown whether
such design patterns can be found, and even if they can, it will be hard to manually
investigate common patterns by analyzing design attribute values in a large number of
<table> tag instances and to write sufficient rules to classify the tags precisely. In our
research, the authors manually investigate design attributes only; to automatically derive
classification rules, a machine learning method is utilized.
Among several kinds of machine learning methods available, ID3 (Quinlan, 1986), a method
for deriving a decision tree from a set of data instances, is utilized. An advantage of a
decision tree as a classifier over other forms of classifiers (e.g., a multi-layered neural
network) is that the classification rules are obtained as explicit, tree-formed if-then rules
and are thus easy for human users of the method to read.
3.2 <Table> tag design attributes for classification rules
The authors first collected and investigated <table> tag instances (webpage HTML sources
in which <table> tags were included) and extracted design attributes of <table> tags that
were likely to contribute to the classification. Webpages were collected from various
categories of the Yahoo! web directory so that the pages were not biased toward particular
page categories. The authors manually judged the purposes of the <table> tag instances in the
collected pages. In this way, 200 <table> tags were collected, half of which were layout-purpose
ones and the other half table-purpose ones. The authors then extracted common design
patterns from each set of <table> tags. The findings were as follows.
Common design patterns in layout-purpose <table> tags:
- <Table> tags are nested.
- Some cell(s) in the table include(s) image(s).
- The number of HTML tags that appear before the <table> tag in the source is small.
- Some cells in the table are spanned.
- The width and/or height are/is specified.
Common design patterns in table-purpose <table> tags:
- The table has visible border lines.
- The table includes many rows.
- <Th> tags for row/column headers are included.
- Table titles are specified.
By writing classification rules based on these common design patterns, it will be possible
to deduce the purposes of <table> tags to a certain extent. To derive the rules in the form of a
decision tree by ID3, the 10 attributes in Table 4 were determined.
- Binary values for the eight attributes "border" through "width" indicate whether or not the table has visible borders, a caption, etc. For example, if the border attribute is specified for the <table> tag and its value is 1 or more, then border = Y (Yes); otherwise border = N (No).
- The value of the attribute "num_tag" is the number of HTML tags that appear before the <table> tag in the source. Tags in the header part of the page are not counted.
- The value of the attribute "num_tr" is the number of rows (<tr> tags) in the table.
Name      Meaning                                                                    Values
border    Whether the table has visible border lines.                                Binary
caption   Whether the table has a caption.                                           Binary
height    Whether the table height is specified.                                     Binary
img       Whether a cell in the table includes an item of image data.                Binary
nest      Whether the table includes a nested table in itself.                       Binary
span      Whether some cells are spanned.                                            Binary
th        Whether a row/column has a title header for the data in the row/column.    Binary
width     Whether the table width is specified.                                      Binary
num_tag   The number of HTML tags that appear before the <table> tag in the source.  Positive integer
num_tr    The number of rows in the table.                                           Positive integer
Table 4. Attributes for decision tree
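As an illustration of how such attribute values could be extracted automatically, the sketch below analyzes one parsed <table> element with the third-party BeautifulSoup library. It is our own simplified illustration, not the authors' attribute-analysis module: for example, a caption is detected only via a <caption> child, height/width only via the tag's own attributes, any colspan/rowspan attribute counts as spanned, and rows of nested tables are included in the row count.

    from bs4 import BeautifulSoup

    def table_attributes(table):
        """Extract the Table 4 attributes for one <table> element (a bs4 Tag)."""
        def visible_border():
            value = table.get("border")
            try:
                return value is not None and int(value) >= 1
            except ValueError:
                return False

        spanned = table.find(lambda t: t.name in ("td", "th")
                             and (t.get("colspan") or t.get("rowspan")))
        tags_before = [t for t in table.find_all_previous(True)
                       if t.name not in ("html", "head")
                       and t.find_parent("head") is None]
        return {
            "border": visible_border(),               # visible border lines
            "caption": table.find("caption") is not None,
            "height": table.get("height") is not None,
            "img": table.find("img") is not None,     # some cell contains an image
            "nest": table.find("table") is not None,  # a nested <table> inside
            "span": spanned is not None,              # some cells are spanned
            "th": table.find("th") is not None,       # row/column header cells
            "width": table.get("width") is not None,
            "num_tag": len(tags_before),              # tags before the <table> tag
            "num_tr": len(table.find_all("tr")),      # number of rows
        }

    def tables_in_page(html):
        soup = BeautifulSoup(html, "html.parser")
        return [table_attributes(t) for t in soup.find_all("table")]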
3.3 System configuration for the proposed method
Figure 5. System configuration for the proposed method: a learning subsystem analyzes <table> attribute values in webpages for learning and, together with the given purpose of each <table> tag, derives a decision tree by machine learning (ID3); a checking subsystem analyzes <table> attribute values in webpages under evaluation and outputs the classification of each <table> tag's purpose
Fig. 5 shows the system configuration for our method. In the learning phase, the method
derives rules for classifying <table> tags from a set of webpages in which layout/table-
purpose <table> tags are included. First, for each <table> tag in the learning data, attribute
values are obtained by analyzing the webpage HTML source in which the tag is included.
The purpose of each tag in the learning data is given by manual judgments in collecting the
learning data. From these data of attribute values and given purposes, classification rules
are derived by a machine learning method. In the case where ID3 is applied as the learning
method, the rules are obtained as a decision tree (i.e., tree-formed if-then rules). In the
checking phase, <table> tags (whose purpose is unknown) are classified by the rules
obtained in the learning phase. Attribute values of each <table> tag are obtained in the same
way as in the learning phase (the <table> attribute analysis module in Fig. 5 is shared by the
learning and checking subsystems). The obtained attribute values are matched against the
if-parts of the rules, and the purpose of each <table> tag is deduced by the rule whose
if-part is satisfied.
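For concreteness, the learning step could be sketched with a basic ID3-style induction as below. This is our own simplified illustration (binary yes/no splits, threshold candidates taken directly from observed values of the numeric attributes, no pruning), not the authors' implementation; training examples are assumed to be (attrs, label) pairs, where attrs is a dictionary such as the one returned by the table_attributes sketch above and label is "layout" or "table".

    import math
    from collections import Counter

    def entropy(labels):
        counts = Counter(labels)
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def information_gain(examples, test):
        yes = [e for e in examples if test(e[0])]
        no = [e for e in examples if not test(e[0])]
        if not yes or not no:
            return 0.0
        before = entropy([label for _, label in examples])
        after = sum(len(part) / len(examples) * entropy([l for _, l in part])
                    for part in (yes, no))
        return before - after

    def candidate_tests(examples, numeric=("num_tag", "num_tr")):
        tests = []
        for attr in examples[0][0]:
            if attr in numeric:
                # Threshold splits such as "num_tr <= 7" on the observed values.
                for t in sorted({attrs[attr] for attrs, _ in examples})[:-1]:
                    tests.append((f"{attr} <= {t}",
                                  lambda a, attr=attr, t=t: a[attr] <= t))
            else:
                tests.append((f"{attr} = Y", lambda a, attr=attr: a[attr]))
        return tests

    def majority_label(examples):
        return Counter(label for _, label in examples).most_common(1)[0][0]

    def id3(examples):
        labels = {label for _, label in examples}
        if len(labels) == 1:
            return labels.pop()                      # pure leaf: "layout" or "table"
        tests = candidate_tests(examples)
        if not tests:
            return majority_label(examples)
        name, test = max(tests, key=lambda nt: information_gain(examples, nt[1]))
        yes = [e for e in examples if test(e[0])]
        no = [e for e in examples if not test(e[0])]
        if not yes or not no:                        # no informative split is left
            return majority_label(examples)
        return {"test": name, "fn": test, "yes": id3(yes), "no": id3(no)}

    def classify(tree, attrs):
        while isinstance(tree, dict):
            tree = tree["yes"] if tree["fn"](attrs) else tree["no"]
        return tree

A tree learned with tree = id3(training_examples) can then be used in the checking subsystem as classify(tree, table_attributes(tag)).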
3.4 Evaluation of the proposed method
3.4.1 Evaluation method
To evaluate the effectiveness of our method, the authors investigated the classification
accuracy by 10-fold cross validation (CV) with the 200 <table> tags collected (see 3.2). The
200 tags were randomly divided into 10 groups (G1, G2, ..., G10). A decision tree was
derived by applying ID3 to the 180 <table> tags in the 9 groups excluding Gi, and the
classification accuracy was checked with the 20 <table> tags in Gi (i = 1, 2, ..., 10). The accuracy rate was
calculated as follows.
Accuracy Rate = (Number of <table> tags correctly classified)/20 (4)
Ten values of the accuracy rate were obtained in one trial of 10-fold CV. The authors ran the
10-fold CV three times and statistically investigated the accuracy rates, as sketched below.
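The cross-validation procedure and the accuracy rate of Eq. (4) could be sketched as follows (our illustration, reusing id3 and classify from the sketch in Section 3.3); with 200 examples each fold holds 20 tags, matching the denominator of Eq. (4).

    import random

    def ten_fold_cv(examples, seed=0):
        """Return the 10 accuracy rates of one 10-fold CV trial (Eq. (4))."""
        shuffled = examples[:]
        random.Random(seed).shuffle(shuffled)
        folds = [shuffled[i::10] for i in range(10)]        # 10 groups G1..G10
        rates = []
        for i in range(10):
            training = [e for j, fold in enumerate(folds) if j != i for e in fold]
            tree = id3(training)
            correct = sum(1 for attrs, label in folds[i]
                          if classify(tree, attrs) == label)
            rates.append(correct / len(folds[i]))
        return rates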
3.4.2 Decision trees obtained by ID3
By the three trials of 10-fold CV, 30 decision trees were obtained in total. Examples of the
decision trees are shown in Tables 5-7 (only the nodes down to depth 3 are shown). Values in
the “No.” column are the serial numbers of the nodes where the #0 node is the root node.
Tables 5-7 denote parent-child node relationships by indents in the “Rule Element” column.
For example, in Table 5, the #1 and #8 nodes are child nodes of the #0 node and the #2 and
#5 nodes are child nodes of the #1 node. Values in the “Rule Element” column are the
conditions in if-parts of rules. Conditions that appear in a path from the root node to a leaf
node are connected with AND. Values in the “Table” and “Layout” columns are the
numbers of table/layout-purpose <table> tags in the learning data included in the node. For
example, Table 5 shows the following.
- The root node includes 89 table-purpose tags and 91 layout-purpose tags (see the #0 node).
- Of the 180 <table> tags in the root node, those with border = N number 86; nine of these are table-purpose ones and the other 77 are layout-purpose ones (see the #1 node). Those with border = Y number 94; eighty of these are table-purpose ones and the other 14 are layout-purpose ones (see the #8 node).
- Leaf nodes are those for which the value in the "Table" column or the "Layout" column
is 0. A leaf node corresponds to an if-then rule. For example, the #3 node denotes that
all tags that meet the condition:
(border = N) and (num_tr <= 7) and (num_tag <= 12)
are layout-purpose ones. Thus, the following rule is obtained from the #3 node.
“If (border = N) and (num_tr <= 7) and (num_tag <= 12) then the tag is a layout-
purpose one.”
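The rules can be read off a learned tree mechanically. The sketch below (our own, operating on the tree structure used in the id3 sketch of Section 3.3) lists one if-then rule per leaf.

    def rules_from_tree(tree, conditions=()):
        """List the if-then rules corresponding to the leaves of an id3() tree."""
        if not isinstance(tree, dict):                      # a leaf holds the purpose
            body = " and ".join(conditions) if conditions else "always"
            return ["If " + body + " then the tag is a " + tree + "-purpose one."]
        name = tree["test"]
        return (rules_from_tree(tree["yes"], conditions + ("(" + name + ")",))
                + rules_from_tree(tree["no"], conditions + ("not (" + name + ")",)))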
The decision trees shown in Tables 5-7 are typical of the 30 trees obtained; the other trees had
structures similar to one of these three examples from the root node down to depth 3. The 30
trees reveal that the attributes border, num_tr, num_tag and nest are likely to appear in the
upper layers of a tree, i.e., these attributes contribute well to the classification.
No.  Rule Element              Table  Layout  Total
0    (root)                    89     91      180
1      border = N              9      77      86
2        num_tr <= 7           1      71      72
3          num_tag <= 12       0      66      66
4          num_tag > 12        1      5       6
5        num_tr > 7            8      6       14
6          nest = N            8      1       9
7          nest = Y            0      5       5
8      border = Y              80     14      94
9        nest = N              78     5       83
10         img = N             59     2       61
11         img = Y             19     3       22
12       nest = Y              2      9       11
13         height = N          0      5       5
14         height = Y          2      4       6
Table 5. Example (1) of decision trees obtained in the experiment
No.  Rule Element              Table  Layout  Total
0    (root)                    89     91      180
1      num_tr <= 6             15     82      97
2        border = N            3      69      72
3          img = N             3      18      21
4          img = Y             0      51      51
5        border = Y            12     13      25
6          nest = N            12     5       17
7          nest = Y            0      8       8
8      num_tr > 6              74     9       83
9        nest = N              72     1       73
10         border = N          9      1       10
11         border = Y          63     0       63
12       nest = Y              2      8       10
13         num_tag <= 7        0      8       8
14         num_tag > 7         2      0       2
Table 6. Example (2) of decision trees obtained in the experiment
No.  Rule Element              Table  Layout  Total
0    (root)                    86     94      180
1      num_tr <= 6             15     84      99
2        num_tag <= 12         3      76      79
3          img = N             3      20      23
4          img = Y             0      56      56
5        num_tag > 12          12     8       20
6          nest = N            12     4       16
7          nest = Y            0      4       4
8      num_tr > 6              71     10      81
9        nest = N              69     1       70
10         border = N          10     1       11
11         border = Y          59     0       59
12       nest = Y              2      9       11
13         border = N          0      8       8
14         border = Y          2      1       3
Table 7. Example (3) of decision trees obtained in the experiment
3.4.3 Evaluation of classification accuracy
Accuracy rates obtained in the three trials of 10-fold CV are shown in Table 8. Ten accuracy
rates are obtained in a single trial of CV, and the values in Table 8 are the minimum, maximum,
mean and SD of the ten accuracy rates for each trial. In all three trials the maximum rate was
100%, i.e., in the best fold all checking <table> tags were correctly classified. The mean values
were around 90% and the SD values were small, which supports the effectiveness of our method.
Improving the minimum accuracy rates is a remaining research challenge.
4. Conclusions
In this chapter, Section 2 described our method that extracts cues for finding usability
problems from user/desired logs of clicked points. To detect inconsistencies between user
and desired logs, the method compares operations in the logs. It compares user/desired
operations by modeling each operation as a vector derived from the coordinate values of the
clicked points and checking the distance between two vectors. The distance was defined as a
weighted sum of the distance between start points and the size of the difference between the
two vectors. The method extracts two types of inconsistencies: unnecessary and missed
operations. The effectiveness of the proposed method was evaluated based on a case study.
Each of two human evaluators tried to find usability problems for ten websites, one with the
proposed method and the other with the manual method. The proposed method contributed
to finding 61% of the usability problems found by the manual method in a much smaller
amount of time: the number of clicks analyzed by an evaluator with the proposed method was
only 10-20% of that with the manual method. This result indicates that the method will help
evaluators quickly, if roughly, focus their attention on problem cues in user interactions. In
our future work, additional case studies are necessary for further evaluation and improvement
of the method.
        1st CV   2nd CV   3rd CV
Min.    85       80       80
Max.    100      100      100
Mean    92       89       94
SD      6.0      7.5      6.7
Table 8. Classification accuracy rates (%)
Section 3 described our method for detecting layout-purpose <table> tags in webpage
HTML sources. A machine learning method was utilized for deriving <table> tag classifiers.
A system was developed that utilized ID3 as the machine learning method. The system
derives a decision tree as the classifier from a set of <table> tag data for learning.
Classification accuracy was evaluated by 10-fold CV with the 200 <table> tags collected from
the web. It was found that the purposes could roughly be discriminated with the attributes
border, num_tr, num_tag and nest shown in Table 4: these attributes were likely to appear
in the upper layers of the decision trees. In the experiment with the 200 <table> tags collected,
mean accuracy rates were around 90%. In our future work, it will be investigated whether
machine learning methods other than ID3 (e.g., C4.5, multi-layered neural network, support
vector machine) can improve the accuracy. These methods will be utilized in our system and
the classification accuracy rates will be compared among the methods.
5. References
Abascal, J.; Arrue, M.; Fajardo I.; Garay, N. & Tomas, J. (2004). Use of guidelines to
automatically verify web accessibility, Universal Access in the Information Society,
Vol.3, No.1, pp.71-79.
Al-Qaimari, G. & Mcrostie, D. (1999). KALDI: a computer-aided usability engineering tool
for supporting testing and analysis of human computer interaction, Proc. of the
Third Int. Conf. on Computer-Aided Design of User Interfaces, pp.337-355.
Beirekdar, A.; Keita, M.; Noirhomme, M.; Randolet, F.; Vanderdonckt, J. & Mariage, C.
(2005). Flexible reporting for automated usability and accessibility evaluation of
web sites, Lecture Notes in Computer Science, Vol.3585 (Proc. of 10th IFIP TC 13 Int.
Conf. on Human-Computer Interaction (INTERACT2005)), pp.281-294.
Brajnik, G. (2004). Comparing accessibility evaluation tools: a method for tool effectiveness,
Universal Access in the Information Society, Vol.3, Nos.3-4, pp.252-263.
Brinck, T.; Hermann, D.; Minnebo, B. & Hakim, A. (2002). AccessEnable: a tool for
evaluating compliance with accessibility standards, CHI’2002 Workshop on
Automatically Evaluating the Usability of Web Sites, available at
http://simplytom.com/research/AccessEnable_workshop_paper.pdf.
Cooper, M. (1999). Evaluating accessibility and usability of web sites, Proc. of 3rd Int. Conf. on
Computer-Aided Design of User Interfaces (CADUI’99), pp.33-42.
Gellner, M. & Forbrig, P. (2003). ObSys - a tool for visualizing usability evaluation patterns
with Mousemaps, Proc. of the Tenth Int. Conf. on Human-Computer Interaction, pp.469-
473.
Helfrich B. & Landay, J. A. (1999). QUIP: quantitative user interface profiling, available at
http://www.helcorp.com/bhelfrich/helfrich99quip.pdf.
Ivory, M. Y. (2003). Automated web site evaluation: researchers’ and practitioners’ perspectives,
Kluwer Academic Publishers.
Ivory, M. Y. & Hearst, M. A. (2001). The state of the art in automated usability evaluation of
user interfaces, ACM Computing Surveys, Vol.33, No.4, pp.1-47.
Ivory, M. Y.; Mankoff, J. & Le, A. (2003). Using automated tools to improve web site usage
by users with diverse abilities, IT&Society, Vol.1, No.3, pp.195-236. available at
http://www.stanford.edu/group/siqss/itandsociety/v01i03/v01i03a11.pdf.
Ivory, M. Y. & Megraw, R. (2005). Evolution of web site design patterns, ACM Trans. on
Information Systems, Vol.23, No. 4, pp.463-497.
Kishi, N. (1995). Analysis tool for skill acquisition with graphical user interfaces based on
operation logging, Proc. of the 6th Int. Conf. on Human-Computer Interaction, pp.161-
166.
Okada, H. & Asahi, T. (1996). GUITESTER: a log-based usability testing tool for graphical
user interfaces, IEICE Transaction on Information and Systems, Vol.E82-D, No.6,
pp.1030-1041.
Perlman, G. (2008). Accessibility links – tools, HCI Bibliography, available at
http://www.hcibib.org/accessibility/#TOOLS.
Qiang, Y.; Suzuki, T.; Sakuragawa, S.; Tamura, H. & Kurosu, M. (2005). The development of
OBSERVANT EYE to effective implement observation records for usability testing,
Proc. of the 11th Int. Conf. on Human-Computer Interaction, CD-ROM.
Quinlan, J. R. (1986). Induction of decision trees, Machine Learning, Vol.1, pp.81-106.
Scapin, D.; Leulier, C.; Vanderdonckt, J.; Mariage, C.; Bastien, C.; Farenc, C.; Palanque, P. &
Bastide, R. (2000). A Framework for organizing web usability guidelines, Proc. of the
6th Conf. on Human Factors & the Web, available at
http://www.isys.ucl.ac.be/bchi/publications/2000/Scapin-HFWeb2000.htm.
Uehling, D. L. & Wolf, K. (1995). User action graphing effort (UsAGE), Proc. of the Conf. on
Human Factors in Computing Systems, pp.290-291.
Vanderdonckt, J. & Beirekdar, A. (2005). Automated web evaluation by guideline review,
Journal of Web Engineering, Vol.4, No.2, pp.102-117.
W3C. (1999a). HTML 4.01 specification, available at http://www.w3.org/TR/html4/.
W3C. (1999b). Web content accessibility guidelines 1.0, available at
http://www.w3.org/TR/WCAG10/.
W3C. (2006). Complete list of web accessibility evaluation tools, available at
http://www.w3.org/WAI/ER/tools/complete.
Zettlemoyer, L. S.; St.Amant, R. S. & Dulberg, M. S. (1999). IBOTS: Agent control through the
user interface, Proc. of the Int. Conf. on Intelligent User Interfaces, pp.31-37.