Psychologically-Inspired, Unsupervised Inference of Perceptual
Groups of GUI Widgets from GUI Images
Mulong Xie
mulong.xie@anu.edu.au
Australian National University
Australia
Zhenchang Xing
zhenchang.xing@anu.edu.au
CSIRO’s Data61 & Australian
National University
Australia
Sidong Feng
sidong.feng@monash.edu
Monash University
Australia
Chunyang Chen
chunyang.chen@monash.edu
Monash University
Australia
Liming Zhu
liming.zhu@data61.csiro.au
CSIRO’s Data61 & School of CSE,
UNSW
Australia
Xiwei Xu
xiwei.xu@data61.csiro.au
CSIRO’s Data61
Australia
ABSTRACT
A Graphical User Interface (GUI) is not merely a collection of individual and unrelated widgets; rather, it partitions discrete widgets into groups by various visual cues, thus forming higher-order perceptual units such as tab, menu, card or list. The ability to automatically segment a GUI into perceptual groups of widgets constitutes a fundamental component of visual intelligence to automate GUI design, implementation and automation tasks. Although humans can partition a GUI into meaningful perceptual groups of widgets in a highly reliable way, perceptual grouping is still an open challenge for computational approaches. Existing methods rely on ad-hoc heuristics or supervised machine learning that is dependent on specific GUI implementations and runtime information. Research in psychology and biological vision has formulated a set of principles (i.e., the Gestalt theory of perception) that describe how humans group elements in visual scenes based on visual cues like connectivity, similarity, proximity and continuity. These principles are domain-independent and have been widely adopted by practitioners to structure content on GUIs to improve aesthetic pleasantness and usability. Inspired by these principles, we present a novel unsupervised image-based method for inferring perceptual groups of GUI widgets. Our method requires only GUI pixel images, is independent of GUI implementation, and does not require any training data. The evaluation on a dataset of 1,091 GUIs collected from 772 mobile apps and 20 UI design mockups shows that our method significantly outperforms the state-of-the-art ad-hoc heuristics-based baseline. Our perceptual grouping method creates opportunities for improving UI-related software engineering tasks.
1 INTRODUCTION
We do not just see a collection of separated texts, images, buttons, etc., on GUIs. Instead, we see perceptual groups of GUI widgets, such as the card, list, tab and menu shown in Figure 1. Forming perceptual groups is an essential step towards visual intelligence. For example, it helps us decide which actions are most applicable to certain GUI parts, such as clicking a navigation tab, expanding a card, or scrolling a list. This would enable more efficient automatic GUI testing [19, 28]. As another example, screen readers [3, 8] help visually impaired users access applications by reading out content on the GUI. Recognizing perceptual groups would allow screen readers to navigate the GUI at higher-order perceptual units (e.g., sections) efficiently [51]. Last but not least, GUI requirements, designs and implementations are much more volatile than business logic and functional algorithms. With perceptual grouping, modular, reusable GUI code can be automatically generated from GUI design images, which would expedite rapid GUI prototyping and evolution [30, 31].

Figure 1: Examples of perceptual groups of GUI widgets (perceptual groups are highlighted in pink boxes in this paper)

Figure 2: The implemented view hierarchy does not necessarily correspond to perceptual groups
Figure 3: Left: Our approach overview: (1) Enhanced UIED [18] for GUI widget detection; (2) Gestalt-principles-inspired perceptual grouping. Right: Grouping result of the state-of-the-art heuristic-based approach Screen Recognition [51]
Although humans can intuitively see perceptual groups of GUI widgets, current computational approaches are limited in partitioning a GUI into meaningful groups of widgets. Some recent work [12, 15] relies on supervised deep learning methods (e.g., image captioning [27, 44]) to generate a view hierarchy for a GUI image. This type of method is heavily dependent on GUI data availability and quality. To obtain sufficient GUI data for model training, they use GUI screenshots and view hierarchies obtained at application runtime. A critical quality issue of such runtime GUI data is that runtime view hierarchies often do not correspond to intuitive perceptual groups due to many implementation-level tricks. For example, in the left GUI in Figure 2, the two ListItems in a ListView have no visual similarity (a large image versus some texts), so they do not form a perceptual group. In the right GUI, a grid of cards forms a perceptual group but is implemented as individual FrameLayouts. Such inconsistencies between the implemented widget groups and humans' perceptual groups make the trained models unreliable for detecting perceptual groups of GUI widgets.
Decades of psychology and biological vision research have formulated the Gestalt theory of perception, which explains how humans see the whole rather than individual and unrelated parts. It includes a set of principles of grouping, among which connectedness, similarity, proximity and continuity are the most essential ones [7, 38]. Although these principles and other related UI design principles such as CRAP [35] greatly influence how designers and developers structure GUI widgets [41], they have never been systematically used to automatically infer perceptual groups from GUI images. Rather, current approaches [31, 51] rely on ad-hoc and case-specific rules and thus are hard to generalize to diverse GUI designs.
In this work, we systematically explore the Gestalt principles of grouping and design the first psychologically-inspired method for visual inference of perceptual groups of GUI widgets. Our method requires only GUI pixel images and is independent of GUI implementations. Our method is unsupervised, thus removing the dependence on problematic GUI runtime data. As shown in Figure 3, our method enhances the state-of-the-art GUI widget detection method (UIED [18, 46]) to detect elementary GUI widgets. Following the Gestalt principles, the method first detects containers (e.g., card, list item) with complex widgets by the connectedness principle. It then clusters visually similar texts (or non-text widgets) by the similarity principle and further groups clusters of widgets by the proximity principle. Finally, based on the widget clusters, our method corrects erroneously detected or missing GUI widgets by the continuity principle (not illustrated in Figure 3, but shown in Figure 5). At the right end of Figure 3, we show the grouping result by the state-of-the-art heuristic-based method Screen Recognition [51].
Screen Recognition incorrectly partitions many widgets into groups, such as the bottom navigation bar and the four widgets above the bar, or the card on the left and the text above the card. It also fails to detect higher-order perceptual groups, such as groups of cards. In contrast, our approach correctly recognizes the bottom navigation bar and the top and middle rows of cards. Although the text label above the left card is very close to the card, our approach correctly recognizes the text labels as separate widgets rather than as a part of the left card. Our approach does not recognize the two cards just above the bottom navigation bar because these two cards are partially occluded by the bottom bar. However, it correctly recognizes the two blocks of image and text and detects them as a group. Clearly, the grouping results by our approach correspond more intuitively to human perception than those by Screen Recognition.
For the evaluation, we construct two datasets: one contains 1,091 GUI screenshots from 772 Android apps, and the other contains 20 UI prototypes from the popular design tool Figma [4]. To ensure the validity of ground-truth widget groups, we manually check all these GUIs and confirm that none of them has the perception-implementation misalignment issues shown in Figure 2. We first examine our enhanced version of UIED and observe that the enhanced version reaches a 0.626 F1 score for GUI widget detection, which is much higher than the original version (0.524 F1). With the detected GUI widgets, our perceptual grouping approach achieves an F1-score of 0.593 on the 1,091 app GUI screenshots and a 0.783 F1-score on the 20 UI design prototypes. To understand the impact of GUI widget misdetections on perceptual grouping, we extract the GUI widgets directly from the Android apps' runtime metadata (i.e., ground-truth widgets) and use the ground-truth widgets as the inputs to perceptual grouping. With such "perfectly-detected" GUI widgets, our grouping approach achieves a 0.672 F1-score on app GUIs. In contrast, Screen Recognition [51] performs very poorly:
0.123 F1 on the ground-truth widgets and 0.092 F1 on the detected
widgets for app screenshots, and 0.232 F1 on the detected widgets
for UI design prototypes. Although our grouping results sometimes
do not exactly match the ground-truth groups, our analysis shows
that some of our grouping results still comply with how humans
perceive the widget groups because there can be diverse ways to
structure GUI widgets in some cases.
To summarize, this paper makes the following contributions:
• A robust, psychologically-inspired, unsupervised visual inference method for detecting perceptual groups of GUI widgets on GUI images; the code is released on GitHub¹.
• A comprehensive evaluation of the proposed approach and an in-depth analysis of its performance with examples.
• A high-quality dataset² from mobile apps and UI design prototypes with verified perceptual groups of GUI widgets.
• An analysis of how our perceptual grouping method can improve UI-related SE tasks, such as UI design, implementation and automation.
2 GUI WIDGET DETECTION
Our approach is a pixel-only approach. It does not assume any GUI metadata or GUI implementation information about GUI widgets. Instead, our approach detects text and non-text GUI widgets directly from GUI images. To obtain the widget information from pixels, it enhances the state-of-the-art GUI widget detection technique UIED [46]. In order to fit with subsequent perceptual grouping, our approach mitigates UIED's incorrect merging of GUI widgets in containers and simplifies UIED's widget classification.
2.1 UIED-based GUI Widget Detection
UIED comprises three steps: (1) GUI text detection, (2) non-text GUI widget detection and (3) merging of text and non-text widgets. UIED uses an off-the-shelf scene text detector EAST [53] to identify text regions in GUI images. EAST is designed for handling natural scene images, which differ from GUI images in aspects such as figure-background complexity and lighting effects. Although EAST outperforms the traditional OCR tool Tesseract [40], we find that the latest OCR tool developed by Google [6] achieves the highest accuracy of GUI text recognition (see Section 4.1). Therefore, in our use of UIED, we replace EAST with the Google OCR tool.
For locating non-text widgets, our approach adopts the design of UIED that uses a series of traditional, unsupervised image processing algorithms rather than deep-learning models (e.g., Faster RCNN [24] or YOLO [33]). This design removes the data dependence on GUI implementation or runtime information while accurately detecting GUI widgets. UIED then merges the text and non-text detection results. The purpose of this merging step is not only to integrate the identified GUI widgets but also to cross-check the results. Because non-text widget detection inevitably extracts some text regions, UIED relies on the OCR results to remove these false-positive non-text widgets. Specifically, this step checks the bounding boxes of all candidate non-text widget regions and removes those that intersect with text regions resulting from the OCR.
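To make the cross-check concrete, here is a minimal sketch of the merging step, assuming widgets are plain (left, top, right, bottom) bounding boxes; the function names are illustrative and not UIED's actual API:

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)

def intersects(a: Box, b: Box) -> bool:
    """True if the two bounding boxes overlap."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def merge_detections(non_text_boxes: List[Box], ocr_text_boxes: List[Box]) -> List[Box]:
    """Cross-check the two detection results: drop candidate non-text regions that
    intersect OCR text regions (likely false positives), then combine the rest
    with the text widgets."""
    kept = [b for b in non_text_boxes
            if not any(intersects(b, t) for t in ocr_text_boxes)]
    return kept + ocr_text_boxes
```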
¹ https://github.com/MulongXie/GUI-Perceptual-Grouping
² Please find the dataset in the shared link.
2.2 Improvement and Simplification of UIED
We find that the UIED detection results often miss some widgets in a container (e.g., a card). The reason is that, in order to filter out invalid non-text regions and mitigate over-segmentation that wrongly splits a GUI widget into several parts, UIED checks the widgets' bounding boxes and merges all intersecting widget regions into one big widget region. This operation may erroneously discard valid GUI widgets that are enclosed in containers. Therefore, we equip our GUI widget detector with a container recognition algorithm (see Section 3.2) to mitigate this issue. If a widget is recognized as a container, then all its contained widgets are kept and regarded as proper GUI widgets rather than noise.
The original UIED classifies non-text GUI widgets into specific widget categories (e.g., image, button, checkbox). In contrast, our GUI widget detector only distinguishes text from non-text widgets. Although a GUI involves many types of non-text widgets, there is no need to distinguish the actual classes of non-text widgets for perceptual grouping. GUI widget classes indicate different user interactions and GUI functionalities, but widgets of different classes can form a perceptual group as long as they have similar visual properties, such as size, shape, relative position and alignment with other widgets. Therefore, we do not distinguish different classes of non-text widgets. However, we need to distinguish non-text widgets from text widgets, as they have very different visual properties and need to be clustered by different strategies (see Section 3).
3 GUI WIDGET PERCEPTUAL GROUPING
After obtaining text and non-text widgets on a GUI image, the next
step is to partition GUI widgets into perceptual groups (or blocks
of items) according to their visual and perceptual properties.
3.1 Gestalt Laws and Approach Overview
Our approach is inspired by psychology and biological-vision research. Perceptual grouping is a cognitive process in which our minds leap from comprehending all of the objects as individuals to recognizing visual patterns through grouping visually related elements as a whole [23]. This process affects the way we design GUI layouts [35], from alignment, spacing and grouping tool support [4, 37] to UI design templates [21] and GUI frameworks [2]. It also explains how we perceive GUI layouts. For instance, in the examples in Figure 1, we subconsciously observe that some visually similar widgets are placed in a spatially similar way and identify them as a group (e.g., a card, list, multitab or menu).
Previous studies rely on ad-hoc, rigid heuristics to infer UI structure without a systematic theoretical foundation. Our approach is the first attempt to tackle the perceptual grouping of GUI widgets guided by an influential psychological theory (Gestalt psychology [7]) that explains how the human brain perceives objects and patterns. Gestalt psychology's core proposition is that humans understand external stimuli as wholes rather than as the sums of their parts [39]. Based on this proposition, the Gestaltists studied perceptual grouping [23] systematically and summarized a set of "gestalt laws of grouping" [38]. In our work, we adopt the four most effective principles, which greatly influence GUI design in practice [41], as the guideline for our approach design: (1) connectedness, (2) similarity, (3) proximity and (4) continuity.
We dene a group of related GUI widgets as a layout block of
items. A typical example is a list of list items in the GUI, or a card
displaying an image and some texts. The fundamental intuition
is: if a set of widgets have similar visual properties and are placed
in alignment with similar space between each other, they will be
“perceived” as in the same layout block by our approach according
to the Gestalt principles. In detail, our approach consists of four
grouping steps in accordance with four Gestalt principles. First, it
identies containers along with their contained widgets that full
the connectedness law. Second, it uses an unsupervised clustering
method DBSCAN [
22
] to cluster text or non-text GUI widgets based
on their spatial and size similarities. Next, it groups proximate and
spatially aligned clusters to form a larger layout block following
the proximity law. Finally, in line with the continuity principle,
our approach corrects some mistakes of GUI widget detection by
checking the consistency of the groups’ compositions.
Next, we elaborate on the adopted principles and the corresponding steps with some illustrative examples. In our discussion, we define the center of a widget as (CenterX, CenterY), and the top-left and bottom-right corners of its bounding box as (Top, Left) and (Bottom, Right).
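For reference in the following subsections, a minimal widget record could look as follows; this is an illustrative data structure, not the released implementation, and the attribute names simply mirror the notation defined above:

```python
from dataclasses import dataclass

@dataclass
class Widget:
    left: int
    top: int
    right: int
    bottom: int
    is_text: bool  # text vs. non-text is the only class distinction we keep

    @property
    def center(self) -> tuple:
        """(CenterX, CenterY) of the bounding box."""
        return ((self.left + self.right) / 2, (self.top + self.bottom) / 2)

    @property
    def area(self) -> int:
        """Bounding-box area, used for size-based clustering."""
        return (self.right - self.left) * (self.bottom - self.top)

    @property
    def top_left(self) -> tuple:
        """(Top, Left) corner, used for clustering text widgets."""
        return (self.top, self.left)
```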
3.2 Connectedness - Container Recognition
In Gestalt psychology, the principle of uniform connectedness is the strongest principle concerned with relatedness [34]. It implies that we perceive elements connected by uniform visual properties as being more related than those not connected. The form of the connection can be either a line connecting several elements or a shape boundary that encloses a group of related elements. In a GUI, connectedness usually manifests as a box container that holds multiple widgets, and all the enclosed widgets are perceived as being in the same group. Thus, the first step of our grouping approach is to recognize the containers in a GUI.
In particular, we observe that a container is visually a (rounded) rectangular wireframe enclosing several child widgets. The card is a typical example of such containers, as shown in Figure 1(a). Therefore, with the detected non-text widgets, the algorithm first checks if a widget is of rectangular shape by counting how many straight lines its boundary comprises and how they are composed. Specifically, we apply the geometric rule that a rectangle's sides are made of four straight lines perpendicular to each other. Subsequently, our approach determines if the widget's boundary is a wireframe border by checking if it is connected with any widgets inside its boundary. If a widget satisfies the above criteria, it is identified as a container, and all widgets contained within it are partitioned into the same perceptual group.
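A rough sketch of this container check, assuming each detected non-text widget exposes its OpenCV boundary contour and a (left, top, right, bottom) bounding box; it approximates the rule described above (a four-sided wireframe enclosing other widgets) and is not the exact released code:

```python
import cv2

def is_rectangular(contour, tolerance: float = 0.02) -> bool:
    """Approximate the boundary contour with a polygon and check whether it reduces
    to four corners forming a convex (i.e., roughly rectangular) shape."""
    perimeter = cv2.arcLength(contour, True)
    polygon = cv2.approxPolyDP(contour, tolerance * perimeter, True)
    return len(polygon) == 4 and cv2.isContourConvex(polygon)

def encloses(outer, inner) -> bool:
    """True if the inner box lies fully inside the outer box; boxes are (l, t, r, b)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])

def recognize_containers(widgets):
    """Return (container, children) pairs: a non-text widget with a rectangular
    wireframe boundary that encloses other widgets becomes a container, and its
    enclosed widgets are put into the same perceptual group."""
    containers = []
    for w in widgets:
        if w.is_text or not is_rectangular(w.contour):
            continue
        children = [o for o in widgets if o is not w and encloses(w.box, o.box)]
        if children:
            containers.append((w, children))
    return containers
```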
3.3 Similarity - Widget Clustering
The principle of similarity suggests that elements are perceptually grouped together if they are similar to each other [9]. Generally, similarity can be observed in various visual cues, such as size, color, shape or position. For example, in the second GUI of Figure 1, the image widgets are of the same size and aligned with each other in the same way (i.e., same direction and spacing), so we visually perceive them as a group. Likewise, the text pieces on the right of the image widgets are perceptually similar, even though they have different font styles and lengths, because they share the same alignment with neighbouring texts.

Figure 4: Widget clustering, cluster conflict resolution and the final resulting groups, in which we use the same color to paint the widgets in the same subgroup and highlight higher-order groups in pink boxes
3.3.1 Initial Widget Clustering. Our approach identifies similar GUI widgets by their spatial and visual properties and aggregates similar GUI widgets into blocks by similarity-based clustering. It clusters text and non-text widgets with different strategies. In general, similar non-text widgets in the same block (e.g., a list) usually have similar sizes and align to one another vertically or horizontally with the same spacing. Texts in the same block are always left-justified or top-justified (assuming left-to-right text orientation), but their sizes and shapes can vary significantly because of different lengths of text content. Thus, the approach clusters the non-text widgets by their center points (CenterX, CenterY) and areas, and it clusters texts by their top-left corner (Top, Left).
Our approach uses the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm [22] to implement the clustering. Intuitively, DBSCAN groups points that are closely packed together (points with many nearby neighbors), while marking points whose distance from the nearest neighbor is greater than a maximum threshold as outliers. In the GUI widget clustering context, a point is a GUI widget, and the distance is the difference between the values of the widgets' attribute on which the clustering is based.
Figure 4 illustrates the clustering process. For non-text widgets, our approach performs the clustering three times based on three attributes respectively. It first clusters the widgets by CenterX for the horizontal alignment, then by CenterY for the vertical alignment and finally by area. These operations produce three clusters: Cluster^nontext_horizontal, Cluster^nontext_vertical and Cluster^nontext_area. Our approach then clusters the text widgets twice based on their top-left corner point (Top, Left) for left-justified (vertical) and top-justified (horizontal) alignment. It produces Cluster^text_horizontal based on the texts' Top, and Cluster^text_vertical based on the texts' Left. The resulting clusters are highlighted by different colors and numbers in Figure 4. We only keep the clusters with at least two widgets and discard those with only one widget.
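A minimal sketch of one clustering pass with scikit-learn's DBSCAN, clustering widgets along a single attribute; the eps value is an illustrative assumption, and the same routine would be reused for CenterX, CenterY, area, Top and Left:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_by_attribute(widgets, values, eps=10.0, min_samples=2):
    """Cluster widgets along one attribute (e.g., CenterY). DBSCAN marks widgets
    whose attribute value is farther than eps from any neighbor as outliers (-1);
    min_samples=2 keeps only clusters with at least two widgets."""
    features = np.asarray(values, dtype=float).reshape(-1, 1)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(features)
    clusters = {}
    for widget, label in zip(widgets, labels):
        if label != -1:
            clusters.setdefault(label, []).append(widget)
    return list(clusters.values())

# e.g., rows of non-text widgets sharing a similar CenterY:
# rows = cluster_by_attribute(non_text_widgets, [w.center[1] for w in non_text_widgets])
```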
Figure 5: Examples of widget detection error correction (1st column - green box: non-text, red box: text; 2nd column - same color: higher-order perceptual group; 3rd column - same color: subgroup of widgets)
3.3.2 Cluster Conflict Resolution. It is common for some widgets to be clustered into different clusters by different attributes, which causes cluster conflicts. For example, as illustrated in Figure 4, several non-text widgets (e.g., the bottom-left image) are both in a vertical cluster (marked in blue) and a horizontal cluster (marked in red). The intersection of clusters illustrates the conflict. The approach must resolve such cluster conflicts to determine to which group each widget belongs. This conflict-resolving step also complies with the similarity principle, which suggests that the widgets in the same perceptual group should share more similar properties.
The conflict-resolving step first calculates the average widget area of each group to which the conflicting widget has been assigned. In accordance with the similarity principle, the widget is more likely to belong to a group whose average widget area is similar to the conflicting widget's area. In addition, another observation is that repetitive widgets in a group have similar spacing between each other. So for a widget that is clustered into multiple candidate groups, the approach checks the difference between the spacing of this widget to its neighboring widgets in a group and the average spacing between the other widgets in that group. It keeps the widget in the group where the conflicting widget has the largest widget-area similarity and the smallest spacing difference compared with the other widgets in the group. For example, the bottom-left image widget is assigned to the vertical cluster rather than the horizontal one according to our conflict-resolving criteria. After conflict resolving, our approach produces the final widget clustering results, as shown in the right part of Figure 4. We use different colors and indices to illustrate the resulting non-text (nt) and text (t) clusters.
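A simplified sketch of this conflict-resolution rule; the way area similarity and spacing difference are combined into a single score is an assumption of this sketch, since the exact weighting is not spelled out above, and widgets are assumed to expose the bounding-box attributes and area defined earlier:

```python
def gap(a, b):
    """Empty space between two widgets' bounding boxes (0 if they touch or overlap)."""
    horizontal = max(b.left - a.right, a.left - b.right)
    vertical = max(b.top - a.bottom, a.top - b.bottom)
    return max(0, horizontal, vertical)

def resolve_conflict(widget, candidate_groups):
    """Keep the conflicting widget in the candidate group whose average widget area
    is closest to its own area and whose spacing pattern it fits best."""
    def score(group):
        others = [w for w in group if w is not widget]
        avg_area = sum(w.area for w in others) / len(others)
        area_diff = abs(widget.area - avg_area) / max(avg_area, 1)
        gaps = [gap(a, b) for a, b in zip(others, others[1:])] or [0]
        avg_gap = sum(gaps) / len(gaps)
        nearest_gap = min(gap(widget, w) for w in others)
        spacing_diff = abs(nearest_gap - avg_gap) / max(avg_gap, 1)
        return area_diff + spacing_diff   # lower means a better fit
    return min(candidate_groups, key=score)
```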
3.4 Proximity - Iterative Group Pairing
So far, GUI widgets have been aggregated into groups as per the connectedness and similarity principles. Some groups are close to each other and similar in terms of the number and layout of the contained widgets, and may further form a larger perceptual group even though they may contain different types of widgets. For example, in the clustering result in Figure 4, we can observe that the clusters nt-0, t-0, t-0-1 and nt-2 are proximate and have the same or similar numbers of widgets aligned in the same way. We notice this pattern, intentionally or subconsciously, and perceive them as one large group as a whole. Gestalt psychology states that when people see an assortment of objects, they tend to perceive objects that are close to each other as a group [9]. The close distance, also known as proximity, of elements is so powerful that it can override widget similarity and other factors that might differentiate a group of objects [43]. Thus, the next step pairs widget clusters into larger groups (i.e., layout blocks) based on their proximity and composition similarity.
If two groups Group_a and Group_b are next (proximate) to each other (i.e., no other groups in between), they contain the same number of widgets, and the widgets in Group_a and Group_b share the same orientation (vertical or horizontal), our approach combines Group_a and Group_b into a larger block. A widget in Group_a and its closest widget in Group_b are paired and form a subgroup of widgets. Our approach first combines two proximate groups containing the same type of widgets, and then groups containing different types of widgets. The formed larger block can be iteratively combined with further proximate groups until no more proximate groups are available.
Sometimes there are different numbers of widgets in two proximate groups, but the two groups may still form one larger perceptual block. For example, the cluster nt-2 in Figure 4 has one less widget than nt-0, t-0 and t-0-1 because the bottom widget in the right column is occluded by the floating action button and thus missed by the detector. Another common reason for the difference in widget numbers is that widgets in a group may be set as invisible in some situations, and thus they do not appear visually. Therefore, if the difference between the numbers of widgets in the two proximate groups is less than 4 (empirically determined from the ground-truth groups in our dataset), our approach also combines the two groups into a larger block.
As shown in the final groups in Figure 4, our approach identifies a set of perceptual groups (blocks), including the multitab at the top and the list in the main area. Each list item is a combined widget of some non-text and text widgets (highlighted in the same color). These perceptual groups encode the comprehension of the GUI structure into higher-order layout blocks that can be used in further processing and applications.
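A condensed, runnable sketch of the iterative pairing loop, representing each group as a dict with its widgets, orientation and bounding box; the fixed proximity gap and the reduction of "no other groups in between" to a distance test are simplifying assumptions, the widget-count tolerance of 4 follows the text, and the per-widget subgroup pairing is omitted for brevity:

```python
def box_gap(a, b):
    """Empty space between two blocks' bounding boxes (l, t, r, b); 0 if they touch."""
    horizontal = max(b[0] - a[2], a[0] - b[2])
    vertical = max(b[1] - a[3], a[1] - b[3])
    return max(0, horizontal, vertical)

def compatible(a, b, max_gap=40, count_tolerance=4):
    """Adjacent blocks with the same orientation and similar widget counts can merge."""
    return (box_gap(a["bbox"], b["bbox"]) <= max_gap
            and a["orientation"] == b["orientation"]
            and abs(len(a["widgets"]) - len(b["widgets"])) < count_tolerance)

def pair_groups(groups):
    """Iteratively merge compatible neighboring groups into larger layout blocks."""
    blocks = [dict(g) for g in groups]
    merged = True
    while merged:
        merged = False
        for i, a in enumerate(blocks):
            for j, b in enumerate(blocks):
                if i == j or not compatible(a, b):
                    continue
                a["widgets"] = a["widgets"] + b["widgets"]
                al, at, ar, ab = a["bbox"]
                bl, bt, br, bb = b["bbox"]
                a["bbox"] = (min(al, bl), min(at, bt), max(ar, br), max(ab, bb))
                del blocks[j]
                merged = True
                break
            if merged:
                break
    return blocks
```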
3.5 Continuity - Detection Error Correction
The GUI widget detector may make two types of detection errors: missed widgets and misclassified widgets. Missed widgets means that the detector fails to detect some GUI elements on the GUI (e.g., the bottom-right icon in Figure 5(a)). Misclassified widgets refers to widgets for which the detector reports the wrong type; for example, the top-right small icon (i.e., a non-text widget) in the middle card in Figure 5(b) is misclassified as a text widget due to an OCR error. It is hard to recognize and correct these detection errors from the individual widget perspective, but applying the Gestalt continuity principle to expose such widget detection errors by contrasting widgets in the same perceptual group can mitigate the issue. The continuity principle states that elements arranged in a line or curve are perceived to be more related than elements not in a line or curve [34]. Thus, some detection errors are likely to be spotted if a GUI area or a widget aligns in a line with all the widgets in a perceptual group but is not gathered into that group.

Figure 6: Examples of GUI widget detection and perceptual grouping results (red box - text widget, green box - non-text widget, pink box - perceptual group). Metadata-based means grouping the ground-truth widgets directly from the GUI metadata.

Our approach tries to identify and fix missed widgets as follows. It first inspects the subgroups of widgets in a perceptual group and checks if the widgets in the subgroups are consistent in terms of the number and relative position of the contained widgets. If a subgroup contains fewer widgets than its counterparts, then the approach locates the inconsistent regions by checking the relative positions and areas of the other subgroups' widgets. Next, the approach crops the located UI regions and runs the widget detector on the cropped regions with relaxed parameters (i.e., double the minimum area threshold for valid widgets) to try to identify the missed widget, if any. For example, the tiny icon at the bottom right in Figure 5(a) is missed because its area is so small that the detector regards it as a noisy region and hence discards it in the initial detection. By analyzing the resulting perceptual group and its composition, our approach finds that seven of the eight subgroups have two widgets (marked in the same color), while the subgroup at the bottom right has only one widget. It crops the area that may contain the missed widget according to the average sizes and relative positions of the two widgets in the other seven subgroups. The missed tiny icon can be recovered by detecting the widget in the missing area with the relaxed valid-widget minimum area threshold.
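A rough sketch of the missed-widget recovery, where detect_widgets stands in for re-running the non-text detector on a cropped region with a relaxed minimum-area threshold; the way the missing region is estimated from the complete sibling subgroups is a simplification of the procedure above:

```python
def recover_missed_widget(gui_image, incomplete, complete_subgroups, detect_widgets,
                          relaxed_min_area=50):
    """Estimate where the missing widget should appear by averaging, over the complete
    sibling subgroups, the position and size of the missing slot relative to each
    subgroup's first widget, then re-detect on that crop with relaxed parameters."""
    slot = len(incomplete)                     # index of the first missing widget
    dx, dy, widths, heights = [], [], [], []
    for sub in complete_subgroups:
        dx.append(sub[slot].left - sub[0].left)
        dy.append(sub[slot].top - sub[0].top)
        widths.append(sub[slot].right - sub[slot].left)
        heights.append(sub[slot].bottom - sub[slot].top)
    n = len(complete_subgroups)
    left = int(incomplete[0].left + sum(dx) / n)
    top = int(incomplete[0].top + sum(dy) / n)
    width, height = int(sum(widths) / n), int(sum(heights) / n)
    crop = gui_image[top:top + height, left:left + width]
    return detect_widgets(crop, min_area=relaxed_min_area)
```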
Our approach uses the same mechanism of contrasting the subgroups to identify misclassified widgets, but here it focuses on widget type consistency. As shown in Figure 5(b), our approach groups the three cards into a perceptual group. By contrasting the widgets in the three cards, it detects that the middle card has a text widget at the top-right corner, while the other two cards have a non-text widget at the same relative position. Based on the continuity principle, our approach re-classifies the top-right widget in the middle card as non-text with a majority-win strategy.
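A small sketch of the majority-win type correction, assuming the subgroups of a perceptual group hold their widgets in corresponding slot order and each widget carries the is_text flag; the slot-wise alignment is an assumption of this sketch:

```python
from collections import Counter

def correct_widget_types(subgroups):
    """For each slot shared by the subgroups, take the majority type (text vs.
    non-text) and re-label widgets that disagree with it, e.g., an icon that OCR
    mistakenly reported as text."""
    n_slots = min(len(s) for s in subgroups)
    for slot in range(n_slots):
        votes = Counter(s[slot].is_text for s in subgroups)
        majority_is_text = votes.most_common(1)[0][0]
        for s in subgroups:
            if s[slot].is_text != majority_is_text:
                s[slot].is_text = majority_is_text
    return subgroups
```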
4 EVALUATION
We evaluate our approach in two steps: (1) examine the accuracy of our enhanced version of UIED and compare it with the original UIED [18]; (2) examine the accuracy of our widget perceptual grouping approach and compare it with the state-of-the-art heuristic-based method Screen Recognition [51].
4.1 Accuracy of GUI Widget Detection
Compared with the original UIED [18], our GUI widget detector uses the latest Google OCR tool and improves the text and non-text
widget merging by container analysis. We evaluate GUI widget detection from three perspectives: text widget detection, non-text widget detection and the final widget results after merging. To be consistent with the evaluation setting in the UIED paper [18], we run experiments on the same Rico dataset of Android app GUIs [29] and regard a detected widget whose intersection over union (IoU) with the ground-truth widget is over 0.9 as a true positive. The ground-truth widgets are the leaf widgets (i.e., non-layout classes) extracted from the GUI's runtime view hierarchy.

Table 1: Overall results of widget detection (IoU > 0.9)

Type          Our Enhanced Revision            Original UIED
              Precision  Recall   F1           Precision  Recall   F1
Non-Text        0.589     0.823  0.687           0.431     0.469  0.449
Text            0.678     0.693  0.686           0.402     0.720  0.516
All Widgets     0.580     0.680  0.626           0.490     0.557  0.524

Table 1 shows the widget detection performance of the enhanced and the original UIED. Our enhanced version achieves a much higher recall (0.823) for non-text widgets than the original UIED (0.469), and meanwhile, it also improves the precision (0.589 over 0.431). This significant improvement is due to the more intelligent container-aware merging of text and non-text widgets by our enhanced version. As the original UIED is container-agnostic, it erroneously discards many valid widgets contained in other widgets as noise. For GUI text, the Google OCR tool used in the enhanced version achieves much higher precision (0.678) than the EAST model used in the original UIED (0.402), with a slight decrease in recall (0.693 versus 0.720). The improvements in both text and non-text widgets result in a much better overall performance (0.626 F1 by the enhanced version versus 0.524 by the original UIED).
4.2 Perceptual Grouping Performance
We evaluate our perceptual grouping approach on both Android app GUIs and UI design prototypes. Figure 6 shows some perceptual grouping results by our approach. These results show that our approach can reliably detect GUI widgets and infer perceptual groups for diverse visual and layout designs.
4.2.1 Datasets. Our approach simulates how humans perceive the GUI structure and segment a GUI into blocks of widgets according to the Gestalt principles of grouping. To validate the recognized blocks, we build the ground-truth dataset from two sources: Android apps and UI design prototypes. The ground truth annotates the widget groups according to the GUI layout and the widget styles and properties, as shown in Figure 7.
Android App GUI Dataset. The ground truth of widget groups can be obtained by examining the layout classes used to group other widgets in the implementations. However, as shown in Figure 2, the layout classes do not always correspond to the perceptual groups of GUI widgets. Therefore, we cannot use the GUI layout classes directly as the ground truth. Instead, we first search the GUIs in the Rico dataset of Android app GUIs [29] that use certain Android layout classes that may contain a group of widgets (e.g., ListView, FrameLayout, Card, TabLayout). Then we manually examine the candidate GUIs to filter out those whose use of layout classes obviously violates the Gestalt principles. Furthermore, the Rico dataset contains many highly similar GUI screenshots per application. To increase the visual and layout diversity of GUIs in our dataset, we keep at most three distinct GUI screenshots per application. Distinction is determined by the number and type of GUI widgets and the GUI structure. We obtain 1,091 GUI screenshots from 772 Android applications. Using this dataset, we evaluate both detection-based and metadata-based grouping. Detection-based grouping processes the detected widgets, while metadata-based grouping uses the widgets obtained from the GUI metadata (i.e., it assumes perfect widget detection).

Figure 7: Examples of an Android app GUI and a UI design prototype, their view hierarchy and ground truth
UI Design Prototypes. We collect 20 UI design prototypes shared on a popular UI design website (Figma [4]). These UI design prototypes are created by professional designers for various kinds of apps, and each has received more than 200 likes. This small set of UI design prototypes demonstrates how professional designers structure GUIs and group GUI widgets from the design rather than the implementation perspective. As a domain-independent tool, Figma supports only elementary visual elements (i.e., text, image and shape). Designers can create any widgets using these elementary visual elements. Due to the lack of explicit and uniform widget metadata in the Figma designs, we evaluate only the detection-based grouping on these UI design prototypes.
4.2.2 Metrics. The left part of Figure 7 shows an example in our Android app GUI dataset. We see that the layout classes (e.g., ListView, TabLayout) in the view hierarchy map to the corresponding perceptual groups. In our dataset, specific layout classes are generalized to blocks, as we only care about generic perceptual groups in this work. Following the work [15] on generating a GUI component hierarchy from a UI image, we adopt the sequence representation of a GUI component hierarchy. Through depth-first traversal, a view hierarchy can be transformed into a string of GUI widget names and brackets ("[]" and "()" corresponding to the blocks). This string represents the ground-truth perceptual groups of an app GUI. As discussed in Section 2.2, perceptual grouping is based on the widgets' positional and visual properties, while the actual classes of non-text widgets are not necessary. Thus, the ground-truth string only has two widget types: Text and Non-text. Specifically, it converts TextView in the view hierarchy to Text and all other classes to Non-text. Similarly, as shown in the right part of Figure 7, the designers organize texts and non-text widgets (images or shape compositions referred to as frames) into a hierarchy of groups. Based on the design's group hierarchy, we output the ground-truth string of perceptual groups. The perceptual groups "perceived" by our approach are output in the same format for comparison.
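To illustrate the sequence representation, here is a small sketch that serializes a group hierarchy into a bracketed string by depth-first traversal; the nested-list input format and the use of a single bracket type are illustrative assumptions:

```python
def hierarchy_to_string(node) -> str:
    """Depth-first serialization: a leaf is its widget type ('Text' or 'Non-text'),
    and an inner node (block) wraps its children's serialization in brackets."""
    if isinstance(node, str):
        return node
    return "[" + " ".join(hierarchy_to_string(child) for child in node) + "]"

# e.g., a block of two list items, each an image plus a caption:
# hierarchy_to_string([["Non-text", "Text"], ["Non-text", "Text"]])
# -> '[[Non-text Text] [Non-text Text]]'
```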
We compute the Levenshtein edit distance between the two strings of a ground-truth block and a perceived block. The Levenshtein edits inform us of the specific mismatches between the two blocks, which is important for understanding and analyzing grouping mistakes. If the edit distance between the ground-truth block and the perceived block is less than a threshold, we regard the two blocks as a candidate match. We determine the optimal matching between the string of ground-truth blocks and the string of perceived blocks by minimizing the overall edit distance among all candidate matches. If a perceived group matches a ground-truth group, it is a true positive (TP); otherwise it is a false positive (FP). If a ground-truth group does not match any perceived group, it is a false negative (FN). Based on the matching results, we compute: (1) precision (TP/(TP+FP)), (2) recall (TP/(TP+FN)), and (3) F1-score ((2*precision*recall) / (precision+recall)).
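A simplified sketch of the matching and scoring, using a greedy nearest-match assignment in place of the overall-distance-minimizing matching described above and treating the threshold as inclusive (so threshold 0 means a perfect match):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def grouping_scores(ground_truth_blocks, perceived_blocks, threshold=1):
    """Match each perceived block to the closest unmatched ground-truth block within
    the threshold, then compute precision, recall and F1 from the match counts."""
    unmatched = list(ground_truth_blocks)
    tp = 0
    for p in perceived_blocks:
        candidates = [(edit_distance(p, g), g) for g in unmatched]
        candidates = [(d, g) for d, g in candidates if d <= threshold]
        if candidates:
            _, best = min(candidates)
            unmatched.remove(best)
            tp += 1
    fp = len(perceived_blocks) - tp
    fn = len(unmatched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```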
Figure 8: Performance at different edit distance thresholds

4.2.3 Performance on Android App GUIs. We experiment with five edit distance thresholds (0-4). Distance 0 means the two blocks match perfectly, and distance 4 means that as long as no more than 4 widgets in the two blocks are unmatched, the two blocks can be regarded as a candidate match. As shown in Figure 8, for detection-based grouping, the precision, recall and F1-score are 0.437, 0.520 and 0.475 at distance 0. As the distance threshold increases (i.e., the matching criterion relaxes), the precision, recall and F1-score keep increasing, reaching 0.652, 0.776 and 0.709 at the distance threshold 4. As shown in Figure 8 and Table 2, the metadata-based grouping with the ground-truth GUI widgets achieves a noticeable improvement over the detection-based grouping with the detected widgets in terms of all three metrics, especially recall. This suggests that improving GUI widget detection will positively affect the subsequent perceptual grouping.
As the examples in Figure 6 show, our approach can not only accurately process GUIs with clear structures (e.g., the first row), but can also process GUIs with large numbers of widgets that are placed in a packed way (e.g., the second and third rows). Furthermore, our approach is fault-tolerant to GUI widget detection errors to a certain extent; see, for example, the second row of detection-based grouping for the screenshot and the design. The map and the pushed-aside partial GUI result in many inaccurately detected GUI widgets in these two cases. However, our approach still robustly recognizes the proper perceptual groups.

Table 2: Performance comparison (edit distance 1)

Widgets     Approach        #Block   Precision  Recall   F1
Metadata    Our Approach     1,465     0.607     0.754  0.672
            Screen Recog     1,038     0.131     0.116  0.123
Detection   Our Approach     1,260     0.546     0.650  0.593
            Screen Recog       992     0.103     0.083  0.092
We compare our approach with the heuristic-based grouping method (Screen Recognition) proposed by Zhang et al. [51] (which received a distinguished paper award at CHI 2021). The results in Table 2 show that Screen Recognition can hardly handle visually and structurally complicated GUIs based on a few ad-hoc and rigid heuristics. Its F1-score is only 0.092 on the detected widgets and 0.123 on the ground-truth widgets. This is because its heuristics are designed for only a few fixed grouping styles, such as cards and multi-tabs. In contrast, our approach is designed to fulfill the generic Gestalt principles of grouping.
We manually inspect the grouping results by our approach against the ground-truth groups to identify potential improvements. Figure 9 presents four typical cases that cause the perceived groups to be regarded as incorrect. For detection-based grouping, the major issue is GUI widget over-segmentation (a widget is detected as several widgets) or under-segmentation (several widgets are detected as one widget). In the first row, the detector segments the texts on the right side of the GUI into several text and non-text widgets. As indicated by the same color in the Grouping Result column, our approach still successfully partitions the widgets on the same row into a block, and recognizes the large group containing these row blocks. But as shown in the Group Comparison column, one widget in each of the second, third and fourth detected blocks does not match those in the corresponding ground-truth blocks. In the second row, the GUI widget detector merges close-by multi-line texts into a single text widget, while these text widgets are separate widgets in the ground truth. Again, our approach recognizes the overall perceptual groups correctly, but the widgets in the corresponding blocks do not completely match.
While using the ground-truth widgets from the GUI metadata mitigates GUI widget misdetection, the grouping results improve but suffer from two other problems. First, the widgets in the metadata include some widgets that are visually occluded or hidden. The third row in Figure 9 illustrates this problem, where some widgets are actually occluded behind the menu on the left, but they are still available in the runtime metadata and are extracted as ground-truth widgets. This results in a completely incorrect grouping. The issue of widget occlusion or modal windows could be mitigated as follows: train an image classifier to predict the presence of widget occlusion or a modal window, then follow the figure-ground principle [1] to separate the foreground modal window from the background, and finally detect the perceptual groups on the separated modal window. Second, alternative ways exist to partition the widgets into groups. For example, for the GUI in the fourth row, the ground truth contains eight blocks, each of which has one image and one text, while our grouping approach partitions these blocks into four rows of a large group, and each row contains two blocks (as indicated by the same color in Grouping Result). Perceptually, both ways are acceptable, but the group differences cause the grouping result by our approach to be regarded as incorrect.

Figure 9: Typical causes of grouping mistakes (red box - text widget, green box - non-text widget, pink box - perceptual group, red dashed box - unmatched ground-truth widget)
4.2.4 Performance on UI Design Prototypes. Tested on the 20 UI design prototypes, our approach achieves a precision of 0.750, a recall of 0.818 and an F1-score of 0.783. The third column in Figure 6 shows some results of our grouping approach on the UI design prototypes, where we see that it is able to infer the widget groups well for different styles of GUI designs. GUI widget detection is more accurate on UI design prototypes, which leads to the improvement of the subsequent grouping of the detected widgets. As shown in Figure 6, the widgets in a UI design prototype are usually scattered, while real app GUIs are packed. Both GUI widget detection and perceptual grouping become relatively easier on less packed GUIs.
4.2.5 Processing Time. As GUI widget grouping can be used as a part of various automation tasks such as automated testing, runtime performance can be a concern. We record the processing time while running our approach over the dataset to get a sense of its efficiency. Our experiments run on a machine with Windows 10, an Intel i7-7700HQ CPU, and 8GB memory. Our approach comprises two major steps: widget detection and perceptual grouping. We improved and refactored the original UIED to boost the runtime performance of the widget detection, and now it takes an average of 1.1s to detect the widgets in a GUI, which significantly outperforms the original UIED that takes 9s per GUI on average. The grouping process is also efficient, taking an average of 0.6s to process a GUI. In total, the average processing time of the entire approach is 1.7s per GUI image. Furthermore, as our approach does not involve any deep learning techniques, it does not require advanced computing support such as GPUs.
5 RELATED WORK
Our work falls into the area of reverse-engineering the hidden attributes of GUIs from pixels. There are two lines of closely related work: GUI widget detection and GUI-to-text/code generation.
GUI widget detection is a special case of object detection [24, 33, 42]. Earlier work [31] uses classic computer vision (CV) algorithms (e.g., Canny edge detection and contour analysis) to detect GUI widgets. Recently, White et al. [45] apply a popular object detection model, YOLOv2 [33], to detect GUI widgets in GUI images for random GUI testing. Feng et al. [14] apply Faster RCNN [24] to obtain GUI widgets from app screenshots and construct a searchable GUI widget gallery. A recent study by Xie et al. [18] shows that both classic CV algorithms and recent deep learning models have limitations when applied to GUI widget detection, which has different visual characteristics and detection goals from natural scene object detection. They design a hybrid method, UIED, inspired by the unique figure-ground [1] characteristic of GUIs, which achieves the state-of-the-art performance for GUI widget detection. Our approach boosts UIED's performance by container-aware widget merging and further recognizes perceptual groups of GUI widgets.
GUI-to-text/code generation also receives much attention. To improve GUI accessibility, Chen et al. [17] propose a transformer-based image captioning model for producing labels for icons. To implement a GUI view hierarchy, REMAUI [31] infers three Android-specific layouts (LinearLayout, FrameLayout and ListView) based on hand-crafted rules to group widgets. Recently, Screen Recognition [51] develops some heuristics for inferring tabs and bars. However, these heuristic-based widget grouping methods cannot handle visually and structurally complicated GUI designs (e.g., nested perceptual groups like a grid of cards). Alternatively, image captioning models [27, 44] have been used to generate GUI view hierarchies from GUI images [12, 15]. Although these image-captioning-based methods get rid of hard-coded heuristics, they suffer from GUI data availability and quality issues (as discussed in the Introduction and illustrated in Figure 2). These methods also suffer from code redundancy and the lack of explicit image-code traceability (see Section 6.2). The perceptual groups recognized by our approach could help to address these issues.
None of the existing GUI widget detection and GUI-to-code approaches solves the perceptual grouping problem in a systematic way as our approach does. ReDraw [30] and FaceOff [52] solve the layout problem by finding in the codebase the layouts containing similar GUI widgets. Some other methods rely on source code or specific layout algorithms (e.g., Android RelativeLayout) to synthesize modular GUI code or layouts [11, 13] or infer GUI duplication [47]. All these methods are GUI implementation-oriented and hard to generalize to other application scenarios such as UI design search, UI automation, robotic GUI testing or accessibility enhancement. In contrast, our approach is based on domain-independent Gestalt principles and is application-independent, so it can support different downstream SE tasks (see Section 6).
In the computer vision community, some machine learning techniques [26, 48, 50] have been proposed to predict structure in a visual scene, i.e., so-called scene graphs. These techniques can infer the relationships between objects detected in an image and describe these relationships as triplets (<subject, relation, object>). However, such relationship triplets cannot represent the complex GUI widget relations in perceptual groups. Furthermore, these techniques also require sufficient high-quality data for model training, which is a challenging issue for GUIs.
6 PERCEPTUAL GROUPING APPLICATIONS
Our perceptual grouping method fills an important gap in automatic UI understanding. Perceptual groups, together with elementary widget information, would open the door to some innovative applications in the software engineering domain.
6.1 UI Design Search
UI design is a highly creative activity. The proliferation of UI design data on the Internet enables data-driven methods to learn UI designs and obtain design inspirations [20, 29]. However, this demands effective UI design search engines. Existing methods often rely on GUI metadata, which limits their applicability, as most GUI designs exist only in pixel format. GalleryDC [14] builds a gallery of GUI widgets and infers elementary widget information (e.g., size, primary color) to help widget search. Unfortunately, this solution does not apply to whole, complex UIs. Chen et al. [16] and Rico [29] use an image autoencoder to extract image features through self-supervised learning, which can be used to find visually similar GUI images. However, the image autoencoder encodes only pixel-level features and is unaware of GUI structure, which is very critical for modeling and understanding GUIs. As such, given the GUI on the left of Figure 2, these autoencoder-based methods may return a GUI like the one on the right of Figure 2, because both GUIs have rich graphic features and some textural features. Unfortunately, such search results are meaningless, because the two GUIs bear no similarity in terms of GUI structure and perceptual groups of GUI widgets. Our approach can accurately infer perceptual groups of GUI widgets from pixels. Based on its perceptual grouping results, a UI design search would become structure-aware and find not only visually but also structurally similar GUIs. For example, a structure-aware UI design search would return a GUI like the one in the 2nd-row-1st-column of Figure 6 for the left GUI in Figure 2.
6.2 Modular GUI-to-Code Generation
Existing methods for GUI-to-code generation either use hand-crafted rules or specific layout algorithms to infer specific implementation layouts [13, 31], or assume the availability of a codebase in which to search for layout implementations [30, 52]. Image-captioning-based GUI-to-code methods [12, 15] are more flexible as they learn how to generate a GUI view hierarchy from GUI metadata (if available). However, the nature of image captioning is just to describe the image content; it is completely unaware of GUI structure during code generation. As such, the generated GUI code is highly redundant for repetitive GUI blocks. For example, for the card-based GUI design in Figure 1(a), it will generate eight pieces of repetitive code, one for each of the eight cards. This type of generated code is nothing like the modular GUI code developers write, so it has little practicality. Another significant limitation of image captioning is that the generated GUI layouts and widgets have no connection to the corresponding parts of the GUI image. For a GUI with many widgets (e.g., those in the 2nd and 3rd rows in Figure 6), it would be hard to understand how the generated code implements the GUI. With the support of our perceptual grouping, GUI-to-code generation can incorporate the widget grouping information into the code generation process and produce much less redundant and more modular, reusable GUI code (e.g., an extensible card component).
6.3 UI Automation
Automating UI understanding from pixels can support many UI automation tasks. A particular application of UI automation in software engineering is automatic GUI testing. Most existing methods for automatic GUI testing rely on OS or debugging infrastructure [5, 10, 25, 36]. In recent years, computer vision methods have also been used to support non-intrusive GUI testing [32, 45]. However, these methods only work at the GUI widget level, through either traditional widget detection [31] or deep learning models like YOLO [33]. Furthermore, they only support random testing, i.e., random interactions with some widgets. Some studies [19, 49] show that GUI testing would be more effective if the testing methods were aware of more likely interactions. They propose deep learning methods to predict such likely interactions. However, the learning is a complete black box. That is, they can predict where on the GUI some actions could be applied, but they do not know what will be operated and why. Our approach can inform the learning with higher-order perceptual groups of GUI widgets so that the model could make an explainable prediction, for example, that scrolling is appropriate because this part of the GUI displays a list of repetitive blocks. It may also guide the testing methods to interact with the blocks in a perceptual group in an orderly manner, and ensure all blocks are tested without unnecessary repetition. Such support for UI automation would also enhance the effectiveness of screen readers, which currently rely heavily on accessibility metadata and use mostly elementary widget information.
7 CONCLUSION AND FUTURE WORK
This paper presents a novel approach for recognizing perceptual groups of GUI widgets in GUI images. The approach is designed around the four psychological principles of grouping: connectedness, similarity, proximity and continuity. To the best of our knowledge, this is the first unsupervised, automatic UI understanding approach with a systematic theoretical foundation, rather than relying on ad-hoc heuristics or model training with GUI metadata. Through the evaluation on both mobile app GUIs and UI design prototypes, we confirm the high accuracy of our perceptual grouping method for visually and structurally diverse GUIs. Our approach fills the gap in visual intelligence between current widget-level detection and whole-UI level GUI-to-code generation. As a pixel-only and application-independent approach, we envision that it could enhance many downstream software engineering tasks with visual understanding of GUI structure and perceptual groups, such as structure-aware UI design search, modular and reusable GUI-to-code generation, and layout-sensitive UI automation for GUI testing and screen readers. Although our current approach achieves very promising performance, it can be further improved by dealing with widget occlusion and modal windows. Moreover, we will investigate semantic grouping, which aims to recognize both the interaction and content semantics of perceptual groups. Semantic grouping will provide a deeper understanding of UIs for many downstream tasks, such as the detection and analysis of deceptive UI dark patterns. We will also involve GUI designers and developers in the evaluation of our perceptual grouping method in the downstream applications it supports.
REFERENCES
[1] [n.d.]. 7 Gestalt Principles of Visual Perception: Cognitive Psychology for UX: UserTesting Blog. https://www.usertesting.com/blog/gestalt-principles#figure
[2] [n.d.]. Free website builder: Create a free website. http://www.wix.com/
[3] [n.d.]. Get started on Android with TalkBack - Android Accessibility Help. https://support.google.com/accessibility/android/answer/6283677?hl=en
[4] [n.d.]. The collaborative interface design tool. https://www.figma.com/
[5] [n.d.]. UI/Application Exerciser Monkey: Android Developers. https://developer.android.com/studio/test/monkey#:~:text=TheMonkeyisaprogram,arandomyetrepeatablemanner.
[6] [n.d.]. Vision AI | Derive Image Insights via ML | Cloud Vision API. https://cloud.google.com/vision/
[7] 2006. "Gestalt psychology". Britannica Concise Encyclopedia. Britannica Digital Learning.
[8] 2021. Accessibility - Vision. https://www.apple.com/accessibility/vision/
[9] 2021. Gestalt psychology. https://en.wikipedia.org/wiki/Gestalt_psychology#cite_note-1
[10] 2021. UIAutomator. https://developer.android.com/training/testing/ui-automator
[11] Mohammad Bajammal, Davood Mazinanian, and Ali Mesbah. 2018. Generating Reusable Web Components from Mockups. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (Montpellier, France) (ASE 2018). New York, NY, USA, 601–611. https://doi.org/10.1145/3238147.3238194
[12] Tony Beltramelli. 2018. pix2code: Generating code from a graphical user interface screenshot. In Proceedings of the ACM SIGCHI Symposium on Engineering Interactive Computing Systems. 1–6.
[13] Pavol Bielik, Marc Fischer, and Martin Vechev. 2018. Robust Relational Layout Synthesis from Examples for Android. Proc. ACM Program. Lang. 2, OOPSLA, Article 156 (Oct. 2018), 29 pages. https://doi.org/10.1145/3276526
[14] Chunyang Chen, Sidong Feng, Zhenchang Xing, Linda Liu, Shengdong Zhao, and Jinshui Wang. 2019. Gallery D.C.: Design Search and Knowledge Discovery through Auto-Created GUI Component Gallery. Proc. ACM Hum.-Comput. Interact. 3, CSCW, Article 180 (Nov. 2019), 22 pages. https://doi.org/10.1145/3359282
[15] Chunyang Chen, Ting Su, Guozhu Meng, Zhenchang Xing, and Yang Liu. 2018. From UI Design Image to GUI Skeleton: A Neural Machine Translator to Bootstrap Mobile GUI Implementation. In Proceedings of the 40th International Conference on Software Engineering (Gothenburg, Sweden) (ICSE '18). Association for Computing Machinery, New York, NY, USA, 665–676. https://doi.org/10.1145/3180155.3180240
[16] Jieshan Chen, Chunyang Chen, Zhenchang Xing, Xin Xia, Liming Zhu, John Grundy, and Jinshui Wang. 2020. Wireframe-based UI Design Search through Image Autoencoder. ACM Transactions on Software Engineering and Methodology 29, 3 (Jul 2020), 1–31. https://doi.org/10.1145/3391613
[17] Jieshan Chen, Chunyang Chen, Zhenchang Xing, Xiwei Xu, Liming Zhu, Guoqiang Li, and Jinshui Wang. 2020. Unblind Your Apps: Predicting Natural-Language Labels for Mobile GUI Components by Deep Learning. In 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE). 322–334.
[18] Jieshan Chen, Mulong Xie, Zhenchang Xing, Chunyang Chen, Xiwei Xu, Liming Zhu, and Guoqiang Li. 2020. Object detection for graphical user interface: old fashioned or deep learning or a combination? Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Nov 2020). https://doi.org/10.1145/3368089.3409691
[19] Christian Degott, Nataniel P. Borges Jr., and Andreas Zeller. 2019. Learning User Interface Element Interactions. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis (Beijing, China) (ISSTA 2019). Association for Computing Machinery, New York, NY, USA, 296–306. https://doi.org/10.1145/3293882.3330569
[20] Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. 2017. Rico: A Mobile App Dataset for Building Data-Driven Design Applications. In Proceedings of the 30th Annual Symposium on User Interface Software and Technology (UIST '17).
[21] Dribbble. [n.d.]. Discover the world's Top Designers & Creatives. https://dribbble.com/
[22] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (Portland, Oregon) (KDD'96). AAAI Press, 226–231.
[23] Michael W. Eysenck and Marc Brysbaert. 2018. Fundamentals of Cognition. (2018). https://doi.org/10.4324/9781315617633
[24] Ross Girshick. 2015. Fast R-CNN. In The IEEE International Conference on Computer Vision (ICCV).
[25] Jiaqi Guo, Shuyue Li, Jian-Guang Lou, Zijiang Yang, and Ting Liu. 2019. SARA: self-replay augmented record and replay for Android in industrial cases. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis. 90–100.
[26] Boris Knyazev, Harm de Vries, Cătălina Cangea, Graham W. Taylor, Aaron Courville, and Eugene Belilovsky. 2020. Graph Density-Aware Losses for Novel Compositions in Scene Graph Generation. arXiv:2005.08230 [cs.CV]
[27] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436–444. https://doi.org/10.1038/nature14539
[28] Yuanchun Li, Ziyue Yang, Yao Guo, and Xiangqun Chen. 2019. Humanoid: A Deep Learning-Based Approach to Automated Black-box Android App Testing. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). 1070–1073. https://doi.org/10.1109/ASE.2019.00104
[29] Thomas F. Liu, Mark Craft, Jason Situ, Ersin Yumer, Radomir Mech, and Ranjitha Kumar. 2018. Learning Design Semantics for Mobile Apps. In The 31st Annual ACM Symposium on User Interface Software and Technology (Berlin, Germany) (UIST '18). ACM, New York, NY, USA, 569–579. https://doi.org/10.1145/3242587.3242650
[30] Kevin Moran, Carlos Bernal-Cárdenas, Michael Curcio, Richard Bonett, and Denys Poshyvanyk. 2020. Machine Learning-Based Prototyping of Graphical User Interfaces for Mobile Apps. IEEE Transactions on Software Engineering 46, 2 (2020), 196–221. https://doi.org/10.1109/TSE.2018.2844788
[31] Tuan Anh Nguyen and Christoph Csallner. 2015. Reverse Engineering Mobile Application User Interfaces with REMAUI. In Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering (Lincoln, Nebraska) (ASE '15). IEEE Press, 248–259. https://doi.org/10.1109/ASE.2015.32
[32] Ju Qian, Zhengyu Shang, Shuoyan Yan, Yan Wang, and Lin Chen. 2020. RoScript: A Visual Script Driven Truly Non-Intrusive Robotic Testing System for Touch Screen Applications. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (Seoul, South Korea) (ICSE '20). Association for Computing Machinery, New York, NY, USA, 297–308. https://doi.org/10.1145/3377811.3380431
[33] Joseph Redmon and Ali Farhadi. 2018. YOLOv3: An Incremental Improvement. CoRR abs/1804.02767 (2018). arXiv:1804.02767 http://arxiv.org/abs/1804.02767
[34] Andy Rutledge. 2009. Gestalt Principles of Perception - 3: Proximity, Uniform Connectedness, and Good Continuation. http://andyrutledge.com/gestalt-principles-3.html
[35] S1T2. 2021. Apply CRAP to design: S1T2 blog. https://s1t2.com/blog/step-1-generously-apply-crap-to-design
[36] Onur Sahin, Assel Aliyeva, Hariharan Mathavan, Ayse Coskun, and Manuel Egele. 2019. RandR: Record and replay for Android applications via targeted runtime instrumentation. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 128–138.
[37] Sketch. [n.d.]. https://www.sketch.com/
[38] Robert Sternberg. 2003. Cognitive Psychology, Third Edition. Thomson Wadsworth.
[39] Herb Stevenson. [n.d.]. Emergence: The Gestalt Approach to Change. http://www.clevelandconsultinggroup.com/articles/emergence-gestalt-approach-to-change.php
[40] Tesseract-OCR. [n.d.]. tesseract-ocr/tesseract: Tesseract Open Source OCR Engine (main repository). https://github.com/tesseract-ocr/tesseract
[41] Thalion. 2020. UI Design in practice: Gestalt principles. https://uxmist.com/2019/04/23/ui-design-in-practice-gestalt-principles/
[42] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. 2013. Selective Search for Object Recognition. International Journal of Computer Vision 104, 2 (01 Sep 2013), 154–171. https://doi.org/10.1007/s11263-013-0620-5
[43] UserTesting. 2019. 7 Gestalt Principles of Visual Perception: Cognitive Psychology for UX: UserTesting Blog. https://www.usertesting.com/blog/gestalt-principles#proximity
[44] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and Tell: A Neural Image Caption Generator. arXiv:1411.4555 [cs.CV]
[45] Thomas D. White, Gordon Fraser, and Guy J. Brown. 2019. Improving Random GUI Testing with Image-based Widget Detection. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis (Beijing, China) (ISSTA 2019). ACM, New York, NY, USA, 307–317. https://doi.org/10.1145/3293882.3330551
[46] Mulong Xie, Sidong Feng, Zhenchang Xing, Jieshan Chen, and Chunyang Chen. 2020. UIED: a hybrid tool for GUI element detection. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1655–1659.
[47] Rahulkrishna Yandrapally, Andrea Stocco, and Ali Mesbah. 2020. Near-Duplicate Detection in Web App Model Inference. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (Seoul, South Korea) (ICSE '20). Association for Computing Machinery, New York, NY, USA, 186–197. https://doi.org/10.1145/3377811.3380416
[48] Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, and Devi Parikh. 2018. Graph R-CNN for Scene Graph Generation. arXiv:1808.00191 [cs.CV]
[49] YazdaniBanafsheDaragh. [n.d.]. Deep-GUI: Towards Platform-Independent UI Input Generation with Deep Reinforcement Learning. UC Irvine ([n.d.]). https://escholarship.org/uc/item/3kv1n3qk
[50] Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi. 2018. Neural Motifs: Scene Graph Parsing with Global Context. arXiv:1711.06640 [cs.CV]
[51] Xiaoyi Zhang, Lilian de Greef, Amanda Swearngin, Samuel White, Kyle Murray, Lisa Yu, Qi Shan, Jeffrey Nichols, Jason Wu, Chris Fleizach, Aaron Everitt, and Jeffrey P. Bigham. 2021. Screen Recognition: Creating Accessibility Metadata for Mobile Applications from Pixels. arXiv:2101.04893 [cs.HC]
[52] Shuyu Zheng, Ziniu Hu, and Yun Ma. 2019. FaceOff: Assisting the Manifestation Design of Web Graphical User Interface. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (Melbourne VIC, Australia) (WSDM '19). Association for Computing Machinery, New York, NY, USA, 774–777. https://doi.org/10.1145/3289600.3290610
[53] Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. 2017. EAST: an efficient and accurate scene text detector. 5551–5560.