UIED: A Hybrid Tool for GUI Element Detection
Mulong Xie
Australian National University
Canberra, Australia
mulong.xie@anu.edu.au
Sidong Feng
Australian National University
Canberra, Australia
u6063820@anu.edu.au
Zhenchang Xing
Australian National University
Canberra, Australia
Zhenchang.Xing@anu.edu.au
Jieshan Chen
Australian National University
Canberra, Australia
Jieshan.Chen@anu.edu.au
Chunyang Chen
Monash University
Melbourne, Australia
Chunyang.Chen@monash.edu
ABSTRACT
Graphical User Interface (GUI) element detection is critical for many GUI automation and GUI testing tasks. Acquiring the accurate positions and classes of GUI elements is also the very first step to conduct GUI reverse engineering or perform GUI testing. In this paper, we implement User Interface Element Detection (UIED), a toolkit designed to provide users with a simple and easy-to-use platform to achieve accurate GUI element detection. UIED integrates multiple detection methods, including old-fashioned computer vision (CV) approaches and deep learning models, to handle diverse and complicated GUI images. Besides, it is equipped with a novel customized GUI element detection method to produce state-of-the-art detection results. Our tool enables the user to change and edit the detection result in an interactive dashboard. Finally, it exports the detected UI elements in the GUI image to design files that can be further edited in popular UI design tools such as Sketch and Photoshop. UIED is evaluated to be capable of accurate detection and useful for downstream works.
Tool URL: http://uied.online
Github Link: https://github.com/MulongXie/UIED
CCS CONCEPTS
• Software and its engineering → Software development techniques; • Human-centered computing → Graphical user interfaces.
KEYWORDS
Object Detection, User Interface, Deep Learning, Computer Vision
ACM Reference Format:
Mulong Xie, Sidong Feng, Zhenchang Xing, Jieshan Chen, and Chunyang Chen. 2020. UIED: A Hybrid Tool for GUI Element Detection. In Proceedings of the 28th ACM Joint European Software Engineering Conference and
Symposium on the Foundations of Software Engineering (ESEC/FSE '20), November 8–13, 2020, Virtual Event, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3368089.3417940
1 INTRODUCTION
GUI offers a visualized way to display information and to interact with the software application through graphical UI elements, such as widgets, images and texts. The development of GUI is critical and laborious. It involves many repetitive and time-consuming tasks, such as GUI code implementation and GUI testing. A large body of research on GUI automation [3, 17, 18, 31, 33] and GUI testing [21, 30] aims to facilitate the development process and relieve the pains of developers. The foundation of these tasks is to identify GUI elements. There are two practices used to recognize elements in GUI: instrumentation-based methods [2, 14, 19] and image-based methods. Instrumentation-based approaches are based on intrusive scripts and require accessibility of the back-end program [4, 12]. However, they require considerable effort to write scripts and are hard to use when the back-end code is unavailable [11, 16]. On the contrary, image-based methods are more generic and less intrusive as they only require the GUI image to detect GUI elements [18, 21, 32]. But, to our best knowledge, there is no effective off-the-shelf GUI element detection tool that users can use without any extra work. Therefore, we developed an interactive web-based computer vision toolkit, User Interface Element Detection (UIED), which provides quick detection and easy management of GUI elements from a GUI image.
UIED is a user-friendly web application where users can upload their own GUI images and receive accurate GUI element detection results. Detecting GUI elements from a GUI image resembles the object detection task in natural scenes. The process involves detecting the presence and spatial location of certain targets against the natural background in a digital image or video, and then classifying the detected objects. Similarly, in our case of GUI element detection, the purpose is to identify and extract GUI widgets, images and text from the GUI image, which can either be a screenshot or a design drawing. We implement 5 of the latest state-of-the-art methods in UIED, including 2 old-fashioned computer vision methods (Xianyu [32], REMAUI [18]) and 3 deep learning methods (Faster-RCNN [23], Yolo v3 [22], CenterNet [10]). However, GUI elements have large in-class variance and high cross-class similarity, while GUI designs are packed scenes with close-by elements and a mix of heterogeneous objects [7]. These characteristics make it inadequate to apply the aforementioned methods straightforwardly to perform accurate
detection. Therefore, we design a novel GUI-specific element detection approach based on old-fashioned computer vision methods. The method can be divided into two parts: non-text elements and text elements. First, for the non-text elements, we leverage and innovate a set of image processing algorithms (e.g., flood-fill [29], connected component labelling [24]) to extract them and then classify them using a ResNet50 classifier [13]. In consideration of GUIs' distinct boundaries, shapes, textures and layouts, we adopt a top-down coarse-to-fine detection strategy, compared to the bottom-up edge/contour aggregation strategy in existing methods [18, 32]. Second, for the text elements, we apply a state-of-the-art deep learning scene text model, EAST [34]. By the synergy of our novel old-fashioned processes and a mature deep learning classifier, our method achieves state-of-the-art performance in GUI element detection.
UIED provides an interactive dashboard that allows users to edit the result, such as dragging and dropping an element to change its location, adjusting an element's shape and size, removing elements, etc. The tool collects all detected elements as a set of UI kits in which they can be reused later. After a series of edits, UIED allows users to export the result, including the edited GUI and the corresponding element information (e.g., position, size, class, etc.). The exported results can be further developed in various works, such as UI2CODE applications [17, 18, 33] that aim to automate GUI development by generating corresponding code from the GUI image directly, and GUI testing [21, 30].
This paper makes the following contributions:
• We implement 5 existing detection approaches and our GUI-specific detection method to acquire elements from GUI images.
• We develop an interactive web application, UIED, that allows users to manage GUI elements easily and produces reusable detection results for further development.
• An informative investigation among professionals proving the value of an accurate GUI element detection approach.
2 APPLIED DETECTION METHODS
We applied both existing old-fashioned computer vision based methods and deep learning models retrained on mobile GUI images in UIED. Old-fashioned computer vision based methods process images pixel by pixel without using machine learning techniques. They are easy to deploy and adjust, as no time-consuming training is required. We apply 2 existing GUI detection methods, Xianyu [32] and REMAUI [18]. On the other hand, deep learning has achieved remarkable success in object detection research and is able to predict results fast; we hence retrain 3 representative and state-of-the-art approaches, Faster-RCNN [23] (two-stage), Yolo v3 [22] (one-stage) and CenterNet [10] (anchor-free), and deploy them in UIED. Note that to perform GUI element detection, we retrain the deep learning models on a large mobile app screenshot dataset, Rico [8].
REMAUI: It is a GUI reverse engineering work that converts a mobile GUI image into code, leveraging the off-the-shelf image processing algorithms from the OpenCV [27] library. For detecting non-text GUI elements, it adopts a bottom-up strategy in which it first uses Canny edge detection [5] to acquire primitive shapes and regions of image content (e.g., edges, contours) and then aggregates them into objects progressively. It applies a simple optical character recognition (OCR) tool, Tesseract [25], to detect GUI text.
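The bottom-up idea can be sketched in a few lines of OpenCV; this is an illustrative approximation rather than REMAUI's actual code, and the input file name and thresholds are assumptions:

```python
import cv2

# Bottom-up sketch: detect edges, then aggregate them into candidate regions.
img = cv2.imread('gui.png')  # hypothetical input screenshot
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 50, 150)                 # primitive edge map
edges = cv2.dilate(edges, None, iterations=2)    # merge nearby edge fragments
contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
# Keep contours large enough to be plausible GUI elements and box them.
boxes = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 50]
for x, y, w, h in boxes:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 1)
```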
Xianyu: This is another GUI reverse engineering work, developed by Alibaba, to synthesize code from GUI images. It adopts a similar idea to REMAUI. To improve non-text element detection, it leverages the flood fill algorithm [29] to identify connected regions and filter out the noise from the complex background, combined with recursive horizontal/vertical slicing to obtain the GUI elements. Tesseract is also used to detect GUI text.
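A minimal sketch of the flood-fill idea (illustrative only, not Xianyu's implementation; the input file, seed point and colour tolerances are assumptions):

```python
import cv2
import numpy as np

img = cv2.imread('gui.png')                 # hypothetical input screenshot
h, w = img.shape[:2]
mask = np.zeros((h + 2, w + 2), np.uint8)   # floodFill requires a padded mask

# Flood from a corner assumed to be background; similarly coloured connected
# pixels are marked in the mask, leaving foreground regions unmarked.
cv2.floodFill(img, mask, (0, 0), (255, 255, 255), (5, 5, 5), (5, 5, 5))

background = mask[1:-1, 1:-1]               # 1 where background was filled
foreground = (background == 0).astype(np.uint8) * 255  # candidate element pixels
```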
Faster-RCNN: It is a classic "two-stage" method that involves two steps: detection and classification. It first generates a set of region proposals that are likely to contain objects using a region proposal network (RPN). The RPN uses a set of user-defined anchor boxes with different aspect ratios and computes an objectness score to determine whether a box contains an object. It regresses the anchor boxes to predict the object's bounding box. Then it uses a CNN-based image classifier to categorize the detected objects.
Yolo v3: Unlike Faster-RCNN, YOLO performs region regression and object classification at once. It determines the anchor box aspect ratios automatically by clustering the ground truth in the training dataset. It generates a gridded feature map through a CNN and produces a set of bounding boxes for each grid cell. YOLO then computes the objectness scores, regresses the box coordinates and classifies the object in the bounding box at the same time.
CenterNet: Both Yolo and Faster-RCNN depend on anchor boxes to detect targets, and their performance is affected by the aspect ratios of these anchor boxes. They are also ineffective on objects with various shapes that cannot fit into these boxes. To address these limitations, CenterNet uses an anchor-free technique. It is a one-stage detection model that predicts the positions of the top-left and bottom-right corners and the centre of an object.
3 OUR HYBRID APPROACH
In consideration of the characteristics of GUI images, we propose a novel GUI-specific element detection approach. Our method divides the detection task into two parts: non-text element detection and text detection. We leverage old-fashioned computer vision algorithms for non-text region extraction, and deep learning models to perform classification and text detection. This synergy reduces the disturbance of text when detecting non-text elements, and achieves state-of-the-art performance for GUI element detection. Figure 1 shows the process of our approach.
Non-Text GUI Element Detection
Unlike deep learning models that utilize statistical regression to predict approximate bounding boxes, old-fashioned computer vision methods can detect the position and shape of objects more accurately due to their pixel-level image processing. But the existing old-fashioned methods usually adopt a bottom-up strategy that aggregates fine details (e.g., edges or contours) into objects. Such an idea suffers from the noise of trivial image content and tends to over-segment GUI elements, especially when the GUI has a complex background. Therefore, we propose a top-down coarse-to-fine approach based on old-fashioned computer vision techniques for non-text GUI element detection. The process involves 3 steps. First, the approach detects layout blocks through the flood-filling algorithm [29] combined with Sklansky's algorithm [26] to acquire the blocks' outer boundaries, and produces a block map, as shown in Figure 1(c), where different colour regions stand for potentially different layout blocks.
Figure 1: The overall process of our approach, where (a) is the input GUI, (b) is the text detection result by EAST, (c), (d), (e) are the non-text element detection steps, and (f) is the merged final result.
Then we use a shape recognition algorithm [20] to select rectangular regions and count them as GUI layout blocks. Second, the method generates a binary map by a simple but efficient binarization method based on the gradient map of the input GUI. If a pixel's gradient with respect to its neighbours is small, it is regarded as a background point and coloured black, otherwise white. Then, we segment the binary map into block segments based on the previously detected blocks, as shown in Figure 1(d). In each block's binary map, we detect GUI elements by the connected component labelling algorithm [24] and use Sklansky's algorithm again to determine the elements' boundaries. Third, we train a ResNet50 [13] classifier on 90,000 GUI element instances with 15 categories to classify the extracted elements.
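The following sketch illustrates the gradient-based binarization and connected-component steps with OpenCV, under assumed thresholds; it is a minimal approximation, not the tool's exact implementation:

```python
import cv2
import numpy as np

img = cv2.imread('gui.png')  # hypothetical input screenshot
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.float32)

# Step 2a: gradient-based binarization -- pixels whose gradient with respect
# to their neighbours is small are background (black), the rest foreground.
gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
grad = cv2.magnitude(gx, gy)
binary = (grad > 8).astype(np.uint8) * 255   # assumed gradient threshold

# Step 2b: connected component labelling groups foreground pixels into
# candidate GUI elements; a convex hull (Sklansky) gives each boundary.
num, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
elements = []
for i in range(1, num):                      # label 0 is the background
    x, y, w, h, area = stats[i]
    if area < 30:                            # drop trivial noise components
        continue
    pts = np.column_stack(np.where(labels == i))[:, ::-1].astype(np.int32)
    hull = cv2.convexHull(pts)               # Sklansky-style convex boundary
    elements.append({'bbox': (int(x), int(y), int(w), int(h)), 'hull': hull})

# Step 3 (not shown): crop each bounding box and classify it with a ResNet50
# trained on the 15 GUI element categories.
```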
GUI Text Element Detection
We treat GUI text as scene text and apply the state-of-the-art deep learning scene text detector EAST [34] to detect text in the GUI image. It first feeds the input image into a feature pyramid network [15] and then computes six values for each point based on the final feature map to detect text (objectness score, top/left/bottom/right offsets and rotation angle).
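A common way to deploy EAST is through OpenCV's DNN module with a pretrained frozen graph; the sketch below is illustrative, and the model file path is an assumption since the paper does not state how UIED loads the detector:

```python
import cv2

net = cv2.dnn.readNet('frozen_east_text_detection.pb')  # assumed pretrained model file
img = cv2.imread('gui.png')
H, W = 640, 640                                          # EAST expects multiples of 32
blob = cv2.dnn.blobFromImage(img, 1.0, (W, H), (123.68, 116.78, 103.94),
                             swapRB=True, crop=False)
net.setInput(blob)
scores, geometry = net.forward(['feature_fusion/Conv_7/Sigmoid',
                                'feature_fusion/concat_3'])
# 'scores' holds the text/non-text confidence per feature-map location;
# 'geometry' holds the top/left/bottom/right offsets and rotation angle used
# to decode rotated text boxes (decoding and non-maximum suppression omitted).
```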
4 WEB IMPLEMENTATION
The UIED toolkit is a web application that provides the user with a convenient tool to detect and manage GUI elements in GUI images. It can export the detection results so that they can be further used in other applications such as GUI testing and GUI automation. In UIED, we integrate all of the previously mentioned approaches, including old-fashioned computer vision and deep learning methods. This tool also offers an interactive dashboard where the user can edit and manage the detection result. We implement Xianyu [32] and our own approach in OpenCV [27] and customize the deep learning models in TensorFlow [1] and PyTorch [28]. There are two major parts of UIED: the landing page and the dashboard.
Landing Page
Figure 2(a) shows an illustration of our landing page, which displays the basic information and usage of UIED. Users are able to input a GUI image to be processed. They can either select example GUIs we provide to check the effect of detection and experience the basic usage of the dashboard, or upload their own GUI to detect GUI elements. Furthermore, for our method, we allow the user to change some key parameters via slide bars to adjust the detection result. To facilitate image transmission to the server, we adopt a serialized structured data method, Google's Protocol Buffers [9], which encodes the image into buffer bytes.
Dashboard
UIED disassembles the input GUI into draggable GUI elements according to the detection result and displays them on the dashboard (Figure 2(b)). In the dashboard, we implement several functionalities to provide a more user-friendly interaction experience, including:
Drag & Drop: The user can adjust the position of a GUI element to manually correct the detection result by dragging the element and dropping it somewhere else.
Attributes Management: When clicking a GUI element, the user can easily access the attributes of the element (e.g., type, width, height, left, top). We provide users with the ability to quickly edit the element, such as applying a new size and precise position, deleting existing elements and withdrawing changes.
UI Element Kit: A UI kit is the most efficient and profitable way to build a GUI rapidly [6]. Therefore, we store all the detected elements in the dataset for the user to reuse. If the user further processes the input GUI with another method, the new GUI elements will also be added to the UI kits.
Detection Result Export: After adjustment, the user can export the GUI element information in the compound image, including the position, size and class of the elements. This information is stored in a JSON file and is easy to use in further applications.
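An illustrative example of such an export is shown below; the exact field names in UIED's JSON output may differ from this sketch:

```python
import json

# Illustrative exported detection result: image size plus one entry per
# detected element with its class and bounding box (field names assumed).
result = {
    "img_shape": [800, 1440],
    "compos": [
        {"class": "Button", "position": {"left": 36, "top": 1210,
                                         "width": 648, "height": 96}},
        {"class": "Text",   "position": {"left": 60, "top": 420,
                                         "width": 300, "height": 40}},
    ],
}
with open('gui_elements.json', 'w') as f:
    json.dump(result, f, indent=2)
```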
Figure 2: Illustration of our UIED web application. (a) Landing page; (b) Dashboard.
5 EVALUATION
The goal of our study is to evaluate the usefulness of UIED in terms of (i) its effectiveness in detecting elements and (ii) its usability for downstream tasks.
Eectiveness Measurement
: Regarding the eectiveness of
element detection in multiple approaches, we conduct experiments
on 5k Android mobile GUIs collected from Rico [
8
]. This part is also
published in our previous work [
7
] where we present a more de-
tailed analysis. The main evaluation metrics we used is the F1-score
which is interpreted as a weighted average of the precision and re-
call. Note that the measurements are evaluated on IoU > 0.9, where
the IoU is the intersection area over union area of the detected
bounding box and the ground-truth box. We further measure the
cost of time to show the eciency of each approach. Table 1shows
the performance of all approaches. The existing old-fashioned detec-
tion methods perform poorly (REMAUI F1=0.183, Xianyu F1=0.106)
and the deep learning models gain better performance (Faster RCNN
F1=0.271, YOLOv3 F1=0.249, CenterNet F1=0.282). Our approach
achieves state-of-the-art performance (F1=0.524). Note that part
of the reason why the score is not as high as expected is that the
dataset itself is not perfectly precise. Combining the multiple mod-
els and the ability to manual adjustment, UIED is able to produce
more accurate detection result as shown in Figure 2(a)
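For clarity, the two metrics can be computed as in the following sketch (standard definitions, not tied to the evaluation scripts used in the paper):

```python
# IoU between a detected box and a ground-truth box (boxes are
# (left, top, right, bottom)); a detection counts as a true positive here
# when IoU > 0.9, following the paper's setting.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# F1-score from the number of matched detections.
def f1_score(true_positive, num_detected, num_ground_truth):
    precision = true_positive / num_detected if num_detected else 0.0
    recall = true_positive / num_ground_truth if num_ground_truth else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```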
Usability Measurement: Regarding the user experience of UIED, we conduct a survey of 10 professional developers and researchers who work on GUI-related tasks. They are asked to use UIED and questioned about the tool's usefulness for their work, as well as the future potential and extensions of the tool.
Among those professionals, three are working on GUI reverse engineering research that synthesizes GUI code from GUI images (UI2Code). All of them indicated the significance of accurate GUI element detection for generating high-quality code, while they have no off-the-shelf detection method to use. The situation is similar for the other four participants, who are researching robotic automatic GUI testing. They want to apply a robot arm to simulate a human tester to test mobile apps without writing any testing script, which means that visual information is critical. These researchers stated that although GUI element detection producing exact element information to direct the robot tester is vital, there is no mature and accurate existing domain-specific method. And they believed an easy-to-use tool like UIED would be "more than helpful".
Table 1: Results of object detection (IoU > 0.9) and runtime efficiency
Approach F1-score Avg Time
YOLOv3 0.249 0.22s
Faster-RCNN 0.271 0.38s
CenterNet 0.282 0.34s
Xianyu 0.324 1.2s
REMAUI 0.357 5.3s
UIED 0.524 4.8s
We also surveyed two web developers. They agreed that a web application that recognizes the GUI elements in their design drawings and allows them to edit the image is "interesting and helpful". They were also very interested in the further potential of UIED, which would support an online UI2CODE function in the future, and think such a tool can be a practical assistance in web development.
6 CONCLUSION AND FUTURE WORK
In this demo, we present UIED, a GUI element detection toolkit which supports two old-fashioned computer vision approaches and three commonly used deep learning approaches. Furthermore, based on the distinct characteristics of GUIs, we implement a novel approach that combines best practices for non-text GUI element and GUI text detection. We embed our approach in UIED with the option to adjust the key parameters to best adapt to the given GUI image. UIED also provides users with an interactive and responsive dashboard with various useful functionalities to optimize the detection result, such as drag and drop, and a size and class editor. Finally, it can export the edited GUI image and the corresponding GUI element information for further usage. UIED was evaluated in terms of detection accuracy and tool usefulness. The evaluation suggests that UIED is a good starting point for software engineering work on GUI tasks.
For future work, UIED has significant potential to be expanded with other applications. For instance, our ongoing UI2CODE project that aims to synthesize code from a given GUI image will be added to the tool once mature. With code generation, this tool will be dramatically helpful in GUI development, in the way that designers can pass their GUI design to our tool and get usable code efficiently. Also, we are planning to utilize UIED as the detection part of automatic GUI testing, which is expected to be added as an extension of UIED in the future.
REFERENCES
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/ Software available from tensorflow.org.
[2] Lingfeng Bao, Jing Li, Zhenchang Xing, Xinyu Wang, and Bo Zhou. 2015. scvRipper: video scraping tool for modeling developers' behavior using interaction data. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 2. IEEE, 673–676.
[3] Carlos Bernal-Cardenas, Nathan Cooper, Kevin Moran, Oscar Chaparro, Andrian Marcus, and Denys Poshyvanyk. 2020. Translating Video Recordings of Mobile App Usages into Replayable Scenarios. In 42nd International Conference on Software Engineering (ICSE '20). ACM, New York, NY.
[4] Karl Bridge and Michael Satran. 2018. Windows Accessibility API overview. Retrieved March 2, 2020 from https://docs.microsoft.com/en-us/windows/win32/winauto/windows-automation-api-portal
[5] J. Canny. 1986. A Computational Approach to Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-8, 6 (Nov 1986), 679–698. https://doi.org/10.1109/TPAMI.1986.4767851
[6] Chunyang Chen, Sidong Feng, Zhenchang Xing, Linda Liu, Shengdong Zhao, and Jinshui Wang. 2019. Gallery DC: Design Search and Knowledge Discovery through Auto-created GUI Component Gallery. Proceedings of the ACM on Human-Computer Interaction 3, CSCW (2019), 1–22.
[7] Jieshan Chen, Mulong Xie, Zhenchang Xing, Chunyang Chen, Xiwei Xu, Liming Zhu, and Guoqiang Li. 2020. Object Detection for Graphical User Interface: Old Fashioned or Deep Learning or a Combination? arXiv:2008.05132 [cs.CV]
[8] Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. 2017. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology. 845–854.
[9] Google Developers. 2020. Protocol Buffers | Google Developers. https://developers.google.com/protocol-buffers
[10] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. 2019. CenterNet: Keypoint triplets for object detection. In Proceedings of the IEEE International Conference on Computer Vision. 6569–6578.
[11] Google. 2019. UI Automator. Retrieved March 2, 2020 from https://developer.android.com/training/testing/ui-automator
[12] Google. 2020. Build more accessible apps. Retrieved March 2, 2020 from https://developer.android.com/guide/topics/ui/accessibility
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
[14] Feng Lin, Chen Song, Xiaowei Xu, Lora Cavuoto, and Wenyao Xu. 2016. Sensing from the bottom: Smart insole enabled patient handling activity recognition through manifold learning. In 2016 IEEE First International Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE). IEEE, 254–263.
[15] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. 2017. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2117–2125.
[16] Microsoft. 2016. Introducing Spy++. Retrieved March 2, 2020 from https://docs.microsoft.com/en-us/visualstudio/debugger/introducing-spy-increment?view=vs-2019
[17] Kevin Moran, Boyang Li, Carlos Bernal-Cárdenas, Dan Jelf, and Denys Poshyvanyk. 2018. Automated reporting of GUI design violations for mobile apps. In Proceedings of the 40th International Conference on Software Engineering. 165–175.
[18] Tuan Anh Nguyen and Christoph Csallner. 2015. Reverse engineering mobile application user interfaces with REMAUI (T). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 248–259.
[19] Suporn Pongnumkul, Mira Dontcheva, Wilmot Li, Jue Wang, Lubomir Bourdev, Shai Avidan, and Michael F Cohen. 2011. Pause-and-play: automatically linking screencast video tutorials with applications. In Proceedings of the 24th annual ACM symposium on User interface software and technology. 135–144.
[20] Dilip K. Prasad, Maylor K.H. Leung, Chai Quek, and Siu-Yeung Cho. 2012. A novel framework for making dominant point detection methods non-parametric. Image and Vision Computing 30, 11 (2012), 843–859. https://doi.org/10.1016/j.imavis.2012.06.010
[21] Ju Qian, Zhengyu Shang, Shuoyan Yan, Yan Wang, and Lin Chen. 2020. RoScript: A Visual Script Driven Truly Non-Intrusive Robotic Testing System for Touch Screen Applications. In 42nd International Conference on Software Engineering (ICSE '20). ACM, New York, NY.
[22] Joseph Redmon and Ali Farhadi. 2018. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018).
[23] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems. 91–99.
[24] H. Samet and M. Tamminen. 1988. Efficient component labeling of images of arbitrary dimension represented by linear bintrees. IEEE Transactions on Pattern Analysis and Machine Intelligence 10, 4 (1988), 579–586. https://doi.org/10.1109/34.3918
[25] Ray Smith. 2007. An overview of the Tesseract OCR engine. In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Vol. 2. IEEE, 629–633.
[26] Satoshi Suzuki and Keiichi Abe. 1985. Topological structural analysis of digitized binary images by border following. Computer Vision, Graphics, and Image Processing 30, 1 (1985), 32–46. https://doi.org/10.1016/0734-189X(85)90016-7
[27] OpenCV team. 2020. https://opencv.org/
[28] PyTorch Team. 2020. https://pytorch.org/
[29] Shane Torbert. 2016. Applied computer science. Springer.
[30] Thomas D White, Gordon Fraser, and Guy J Brown. 2019. Improving random GUI testing with image-based widget detection. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis. 307–317.
[31] Tom Yeh, Tsung-Hsiang Chang, and Robert C Miller. 2009. Sikuli: using GUI screenshots for search and automation. In Proceedings of the 22nd annual ACM symposium on User interface software and technology. 183–192.
[32] Chen Yongxin, Zhang Tonghui, and Chen Jie. 2019. UI2code: How to Fine-tune Background and Foreground Analysis. Retrieved Feb 23, 2020 from https://laptrinhx.com/ui2code-how-to-fine-tune-background-and-foreground-analysis-2293652041/
[33] Dehai Zhao, Zhenchang Xing, Chunyang Chen, Xiwei Xu, Liming Zhu, Guoqiang Li, and Jinshui Wang. 2020. Seenomaly: Vision-Based Linting of GUI Animation Effects Against Design-Don't Guidelines. In 42nd International Conference on Software Engineering (ICSE '20). ACM, New York, NY, 12 pages. https://doi.org/10.1145/3377811.3380411
[34] Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. 2017. EAST: an efficient and accurate scene text detector. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 5551–5560.