Article

GUing: A Mobile GUI Search Engine using a Vision-Language Model

Authors:
  • EuroMov Digital Health in Motion, IMT Mines Alès, Univ. Montpellier

Abstract

Graphical User Interfaces (GUIs) are central to app development projects. App developers may use the GUIs of other apps as a means of requirements refinement and rapid prototyping, or as a source of inspiration for designing and improving their own apps. Recent research has thus suggested retrieving relevant GUI designs that match a certain text query from screenshot datasets acquired through crowdsourced or automated exploration of GUIs. However, such text-to-GUI retrieval approaches only leverage the textual information of the GUI elements, neglecting visual information such as icons or background images. In addition, the retrieved screenshots are not steered by app developers and lack app features that require particular input data. To overcome these limitations, this paper proposes GUing, a GUI search engine based on a vision-language model called GUIClip, which we trained specifically for the problem of designing app GUIs. For this, we first collected from Google Play app introduction images, which display the most representative screenshots and are often captioned (i.e., labelled) by app vendors. Then, we developed an automated pipeline to classify, crop, and extract the captions from these images. This resulted in a large dataset, which we share with this paper, comprising 303k app screenshots, 135k of which have captions. We used this dataset to train a novel vision-language model which is, to the best of our knowledge, the first of its kind for GUI retrieval. We evaluated our approach on various datasets from related work and in a manual experiment. The results demonstrate that our model outperforms previous approaches in text-to-GUI retrieval, achieving a Recall@10 of up to 0.69 and a HIT@10 of 0.91. We also explored the performance of GUIClip on other GUI tasks, including GUI classification and sketch-to-GUI retrieval, with encouraging results.
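
As a rough illustration of how such a CLIP-style text-to-GUI search works in practice, the sketch below embeds screenshots and a free-text query into a shared space and ranks by cosine similarity. It is a minimal sketch only: the vanilla `openai/clip-vit-base-patch32` checkpoint stands in for the paper's GUIClip model, and the screenshot paths are placeholders.

```python
# Minimal sketch of text-to-GUI retrieval with a CLIP-style model.
# "openai/clip-vit-base-patch32" stands in for the paper's GUIClip
# checkpoint; the screenshot paths are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

screenshots = ["login.png", "checkout.png", "settings.png"]  # placeholder paths
images = [Image.open(p) for p in screenshots]

with torch.no_grad():
    # Embed all screenshots once; in practice this index is precomputed.
    image_inputs = processor(images=images, return_tensors="pt")
    image_emb = model.get_image_features(**image_inputs)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

    # Embed the free-text query and rank screenshots by cosine similarity.
    text_inputs = processor(text=["screen to scan a barcode"],
                            return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

scores = (text_emb @ image_emb.T).squeeze(0)
top10 = scores.topk(k=min(10, len(screenshots)))  # Recall@10 / HIT@10 operate on this list
for rank, idx in enumerate(top10.indices.tolist(), start=1):
    print(rank, screenshots[idx], round(scores[idx].item(), 3))
```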


... However, current UI retrieval methods are limited in their ability to capture the nuanced semantic elements essential for an effective, inspirational design process. Prior research [22,52] has focused on pixel-level retrieval methods, emphasizing visual similarities. While these approaches are effective for identifying surface-level design patterns, they often overlook in-depth semantic aspects such as user interaction flow, target audience, and the emotional tone of an application. ...
... Park and Kim et al. [43] demonstrated that large-scale foundation models like CLIP [44], trained without relying on additional metadata, can achieve superior retrieval performance compared to traditional approaches. As an extension of this approach, GUIClip [52], a CLIP model fine-tuned with UI datasets, also showed better retrieval performance than prior methods. ...
... A prior study [43] revealed that foundation models like CLIP [44] outperform traditional UI-specific models (e.g., Screen2Vec [31], Deka et al. [13]) in UI screen retrieval tasks (based on app, screen type, and content similarity), without needing UI structure or other metadata. Based on these findings, we focused our comparison on state-of-the-art vision foundation models, including CLIP models (ViT-B/32 (base) and ViT-L/14@336px (most capable)) and GUIClip [52], a ViT-B/32-based CLIP model fine-tuned on UI datasets for retrieving similar UI screens. We performed zero-shot classification using these models, following the methodology outlined in prior studies [24,43,52]. ...
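
A hedged sketch of the zero-shot classification methodology mentioned above: candidate screen types are phrased as text prompts, and the screenshot is assigned to the prompt with the highest image-text similarity. The vanilla CLIP checkpoint stands in for GUIClip, and the class names, prompt template, and file path are illustrative assumptions.

```python
# Sketch of zero-shot GUI screen classification with CLIP: score the
# screenshot against one text prompt per candidate class.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["login screen", "settings screen", "map screen", "chat screen"]
prompts = [f"a screenshot of a {c}" for c in classes]  # illustrative template

inputs = processor(text=prompts, images=Image.open("screen.png"),
                   return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_image holds the scaled image-text similarities; softmax turns
# them into a zero-shot probability distribution over the candidate classes.
probs = out.logits_per_image.softmax(dim=-1).squeeze(0)
print(classes[probs.argmax().item()], round(probs.max().item(), 3))
```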
Preprint
Full-text available
Inspirational search, the process of exploring designs to inform and inspire new creative work, is pivotal in mobile user interface (UI) design. However, exploring the vast space of UI references remains a challenge. Existing AI-based UI search methods often miss crucial semantics like target users or the mood of apps. Additionally, these models typically require metadata like view hierarchies, limiting their practical use. We used a multimodal large language model (MLLM) to extract and interpret semantics from mobile UI images. We identified key UI semantics through a formative study and developed a semantic-based UI search system. Through computational and human evaluations, we demonstrate that our approach significantly outperforms existing UI retrieval methods, offering UI designers a more enriched and contextually relevant search experience. We enhance the understanding of mobile UI design semantics and highlight MLLMs' potential in inspirational search, providing a rich dataset of UI semantics for future studies.
... general and app UI design in particular. Recent studies reveal that AI can be effectively used for accurate text-to-UI retrieval [3,4]. Moreover, Generative AI, encompassing both image and text generation, can be applied to UI generation, offering an infinite source of design inspiration [5,6]. ...
... To tackle this problem, Kolthoff et al. proposed RaWi [4], an approach using text embeddings (from the metadata available about the UI) for accurate UI retrieval. Our recent work demonstrates that VLMs clearly surpass text embedding models in text-to-UI retrieval [3]. Vision-Language Models, such as CLIP [11], are multimodal models capable of learning from both images and text (i.e. ...
... VLM Fine-tuning The vanilla CLIP model [11], originally trained on general image-caption datasets that include images of dogs, buildings, and other common subjects, may not perform well in the text-to-UI retrieval task. To achieve better performance, fine-tuning the VLM with a UI-caption dataset, such as GPSCap [3], is recommended. Instead of requiring you to fine-tune your own VLM for this task, we have made our UIClip model publicly available. ...
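
For concreteness, here is a minimal sketch of what such contrastive fine-tuning on UI-caption pairs could look like, assuming a dataloader yielding (PIL image, caption) batches exists; this is not the authors' actual training code.

```python
# Sketch of contrastive fine-tuning of CLIP on UI-caption pairs.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(images, captions):
    """One step over a batch of (PIL image, caption) pairs."""
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True, truncation=True)
    out = model(**inputs)
    # CLIP's symmetric InfoNCE loss: matching (image, caption) pairs lie
    # on the diagonal of the similarity matrix.
    labels = torch.arange(out.logits_per_image.size(0))
    loss = (F.cross_entropy(out.logits_per_image, labels) +
            F.cross_entropy(out.logits_per_text, labels)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```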
Preprint
The Graphical User Interface (or simply UI) is a primary means of interaction between users and their devices. In this paper, we discuss three major complementary approaches for using Artificial Intelligence (AI) to support app designers in creating better, more diverse, and more creative UIs for mobile apps. First, designers can prompt a Large Language Model (LLM) like GPT to directly generate and adjust one or multiple UIs. Second, a Vision-Language Model (VLM) enables designers to effectively search a large screenshot dataset, e.g. from apps published in app stores. The third approach is to train a Diffusion Model (DM) specifically designed to generate app UIs as inspirational images. We discuss how AI should be used, in general, to inspire and assist creative app design rather than automate it.
... Our approach facilitates example-based retrieval [46], transforming design examples into vectors and retrieving them based on similarity measures, as illustrated in Figure 8. When designers input an image or a structured format of a TAB, our method converts it into a vector and calculates the cosine similarity. Unlike traditional keyword-based search methods and advanced natural language-based retrieval techniques [47], our framework does not rely on manually crafted labels or descriptions for each UI image. Instead, it uses a domain-specific intermediate representation derived directly from the original code in UI repositories. ...
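
A minimal sketch of the retrieval step described in this snippet, assuming design examples have already been encoded into fixed-size vectors (the structure encoder itself is not shown, and the vectors here are random stand-ins):

```python
# Example-based retrieval by cosine similarity over precomputed design vectors.
import numpy as np

def retrieve_top_k(query_vec, design_vecs, k=5):
    q = query_vec / np.linalg.norm(query_vec)
    d = design_vecs / np.linalg.norm(design_vecs, axis=1, keepdims=True)
    scores = d @ q                     # cosine similarity to every example
    top = np.argsort(-scores)[:k]      # indices of the k most similar designs
    return top, scores[top]

designs = np.random.rand(1000, 128)    # placeholder repository of design vectors
query = np.random.rand(128)            # vector of the designer's input TAB
idx, sims = retrieve_top_k(query, designs, k=5)
print(idx, sims)
```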
Article
Full-text available
The Top App Bar (TAB) seamlessly integrates essential elements such as app titles, navigation icons, action buttons, and search fields without creating visual clutter. However, designing a well-structured TAB presents a challenge, particularly for novice UI designers, due to the need to balance aesthetics, functionality, usability, and user experience within a limited space. This study introduces an auxiliary design method to address this challenge. It proposes a sequence representation learning technique to cluster TABs in software repositories based on their structure. A novice designer can input their preconceptualized structure to retrieve design examples from the software repository’s TAB clusters that have structures identical or similar to their concepts. Experimental results demonstrate the method’s effectiveness, achieving an accuracy of 66.7% and an F1-score of 0.717, highlighting its alignment with human clustering. This method not only enhances the design efficiency of novice designers but also helps them understand successful design practices in various contexts. By avoiding common pitfalls and design errors, designers can adapt and innovate based on existing solutions. The dataset used in this study, containing approximately 4228 TABs, is available on Zenodo.
... The app introduction images on Google Play are a gold mine for the inspiration of app design, particularly the Graphical User Interface (GUI), as they are carefully selected by app developers to represent the important features of the apps. Recent research mines these app introduction images and has proposed GUI search engines, such as Gallery D.C. [8,15] and GUing [52], to facilitate the search of existing app UI designs using textual queries. Recently, Wei et al. discussed how LLM-Inspiration can be combined with GUI-Mining with the app designer in the loop [49]. ...
Preprint
Over the past decade, app store (AppStore)-inspired requirements elicitation has proven to be highly beneficial. Developers often explore competitors' apps to gather inspiration for new features. With the advance of Generative AI, recent studies have demonstrated the potential of large language model (LLM)-inspired requirements elicitation. LLMs can assist in this process by providing inspiration for new feature ideas. While both approaches are gaining popularity in practice, there is a lack of insight into their differences. We report on a comparative study between AppStore- and LLM-based approaches for refining features into sub-features. By manually analyzing 1,200 sub-features recommended by both approaches, we identified their benefits, challenges, and key differences. While both approaches recommend highly relevant sub-features with clear descriptions, LLMs seem more powerful, particularly concerning novel, unseen app scopes. Moreover, some recommended features are imaginary with unclear feasibility, which suggests the importance of a human analyst in the elicitation loop.
Article
Full-text available
Most visual recognition studies rely heavily on crowd-labelled data in deep neural networks (DNNs) training, and they usually train a DNN for each single visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address the two challenges, Vision-Language Models (VLMs) have been intensively investigated recently, which learns rich vision-language correlation from web-scale image-text pairs that are almost infinitely available on the Internet and enables zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of visual language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLM that summarize the widely-adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely-adopted datasets in VLM pre-training and evaluations; (4) the review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in the future VLM studies for visual recognition. A project associated with this survey has been created at https://github.com/jingyi0000/VLM_survey .
Article
Full-text available
Rapid GUI prototyping has evolved into a widely applied technique in the early stages of software development to facilitate the clarification and refinement of requirements. High-fidelity GUI prototyping in particular has been shown to enable productive discussions with customers and mitigate potential misunderstandings; however, the benefits of applying high-fidelity GUI prototypes are accompanied by the disadvantage of being expensive and time-consuming to develop and requiring experience to create. In this work, we present RaWi, a data-driven GUI prototyping approach that effectively retrieves GUIs for reuse from a large-scale, semi-automatically created GUI repository for mobile apps on the basis of Natural Language (NL) searches, to facilitate GUI prototyping and improve its productivity by leveraging the vast GUI prototyping knowledge embodied in the repository. Retrieved GUIs can be directly reused and adapted in the graphical editor of RaWi. Moreover, we present a comprehensive evaluation methodology that enables (i) the systematic evaluation of NL-based GUI ranking methods through a novel high-quality gold standard, including an in-depth evaluation of traditional IR and state-of-the-art BERT-based models for GUI ranking, and (ii) the assessment of GUI prototyping productivity, accompanied by an extensive user study in a practical GUI prototyping environment.
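
A hedged sketch of NL-based GUI ranking with a BERT-style sentence encoder, loosely in the spirit of the models evaluated here; the `all-MiniLM-L6-v2` checkpoint and the GUI metadata texts are illustrative stand-ins, not RaWi's actual setup.

```python
# Rank GUIs by semantic similarity between an NL query and GUI text metadata.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder GUI metadata (visible labels, titles, ...) per repository screen.
gui_texts = [
    "Sign in email password forgot password",
    "Shopping cart checkout total price",
]
gui_emb = encoder.encode(gui_texts, convert_to_tensor=True,
                         normalize_embeddings=True)

query_emb = encoder.encode("login screen with password reset",
                           convert_to_tensor=True, normalize_embeddings=True)

# Cosine-similarity search over the embedded repository.
hits = util.semantic_search(query_emb, gui_emb, top_k=2)[0]
for h in hits:
    print(h["corpus_id"], round(h["score"], 3))
```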
Article
Full-text available
Research has repeatedly shown that high-quality requirements are essential for the success of development projects. While the term “quality” is pervasive in the field of requirements engineering and while the body of research on requirements quality is large, there is no meta-study of the field that overviews and compares the concrete quality attributes addressed by the community. To fill this knowledge gap, we conducted a systematic mapping study of the scientific literature. We retrieved 6905 articles from six academic databases, which we filtered down to 105 relevant primary studies. The primary studies use empirical research to explicitly define, improve, or evaluate requirements quality. We found that empirical research on requirements quality focuses on improvement techniques, with very few primary studies addressing evidence-based definitions and evaluations of quality attributes. Among the 12 quality attributes identified, the most prominent in the field are ambiguity, completeness, consistency, and correctness. We identified 111 sub-types of quality attributes such as “template conformance” for consistency or “passive voice” for ambiguity. Ambiguity has the largest share of these sub-types. The artefacts being studied are mostly referred to in the broadest sense as “requirements”, while little research targets quality attributes in specific types of requirements such as use cases or user stories. Our findings highlight the need to conduct more empirically grounded research defining requirements quality, using more varied research methods, and addressing a more diverse set of requirements types.
Article
Full-text available
UI (User Interface) is an essential factor influencing users’ perception of an app. However, it is hard even for professional designers to determine whether a UI is good for end-users. Users’ feedback (e.g., user reviews on Google Play) provides a way for app owners to understand how users perceive the UI. In this article, we conduct an in-depth empirical study to analyze the UI issues of mobile apps. In particular, we analyze more than 3M UI-related reviews from 22,199 top free-to-download apps and 9,380 top non-free apps in the Google Play Store. By comparing the rating of UI-related reviews and other reviews of an app, we observe that UI-related reviews have lower ratings than other reviews. By manually analyzing a random sample of 1,447 UI-related reviews with a 95% confidence level and a 5% interval, we identify 17 UI-related issue types that belong to four categories (i.e., “Appearance,” “Interaction,” “Experience,” and “Others”). Among these issue types, we find “Generic Review” to be the most frequent. “Comparative Review” and “Advertisement” are the two most negative UI issue types. Faced with these UI issues, we explore the patterns of interaction between app owners and users. We identify eight patterns of how app owners dialogue with users about UI issues through the review-response mechanism. We find “Apology or Appreciation” and “Information Request” to be the two most frequent patterns. We find that updating the UI in a timely manner according to feedback is essential to satisfying users. Besides, app owners can also fix UI issues without updating the UI, especially for issue types belonging to the “Interaction” category. Our findings show a positive impact when app owners actively interact with users to improve UI quality and boost users’ satisfaction with the UIs.
Article
Full-text available
UI design is an integral part of software development. For many developers who do not have much UI design experience, exposure to a large database of real-application UI designs can help them quickly build up a realistic understanding of the design space for a software feature and get design inspiration from existing applications. However, existing keyword-based, image-similarity-based, and component-matching-based methods cannot reliably find relevant high-fidelity UI designs in a large database that are similar to the UI wireframe a developer sketches, in the face of the great variation in UI designs. In this article, we propose a deep-learning-based UI design search engine to fill this gap. The key innovation of our search engine is to train a wireframe image autoencoder using a large database of real-application UI designs, without the need for labeling relevant UI designs. We implement our approach for Android UI design search and conduct extensive experiments with artificially created relevant UI designs and human evaluation of UI design search results. Our experiments confirm the superior performance of our search engine over existing image-similarity- or component-matching-based methods and demonstrate the usefulness of our search engine in real-world UI design tasks.
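
A minimal sketch of the core idea, assuming 64x64 binarized wireframe images: train an autoencoder purely on reconstruction (no labels), then retrieve designs by distance in the latent space. Sizes and architecture are illustrative, not the paper's.

```python
# Wireframe-image autoencoder for label-free design search.
import torch
import torch.nn as nn

class WireframeAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Flatten(), nn.Linear(32 * 16 * 16, 64),             # latent code
        )
        self.decoder = nn.Sequential(
            nn.Linear(64, 32 * 16 * 16), nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = WireframeAE()
x = torch.rand(8, 1, 64, 64)             # batch of binarized wireframes
recon, z = model(x)
loss = nn.functional.mse_loss(recon, x)  # reconstruction objective, no labels
# At query time, encode the sketched wireframe and rank the repository by
# Euclidean distance between latent codes z.
```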
Article
Full-text available
In this paper, we study the app store as a phenomenon from the developers’ perspective to investigate the extent to which app stores affect software engineering tasks. Through developer interviews and questionnaires, we uncover findings that highlight and quantify the effects of three high-level app store themes: bridging the gap between developers and users, increasing market transparency, and affecting mobile release management. Our findings have implications for the testing, requirements engineering, and mining software repositories research fields. These findings can help guide future research in supporting mobile app developers through a deeper understanding of the app store-developer interaction.
Article
Full-text available
Developers always focus on delivering high-quality updates to improve, or maintain the rating of their apps. Prior work has studied user reviews by analyzing all reviews of an app. However, this app-level analysis misses the point that users post reviews to provide their feedback on a certain update. For example, two bad updates of an app with a history of good updates would not be spotted using app-level analysis. In this paper, we examine reviews at the update-level to better understand how users perceive bad updates. We focus our study on the top 250 bad updates (i.e., updates with the highest increase in the percentage of negative reviews relative to the prior updates of the app) from 26,726 updates of 2,526 top free-to-download apps in the Google Play Store. We find that feature removal and UI issues have the highest increase in the percentage of negative reviews. Bad updates with crashes and functional issues are the most likely to be fixed by a later update. However, developers often do not mention these fixes in the release notes. Our work demonstrates the necessity of an update-level analysis of reviews to capture the feelings of an app's user-base about a particular update.
Conference Paper
Full-text available
A typical way to design and develop a mobile app is to sketch the graphical user interfaces (GUIs) for the different screens in the app and then create actual GUIs from these sketches. Doing so involves identifying which layouts to use, which widgets to add, and how to configure and connect the different pieces of the GUI. To help with this difficult and time-consuming task, we propose GUIFetch, a technique that takes as input the sketch for an app and leverages the growing number of open source apps in public repositories to identify apps with GUIs and transitions that are similar to those in the provided sketch. GUIFetch first searches public repositories to find potential apps using keyword matching. It then builds models of the identified apps' screens and screen transitions using a combination of static and dynamic analyses and computes a similarity metric between the models and the provided sketch. Finally, GUIFetch ranks the identified apps (or parts thereof) based on their computed similarity value and produces a visual ranking of the results together with the code of the corresponding apps. We implemented GUIFetch for Android apps and evaluated it through user studies involving different types of apps.
Article
Full-text available
The inception of a mobile app often takes form of a mock-up of the Graphical User Interface (GUI), represented as a static image delineating the proper layout and style of GUI widgets that satisfy requirements. Following this initial mock-up, the design artifacts are then handed off to developers whose goal is to accurately implement these GUIs and the desired functionality in code. Given the sizable abstraction gap between mock-ups and code, developers often introduce mistakes related to the GUI that can negatively impact an app's success in highly competitive marketplaces. Moreover, such mistakes are common in the evolutionary context of rapidly changing apps. This leads to the time-consuming and laborious task of design teams verifying that each screen of an app was implemented according to intended design specifications. This paper introduces a novel, automated approach for verifying whether the GUI of a mobile app was implemented according to its intended design. Our approach resolves GUI-related information from both implemented apps and mock-ups and uses computer vision techniques to identify common errors in the implementations of mobile GUIs. We implemented this approach for Android in a tool called GVT and carried out both a controlled empirical evaluation with open-source apps as well as an industrial evaluation with designers and developers from Huawei. The results show that GVT solves an important, difficult, and highly practical problem with remarkable efficiency and accuracy and is both useful and scalable from the point of view of industrial designers and developers. The tool is currently used by over one-thousand industrial designers and developers at Huawei to improve the quality of their mobile apps.
Article
Full-text available
It is common practice for developers of user-facing software to transform a mock-up of a graphical user interface (GUI) into code. This process takes place both at an application's inception and in an evolutionary context as GUI changes keep pace with evolving features. Unfortunately, this practice is challenging and time-consuming. In this paper, we present an approach that automates this process by enabling accurate prototyping of GUIs via three tasks: detection, classification, and assembly. First, logical components of a GUI are detected from a mock-up artifact using either computer vision techniques or mock-up metadata. Then, software repository mining, automated dynamic analysis, and deep convolutional neural networks are utilized to accurately classify GUI-components into domain-specific types (e.g., toggle-button). Finally, a data-driven, K-nearest-neighbors algorithm generates a suitable hierarchical GUI structure from which a prototype application can be automatically assembled. We implemented this approach for Android in a system called ReDraw. Our evaluation illustrates that ReDraw achieves an average GUI-component classification accuracy of 91% and assembles prototype applications that closely mirror target mock-ups in terms of visual affinity while exhibiting reasonable code structure. Interviews with industrial practitioners illustrate ReDraw's potential to improve real development workflows.
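
As a rough illustration of the classification stage only (not ReDraw's actual network), the sketch below maps cropped GUI-component images to domain-specific widget types with an off-the-shelf CNN; the class list and inputs are placeholders.

```python
# Classify cropped GUI-component images into domain-specific widget types.
import torch
import torch.nn as nn
from torchvision import models

widget_types = ["toggle-button", "text-view", "image-view", "edit-text"]

net = models.resnet18(weights=None)                         # untrained backbone
net.fc = nn.Linear(net.fc.in_features, len(widget_types))   # widget-type head

crops = torch.rand(4, 3, 224, 224)   # placeholder cropped component images
logits = net(crops)
pred = logits.argmax(dim=1)
print([widget_types[i] for i in pred.tolist()])
```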
Article
Full-text available
App Store Analysis studies information about applications obtained from app stores. App stores provide a wealth of information derived from users that would not exist had the applications been distributed via previous software deployment methods. App Store Analysis combines this non-technical information with technical information to learn trends and behaviours within these forms of software repositories. Findings from App Store Analysis have a direct and actionable impact on the software teams that develop software for app stores, and have led to techniques for requirements engineering, release planning, software design, security and testing. This survey describes and compares the areas of research that have been explored thus far, drawing out common aspects, trends and directions future research should take to address open problems and challenges.
Preprint
User feedback is becoming an increasingly important source of information for requirements engineering, user interface design, and software engineering in general. Nowadays, user feedback is widely available and easily accessible in social media, product forums, and app stores. Over the last decade, research has shown that user feedback can help software teams: a) better understand how users are actually using specific product features and components, b) identify, reproduce, and fix defects faster, and c) get inspiration for improvements or new features. However, to tap the full potential of feedback, two main challenges need to be solved. First, software vendors must cope with a large quantity of feedback data, which is hard to manage manually. Second, vendors must also cope with a varying quality of feedback, as some items might be uninformative, repetitive, or simply wrong. This chapter summarises and pipelines various data mining, machine learning, and natural language processing techniques, including recent Large Language Models, to cope with the quantity and quality challenges. We guide researchers and practitioners through implementing effective, actionable analysis of user feedback for software and requirements engineering.
Article
Being able to describe any user interface (UI) screenshot in natural language can promote understanding of the main purpose of the UI, yet currently it cannot be accomplished with state-of-the-art captioning systems. We introduce XUI, a novel method inspired by the global precedence effect to create informative descriptions of UIs, starting with an overview and then providing fine-grained descriptions about the most salient elements. XUI builds upon computational models for topic classification, visual saliency prediction, and natural language generation. XUI provides descriptions with up to three different granularity levels that, together, describe what is in the interface and what the user can do with it. We found that XUI descriptions are highly readable, are perceived to accurately describe the UI, and score similarly to human-generated UI descriptions. XUI is available as open source software.
Conference Paper
As the Transformer has evolved, pre-trained models have advanced at a breakneck pace in recent years. They have come to dominate the mainstream techniques in natural language processing (NLP) and computer vision (CV). How to adapt pre-training to the field of Vision-and-Language (V-L) learning and improve downstream task performance has become a focus of multimodal learning. In this paper, we review the recent progress in Vision-Language Pre-Trained Models (VL-PTMs). As the core content, we first briefly introduce several ways to encode raw images and texts into single-modal embeddings before pre-training. Then, we dive into the mainstream architectures of VL-PTMs in modeling the interaction between text and image representations. We further present widely used pre-training tasks and then introduce some common downstream tasks. We finally conclude this paper and present some promising research directions. Our survey aims to provide researchers with a synthesis of, and pointers to, related research.
Article
Before developing a new mobile app, the development team usually invests painstaking effort in reviewing many existing apps with similar purposes. The review process is crucial in the sense that it reduces market risks and provides inspiration for app development. However, manual exploration of hundreds of existing apps by different roles (e.g., product manager, UI/UX designer, developer, and tester) can be ineffective. For example, it is difficult to completely explore all the functionalities of an app from different aspects, including design, implementation, and testing, in a short period of time. Moreover, existing reverse engineering tools only provide basic artifacts such as AndroidManifest.xml and Java source files to users. Following the conception of the storyboard in movie production, we propose a system, named StoryDistiller, to automatically generate storyboards for Android apps with rich features through reverse engineering, and to assist different roles in reviewing and analyzing apps effectively and efficiently. Specifically, we (1) propose a hybrid method to extract a relatively complete Activity transition graph (ATG): it first extracts the ATG of Android apps through static analysis and further leverages dynamic component exploration to augment the ATG; (2) extract the required inter-component communication (ICC) data of each target Activity by leveraging static data-flow analysis, and render UI pages dynamically by using app instrumentation together with the extracted ICC data; (3) obtain rich features including a comprehensive ATG with rendered UI pages, semantic activity names, corresponding logic and layout code, etc.; and (4) implement the storyboard visualization as a web service with the rendered UI pages and the corresponding rich features. Our experiments unveil that StoryDistiller is effective and indeed useful in assisting app exploration and review. We also conduct a comprehensive comparison study to demonstrate its better performance over IC3, Gator, Stoat, and StoryDroid.
Article
Large-scale pre-trained models (PTMs) such as BERT and GPT have recently achieved great success and become a milestone in the field of artificial intelligence (AI). Owing to sophisticated pre-training objectives and huge model parameters, large-scale PTMs can effectively capture knowledge from massive labeled and unlabeled data. By storing knowledge into huge parameters and fine-tuning on specific tasks, the rich knowledge implicitly encoded in huge parameters can benefit a variety of downstream tasks, which has been extensively demonstrated via experimental verification and empirical analysis. It is now the consensus of the AI community to adopt PTMs as backbone for downstream tasks rather than learning models from scratch. In this paper, we take a deep look into the history of pre-training, especially its special relation with transfer learning and self-supervised learning, to reveal the crucial position of PTMs in the AI development spectrum. Further, we comprehensively review the latest breakthroughs of PTMs. These breakthroughs are driven by the surge of computational power and the increasing availability of data, towards four important directions: designing effective architectures, utilizing rich contexts, improving computational efficiency, and conducting interpretation and theoretical analysis. Finally, we discuss a series of open problems and research directions of PTMs, and hope our view can inspire and advance the future study of PTMs.
Article
With the vast number of apps and the complexity of their features, it is becoming challenging for teachers to select a suitable learning app for their courses. Several evaluation frameworks have been proposed in the literature to assist teachers with this selection. The iPAC framework is a well-established mobile learning framework highlighting the learners’ experience of personalization, authenticity, and collaboration (iPAC). In this article, we introduce an approach to automate the identification and comparison of iPAC relevant apps. We experiment with natural language processing and machine learning techniques, using data from the app description and app reviews publicly available in app stores. We further empirically validate the keyword base of the iPAC framework based on the app users’ language in app reviews. Our approach automatically identifies iPAC relevant apps with promising results (F1 score ≈72%) and evaluates them similarly to domain experts (Spearman's rank correlation 0.54). We discuss how our findings can be useful for teachers, students, and app vendors.
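
As a toy illustration of such an NLP-and-ML pipeline (not the authors' actual setup), the sketch below classifies app store text as iPAC-relevant or not; the training examples and labels are purely illustrative.

```python
# Tiny text-classification pipeline over app store text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["students collaborate on shared projects",
               "flashlight utility with brightness control"]
train_labels = [1, 0]  # 1 = iPAC-relevant, 0 = not (illustrative labels)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_texts, train_labels)
print(clf.predict(["learners personalize tasks and share results"]))
```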
Chapter
We present a new method that views object detection as a direct set prediction problem. Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components like a non-maximum suppression procedure or anchor generation that explicitly encode our prior knowledge about the task. The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. The new model is conceptually simple and does not require a specialized library, unlike many other modern detectors. DETR demonstrates accuracy and run-time performance on par with the well-established and highly-optimized Faster R-CNN baseline on the challenging COCO object detection dataset. Moreover, DETR can be easily generalized to produce panoptic segmentation in a unified manner. We show that it significantly outperforms competitive baselines. Training code and pretrained models are available at https://github.com/facebookresearch/detr.
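
A toy sketch of the bipartite matching at the heart of this set-prediction view: each ground-truth object is assigned to exactly one object query by minimizing a pairwise cost (here a simplified classification plus L1 box cost, not DETR's full cost with generalized IoU).

```python
# DETR-style unique assignment of predictions to ground truth via the
# Hungarian algorithm; all tensors are random stand-ins.
import torch
from scipy.optimize import linear_sum_assignment

num_queries, num_targets = 5, 2
pred_probs = torch.softmax(torch.randn(num_queries, 3), dim=-1)  # 2 classes + "no object"
pred_boxes = torch.rand(num_queries, 4)
tgt_labels = torch.tensor([0, 1])
tgt_boxes = torch.rand(num_targets, 4)

# Cost matrix: rows = object queries, columns = ground-truth objects.
cost_class = -pred_probs[:, tgt_labels]              # prefer confident correct class
cost_bbox = torch.cdist(pred_boxes, tgt_boxes, p=1)  # L1 distance between boxes
cost = cost_class + cost_bbox

row_ind, col_ind = linear_sum_assignment(cost.numpy())
print(list(zip(row_ind.tolist(), col_ind.tolist())))  # unique query -> object pairs
```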
Article
Online communities like Dribbble and GraphicBurger allow GUI designers to share their design artwork and learn from each other. These design sharing platforms are important sources of design inspiration, but our survey with GUI designers suggests additional information needs unmet by existing design sharing platforms. First, designers need to see the practical use of certain GUI designs in real applications, rather than just artworks. Second, designers want to see not only the overall designs but also the detailed design of the GUI components. Third, designers need advanced GUI design search abilities (e.g., multi-faceted search) and knowledge discovery support (e.g., demographic investigation, cross-company design comparison). This paper presents Gallery D.C. (http://mui-collection.herokuapp.com/), a gallery of GUI design components that harnesses GUI designs crawled from millions of real-world applications using reverse-engineering and computer vision techniques. Through a process of invisible crowdsourcing, Gallery D.C. supports novel ways for designers to collect, analyze, search, summarize, and compare GUI designs on a massive scale. We quantitatively evaluate the quality of Gallery D.C. and demonstrate that it offers additional support for design sharing and knowledge discovery beyond existing platforms.
Article
App stores are highly competitive markets, sometimes offering dozens of apps for a single use case. Unexpected app changes such as a feature removal might incite even loyal users to explore alternative apps. Sentiment analysis tools can help monitor users’ emotions expressed, e.g., in app reviews or tweets. We found that these emotions include four recurring patterns corresponding to the app releases. Based on these patterns and online reports about popular apps, we derived five release lessons to assist app vendors maintain positive emotions and gain competitive advantages.
Conference Paper
Sketches and real-world user interface examples are frequently used in multiple stages of the user interface design process. Unfortunately, finding relevant user interface examples, especially in large-scale datasets, is a highly challenging task because user interfaces have aesthetic and functional properties that are only indirectly reflected by their corresponding pixel data and meta-data. This paper introduces Swire, a sketch-based, neural-network-driven technique for retrieving user interfaces. We collect the first large-scale user interface sketch dataset during the development of Swire, which researchers can use to develop new sketch-based, data-driven design interfaces and applications. Swire achieves high performance in querying user interfaces: for a known validation task, it retrieves the most relevant example within the top-10 results for over 60% of queries. With this technique, designers can for the first time accurately retrieve relevant user interface examples with free-form sketches natural to their design workflows. We demonstrate several novel applications driven by Swire that could greatly augment the user interface design process.
Conference Paper
Recently, researchers have developed black-box approaches to mine design and interaction data from mobile apps. Although the data captured during this interaction mining is descriptive, it does not expose the design semantics of UIs: what elements on the screen mean and how they are used. This paper introduces an automatic approach for generating semantic annotations for mobile app UIs. Through an iterative open coding of 73k UI elements and 720 screens, we contribute a lexical database of 25 types of UI components, 197 text button concepts, and 135 icon classes shared across apps. We use this labeled data to learn code-based patterns to detect UI components and to train a convolutional neural network that distinguishes between icon classes with 94% accuracy. To demonstrate the efficacy of our approach at scale, we compute semantic annotations for the 72k unique UIs in the Rico dataset, assigning labels for 78% of the total visible, non-redundant elements.
Conference Paper
Millennials prefer learning-by-doing, social learning, and project-based learning. At the same time, software engineering educators should take into account that software development projects require not only technical skills but also creativity; Design Thinking can help in achieving the latter. We conducted a workshop, following the Design Thinking approach of the d.school, to help students generate ideas for a mobile app development project course. In addition to the details for implementing the workshop, we report our observations and lessons learned. Moreover, we provide suggestions for further implementation of Design Thinking in similar courses.
Conference Paper
A GUI skeleton is the starting point for implementing a UI design image. To obtain a GUI skeleton from a UI design image, developers have to visually understand UI elements and their spatial layout in the image, and then translate this understanding into proper GUI components and their compositions. Automating this visual understanding and translation would be beneficial for bootstrapping mobile GUI implementation, but it is a challenging task due to the diversity of UI designs and the complexity of GUI skeletons to generate. Existing tools are rigid as they depend on heuristically-designed visual understanding and GUI generation rules. In this paper, we present a neural machine translator that combines recent advances in computer vision and machine translation for translating a UI design image into a GUI skeleton. Our translator learns to extract visual features in UI images, encode these features' spatial layouts, and generate GUI skeletons in a unified neural network framework, without requiring manual rule development. For training our translator, we develop an automated GUI exploration method to automatically collect large-scale UI data from real-world applications. We carry out extensive experiments to evaluate the accuracy, generality and usefulness of our approach.
Article
We note that common implementations of adaptive gradient algorithms, such as Adam, limit the potential benefit of weight decay regularization, because the weights do not decay multiplicatively (as would be expected for standard weight decay) but by an additive constant factor. We propose a simple way to resolve this issue by decoupling weight decay and the optimization steps taken w.r.t. the loss function. We provide empirical evidence that our proposed modification (i) decouples the optimal choice of weight decay factor from the setting of the learning rate for both standard SGD and Adam, and (ii) substantially improves Adam's generalization performance, allowing it to compete with SGD with momentum on image classification datasets (on which it was previously typically outperformed by the latter). We also demonstrate that longer optimization runs require smaller weight decay values for optimal results and introduce a normalized variant of weight decay to reduce this dependence. Finally, we propose a version of Adam with warm restarts (AdamWR) that has strong anytime performance while achieving state-of-the-art results on CIFAR-10 and ImageNet32x32. Our source code is available at https://github.com/loshchil/AdamW-and-SGDW
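
A schematic sketch of the decoupling described above: the weight decay term multiplies the weights directly and is kept out of the gradient that Adam's moment estimates adapt. Hyperparameters and the stand-in gradient are illustrative.

```python
# Manual AdamW-style update: weight decay decoupled from the adaptive step.
import torch

lr, wd, beta1, beta2, eps = 1e-3, 1e-2, 0.9, 0.999, 1e-8
w = torch.randn(10)
m = torch.zeros_like(w); v = torch.zeros_like(w); t = 0

def adamw_step(grad):
    global m, v, t
    t += 1
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Adam step on the loss gradient only; decay applied directly to weights.
    return w - lr * m_hat / (v_hat.sqrt() + eps) - lr * wd * w

grad = 2 * w              # gradient of the stand-in loss sum(w^2)
w = adamw_step(grad)
```

In practice this is what `torch.optim.AdamW` implements, in contrast to plain Adam with `weight_decay`, which folds the decay into the gradient as L2 regularization.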
Conference Paper
Data-driven models help mobile app designers understand best practices and trends, and can be used to make predictions about design performance and support the creation of adaptive UIs. This paper presents Rico, the largest repository of mobile app designs to date, created to support five classes of data-driven applications: design search, UI layout generation, UI code generation, user interaction modeling, and user perception prediction. To create Rico, we built a system that combines crowdsourcing and automation to scalably mine design and interaction data from Android apps at runtime. The Rico dataset contains design data from more than 9.7k Android apps spanning 27 categories. It exposes visual, textual, structural, and interactive design properties of more than 72k unique UI screens. To demonstrate the kinds of applications that Rico enables, we present results from training an autoencoder for UI layout similarity, which supports query-by-example search over UIs.
Article
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
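
The core of the architecture is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V; a minimal sketch (single head, no masking) follows.

```python
# Scaled dot-product attention, the building block of the Transformer.
import math
import torch

def attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # pairwise similarities
    weights = torch.softmax(scores, dim=-1)            # attention distribution
    return weights @ v                                 # weighted sum of values

q = torch.randn(2, 5, 64)  # (batch, positions, d_k)
k = torch.randn(2, 5, 64)
v = torch.randn(2, 5, 64)
out = attention(q, k, v)   # shape (2, 5, 64)
```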
Article
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Conference Paper
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in understanding an object's precise 2D location. Our dataset contains photos of 91 objects types that would be easily recognizable by a 4 year old along with per-instance segmentation masks. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
Conference Paper
The first step to comprehend and fix a software bug is usually to reproduce the corresponding failure. Reproducing a failure requires information about steps to reproduce, i.e. the steps necessary to make a failure occur in the development environment. In case of an application with a user interface, steps to reproduce consist of the interactions between a user and the application that precede the failure. Unfortunately, bug reports typically lack this information. Users are either unaware of its importance to developers, are unable to describe it, or simply do not have time to report it. In this paper, we present a simple but effective and resource efficient approach to monitor interactions between users and their applications selectively at a high level of abstraction, e.g. editing operations and commands. This minimizes the monitoring overhead and enables developers to analyze user interaction traces. We map monitored interactions to a taxonomy of user interactions to help developers comprehend user behavior. Further, we present the Timeline Tool that visualizes monitored interaction traces preceding failures. To evaluate our approach we conducted an experiment with 12 participants and asked them to reproduce bug reports from an open-source project. We found that developers are able to derive steps to reproduce from monitored interaction traces. In particular, inexperienced developers profit from the Timeline Tool, as they are able to reproduce failures that they cannot reproduce without it. The monitoring overhead is rather small (approx. 5% CPU and 2-5% memory) and users feel it does not influence their work in a negative way.
Conference Paper
Current requirements engineering practices for gathering user input are characterized by a number of communication gaps between users and engineers, which might lead to wrong requirements. The problem situations and context which underlie user input are either gathered long after the fact or submitted with the wrong level of detail. We think that making user input a first-order concern of both software processes and software systems harbours many innovation opportunities. We propose and discuss a continuous and context-aware approach for communicating user input to engineering teams and other users, by a) instrumenting the problem domain, b) proactively recommending to share feedback, and c) annotating graphical interfaces.
Conference Paper
If software engineering tools are not "properly integrated", they can reduce engineers' productivity. Associating and retrieving information scattered across the tools become unsystematic and inefficient. Our work provides empirical evidence on what is a "poor" and a "proper" tool integration, focusing on practitioners' perspectives. We interviewed 62 engineers and analyzed the content of their project artifacts. We identified problem situations and practices related to tool integration. Engineers agreed that tool integration approaches must support change, heterogeneity and automatic linking of change to context. To quantify our results, we conducted a field experiment with 27 and a survey with 782 subjects. We found a strong correlation between change frequency and preferred integration approaches. Particularly in projects with short release cycles, tasks should be used to link information handled by different tools. We also found that half of engineers' work is not defined as tasks. Therefore, a context-based tool integration approach is more effective than a task-based one.
Article
Consider a set S of n data points in real d-dimensional space, R^d, where distances are measured using any Minkowski metric. In nearest neighbor searching, we preprocess S into a data structure, so that given any query point q ∈ R^d, the closest point of S to q can be reported quickly. Given any positive real ε, a data point p is a (1 + ε)-approximate nearest neighbor of q if its distance from q is within a factor of (1 + ε) of the distance to the true nearest neighbor. We show that it is possible to preprocess a set of n points in R^d in O(dn log n) time and O(dn) space, so that given a query point q ∈ R^d and ε > 0, a (1 + ε)-approximate nearest neighbor of q can be computed in O(c_{d,ε} log n) time, where c_{d,ε} ≤ d⌈1 + 6d/ε⌉^d is a factor depending only on dimension and ε. In general, we show that given an integer k ≥ 1, (1 + ε)-approximations to the k nearest neighbors of q can be computed in additional O(kd log n) time.
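
SciPy's k-d tree exposes an `eps` parameter with an analogous (1 + ε) guarantee, which gives a quick way to experiment with approximate nearest neighbor queries of this kind; the data here is random and purely illustrative.

```python
# (1 + eps)-approximate nearest neighbor queries with a k-d tree.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((10_000, 8))   # n points in R^d
tree = cKDTree(points)             # preprocessing step

query = rng.random(8)
# The returned distance is guaranteed to be within (1 + eps) of the true
# nearest-neighbor distance.
dist, idx = tree.query(query, k=1, eps=0.25)
print(idx, dist)
```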