Walid Maalej’s research while affiliated with the University of Hamburg and other places


Publications (158)


Explaining Explanations: An Empirical Study of Explanations in Code Reviews
  • Article

December 2024 · 7 Reads · 2 Citations

ACM Transactions on Software Engineering and Methodology

Ratnadira Widyasari · [...]

Code reviews are central for software quality assurance. Ideally, reviewers should explain their feedback to enable authors of code changes to understand the feedback and act accordingly. Different developers might need different explanations in different contexts. Therefore, assisting this process first requires understanding the types of explanations reviewers usually provide. The goal of this paper is to study the types of explanations used in code reviews and explore the potential of Large Language Models (LLMs), specifically ChatGPT, in generating these specific types. We extracted 793 code review comments from Gerrit and manually labeled them based on whether they contained a suggestion, an explanation, or both. Our analysis shows that 42% of comments only include suggestions without explanations. We categorized the explanations into seven distinct types including rule or principle, similar examples, and future implications. When measuring their prevalence, we observed that some explanations are used differently by novice and experienced reviewers. Our manual evaluation shows that, when the explanation type is specified, ChatGPT can correctly generate the explanation in 88 out of 90 cases. This foundational work highlights the potential for future automation in code reviews, which can assist developers in sharing and obtaining different types of explanations as needed, thereby reducing back-and-forth communication.
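
As a rough illustration of how such type-specific explanation generation could be driven, the sketch below prompts an LLM to add a "rule or principle" explanation to a bare review suggestion. The model name, prompt wording, and example comment are illustrative assumptions, not the paper's actual setup.

```python
# Hypothetical sketch: ask an LLM to add a specific explanation type
# ("rule or principle") to a review comment that only contains a suggestion.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

suggestion = "Please rename `tmp` to something more descriptive."
explanation_type = "rule or principle"

prompt = (
    "You are a code reviewer. The following review comment contains a "
    f"suggestion without an explanation:\n\n{suggestion}\n\n"
    f"Add a short explanation of the type '{explanation_type}' that tells "
    "the author why the suggested change is worthwhile."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model; the study used ChatGPT
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```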


GUing: A Mobile GUI Search Engine using a Vision-Language Model

November 2024 · 15 Reads · 3 Citations

ACM Transactions on Software Engineering and Methodology

Graphical User Interfaces (GUIs) are central to app development projects. App developers may use the GUIs of other apps as a means of requirements refinement and rapid prototyping or as a source of inspiration for designing and improving their own apps. Recent research has thus suggested retrieving relevant GUI designs that match a certain text query from screenshot datasets acquired through crowdsourced or automated exploration of GUIs. However, such text-to-GUI retrieval approaches only leverage the textual information of the GUI elements, neglecting visual information such as icons or background images. In addition, retrieved screenshots are not steered by app developers and lack app features that require particular input data. To overcome these limitations, this paper proposes GUing, a GUI search engine based on a vision-language model called GUIClip, which we trained specifically for the problem of designing app GUIs. For this, we first collected app introduction images from Google Play, which display the most representative screenshots and are often captioned (i.e. labelled) by app vendors. Then, we developed an automated pipeline to classify, crop, and extract the captions from these images. This resulted in a large dataset, which we share with this paper, comprising 303k app screenshots, of which 135k have captions. We used this dataset to train a novel vision-language model, which is, to the best of our knowledge, the first of its kind for GUI retrieval. We evaluated our approach on various datasets from related work and in a manual experiment. The results demonstrate that our model outperforms previous approaches in text-to-GUI retrieval, achieving a Recall@10 of up to 0.69 and a HIT@10 of 0.91. We also explored the performance of GUIClip for other GUI tasks, including GUI classification and sketch-to-GUI retrieval, with encouraging results.
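
The paper describes GUIClip as a CLIP-style vision-language model used for text-to-GUI retrieval. As a rough, generic illustration of that retrieval setting, the sketch below embeds a text query and a folder of screenshots with a pretrained CLIP model from the open_clip library and ranks the screenshots by cosine similarity; the model choice, dataset path, and query are assumptions and only stand in for the paper's GUIClip.

```python
# Minimal sketch of CLIP-style text-to-GUI retrieval: embed a query and a set
# of screenshots, then rank screenshots by cosine similarity to the query.
from pathlib import Path

import torch
import open_clip
from PIL import Image

# Generic pretrained CLIP as a stand-in for GUIClip (illustrative choice).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

screenshot_paths = sorted(Path("screenshots").glob("*.png"))  # assumed folder
images = torch.stack(
    [preprocess(Image.open(p).convert("RGB")) for p in screenshot_paths]
)
query = tokenizer(["login screen with social sign-in buttons"])

with torch.no_grad():
    image_emb = model.encode_image(images)
    text_emb = model.encode_text(query)

# Normalise embeddings, compute cosine similarities, and print the top 10 hits.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
scores = (image_emb @ text_emb.T).squeeze(1)

for rank, idx in enumerate(scores.argsort(descending=True)[:10], start=1):
    i = int(idx)
    print(f"{rank:2d}. {screenshot_paths[i].name}  score={scores[i].item():.3f}")
```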


Fig. 1: User Interface of UX-LLM.
Fig. 6: Venn diagram showing the overlap of usability issues identified by usability testing, expert reviews, and UX-LLM.
Does GenAI Make Usability Testing Obsolete?
  • Preprint
  • File available

November 2024 · 55 Reads

Ensuring usability is crucial for the success of mobile apps. Usability issues can compromise user experience and negatively impact the perceived app quality. This paper presents UX-LLM, a novel tool powered by a Large Vision-Language Model that predicts usability issues in iOS apps. To evaluate the performance of UX-LLM, we predicted usability issues in two open-source apps of medium complexity and asked usability experts to assess the predictions. We also performed traditional usability testing and expert review for both apps and compared the results to those of UX-LLM. UX-LLM demonstrated precision ranging from 0.61 to 0.66 and recall between 0.35 and 0.38, indicating its ability to identify valid usability issues, yet failing to capture the majority of issues. Finally, we conducted a focus group with an app development team of a capstone project developing a transit app for visually impaired persons. The focus group expressed positive perceptions of UX-LLM as it identified unknown usability issues in their app. However, they also raised concerns about its integration into the development workflow, suggesting potential improvements. Our results show that UX-LLM cannot fully replace traditional usability evaluation methods but serves as a valuable supplement, particularly for small teams with limited resources, to identify issues in less common user paths, due to its ability to inspect the source code.


Fig. 2: VS Code extensions applied in the experiment.
Can Developers Prompt? A Controlled Experiment for Code Documentation Generation

Large language models (LLMs) bear great potential for automating tedious development tasks such as creating and maintaining code documentation. However, it is unclear to what extent developers can effectively prompt LLMs to create concise and useful documentation. We report on a controlled experiment with 20 professionals and 30 computer science students tasked with code documentation generation for two Python functions. The experimental group freely entered ad-hoc prompts in a ChatGPT-like extension of Visual Studio Code, while the control group executed a predefined few-shot prompt. Our results reveal that professionals and students were unaware of or unable to apply prompt engineering techniques. Especially students perceived the documentation produced from ad-hoc prompts as significantly less readable, less concise, and less helpful than documentation from prepared prompts. Some professionals produced higher quality documentation by just including the keyword Docstring in their ad-hoc prompts. While students desired more support in formulating prompts, professionals appreciated the flexibility of ad-hoc prompting. Participants in both groups rarely assessed the output as perfect. Instead, they understood the tools as support to iteratively refine the documentation. Further research is needed to understand which prompting skills and preferences developers have and which support they need for certain tasks.
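
To make the two experimental conditions concrete, the sketch below shows what a predefined few-shot prompt for Python docstring generation could look like; the instruction wording and the examples are invented for illustration and are not the experiment's actual material.

```python
# Illustrative few-shot prompt for docstring generation. The control group in
# the experiment executed a predefined few-shot prompt; this is only a sketch
# of that idea, not the original prompt.
FEW_SHOT_EXAMPLES = '''\
def add(a, b):
    return a + b

Docstring:
"""Return the sum of a and b."""

def is_even(n):
    return n % 2 == 0

Docstring:
"""Return True if n is even, otherwise False."""

'''


def build_docstring_prompt(function_source: str) -> str:
    """Combine the instruction, the few-shot examples, and the target function."""
    return (
        "Write a concise, helpful docstring for the last function.\n\n"
        + FEW_SHOT_EXAMPLES
        + function_source.rstrip()
        + "\n\nDocstring:\n"
    )


# Example usage with an arbitrary target function.
print(build_docstring_prompt(
    "def slugify(text):\n    return '-'.join(text.lower().split())"
))
```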



Multilingual Crowd-Based Requirements Engineering Using Large Language Models

September 2024 · 3 Reads

A central challenge for ensuring the success of software projects is to assure the convergence of developers’ and users’ views. While the availability of large amounts of user data from social media, app store reviews, and support channels bears many benefits, it still remains unclear how software development teams can effectively use this data. We present an LLM-powered approach called DeeperMatcher that helps agile teams use crowd-based requirements engineering (CrowdRE) in their issue and task management. We are currently implementing a command-line tool that enables developers to match issues with relevant user reviews. We validated our approach on an existing English dataset from a well-known open-source project. Additionally, to check how well DeeperMatcher works for other languages, we conducted a single-case mechanism experiment alongside developers of a local project that has issues and user feedback in Brazilian Portuguese. Our preliminary analysis indicates that the accuracy of our approach is highly dependent on the text embedding method used. We discuss further refinements needed for reliable crowd-based requirements engineering with multilingual support.


Getting Inspiration for Feature Elicitation: App Store- vs. LLM-based Approach

August 2024 · 16 Reads · 1 Citation

Over the past decade, app store (AppStore)-inspired requirements elicitation has proven to be highly beneficial. Developers often explore competitors' apps to gather inspiration for new features. With the advance of Generative AI, recent studies have demonstrated the potential of large language model (LLM)-inspired requirements elicitation. LLMs can assist in this process by providing inspiration for new feature ideas. While both approaches are gaining popularity in practice, there is a lack of insight into their differences. We report on a comparative study between AppStore- and LLM-based approaches for refining features into sub-features. By manually analyzing 1,200 sub-features recommended from both approaches, we identified their benefits, challenges, and key differences. While both approaches recommend highly relevant sub-features with clear descriptions, LLMs seem more powerful, particularly concerning novel, unseen app scopes. Moreover, some recommended features are imaginary, with unclear feasibility, which suggests the importance of a human analyst in the elicitation loop.


Figure 1: Core components of the proposed architecture. In a darker shade of gray, we highlight the components that receive data from both user reviews and issues. Every component depicted can be switched or adapted for the needs of specific teams or projects.
Figure 2: Screenshot of matches identified by DeeperMatcher when prompted with a review from Table III of the DeepMatcher proof of concept [7].
Figure 3: Suggested issues for a review requesting a new app screen: Although the issue related to the creation of the new screen is not listed, the first suggested match is a fix for a problem with the existing screen.
Multilingual Crowd-Based Requirements Engineering Using Large Language Models

August 2024 · 26 Reads

A central challenge for ensuring the success of software projects is to assure the convergence of developers' and users' views. While the availability of large amounts of user data from social media, app store reviews, and support channels bears many benefits, it still remains unclear how software development teams can effectively use this data. We present an LLM-powered approach called DeeperMatcher that helps agile teams use crowd-based requirements engineering (CrowdRE) in their issue and task management. We are currently implementing a command-line tool that enables developers to match issues with relevant user reviews. We validated our approach on an existing English dataset from a well-known open-source project. Additionally, to check how well DeeperMatcher works for other languages, we conducted a single-case mechanism experiment alongside developers of a local project that has issues and user feedback in Brazilian Portuguese. Our preliminary analysis indicates that the accuracy of our approach is highly dependent on the text embedding method used. We discuss further refinements needed for reliable crowd-based requirements engineering with multilingual support.
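
As a minimal sketch of the embedding-based matching idea behind DeeperMatcher, the snippet below embeds issue titles and a user review with a sentence-transformers model and ranks the issues by cosine similarity. The multilingual model name and the example texts are assumptions chosen to also cover the Brazilian Portuguese case; they are not necessarily what the tool itself uses.

```python
# Sketch: match a user review to the most similar issues via text embeddings.
from sentence_transformers import SentenceTransformer, util

# Multilingual model chosen for illustration (handles English and Portuguese).
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

issues = [
    "Crash when opening the settings screen",
    "Add dark mode to the reading view",
    "Login fails after password reset",
]
review = "O aplicativo fecha sozinho quando abro as configurações."  # sample review

issue_emb = model.encode(issues, convert_to_tensor=True)
review_emb = model.encode(review, convert_to_tensor=True)

# Rank issues by cosine similarity to the review.
scores = util.cos_sim(review_emb, issue_emb)[0]
for idx in scores.argsort(descending=True):
    i = int(idx)
    print(f"{scores[i].item():.3f}  {issues[i]}")
```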


Can Developers Prompt? A Controlled Experiment for Code Documentation Generation

August 2024 · 13 Reads

Large language models (LLMs) bear great potential for automating tedious development tasks such as creating and maintaining code documentation. However, it is unclear to what extent developers can effectively prompt LLMs to create concise and useful documentation. We report on a controlled experiment with 20 professionals and 30 computer science students tasked with code documentation generation for two Python functions. The experimental group freely entered ad-hoc prompts in a ChatGPT-like extension of Visual Studio Code, while the control group executed a predefined few-shot prompt. Our results reveal that professionals and students were unaware of or unable to apply prompt engineering techniques. Especially students perceived the documentation produced from ad-hoc prompts as significantly less readable, less concise, and less helpful than documentation from prepared prompts. Some professionals produced higher quality documentation by just including the keyword Docstring in their ad-hoc prompts. While students desired more support in formulating prompts, professionals appreciated the flexibility of ad-hoc prompting. Participants in both groups rarely assessed the output as perfect. Instead, they understood the tools as support to iteratively refine the documentation. Further research is needed to understand which prompting skills and preferences developers have and which support they need for certain tasks.


On the Automated Processing of User Feedback

July 2024 · 33 Reads · 3 Citations

User feedback is becoming an increasingly important source of information for requirements engineering, user interface design, and software engineering in general. Nowadays, user feedback is largely available and easily accessible in social media, product forums, or app stores. Over the last decade, research has shown that user feedback can help software teams: a) better understand how users are actually using specific product features and components, b) identify, reproduce, and fix defects faster, and c) get inspiration for improvements or new features. However, to tap the full potential of feedback, there are two main challenges that need to be solved. First, software vendors must cope with a large quantity of feedback data, which is hard to manage manually. Second, vendors must also cope with a varying quality of feedback, as some items might be uninformative, repetitive, or simply wrong. This chapter summarises and pipelines various data mining, machine learning, and natural language processing techniques, including recent Large Language Models, to cope with the quantity and quality challenges. We guide researchers and practitioners through implementing effective, actionable analysis of user feedback for software and requirements engineering.
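
One classic building block of such a feedback-processing pipeline is supervised classification of feedback items into categories such as bug report, feature request, or other. The sketch below shows a minimal TF-IDF plus logistic regression classifier in scikit-learn; the tiny training set is invented for illustration only, and real pipelines would use far larger labelled datasets or LLM-based classifiers.

```python
# Minimal feedback classification step: TF-IDF features + logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Invented toy training data; real pipelines need much larger labelled sets.
train_texts = [
    "The app crashes every time I upload a photo",
    "Please add an option to export my data as CSV",
    "Great app, five stars!",
    "Login button does nothing after the last update",
    "Would love a dark mode for night reading",
    "Love it, thanks for the quick updates",
]
train_labels = ["bug", "feature", "other", "bug", "feature", "other"]

classifier = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
    ("clf", LogisticRegression(max_iter=1000)),
])
classifier.fit(train_texts, train_labels)

new_feedback = ["The share button freezes the app", "Add support for Apple Watch"]
print(list(zip(new_feedback, classifier.predict(new_feedback))))
```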


Citations (55)


... The app introduction images on Google Play are a gold mine for the inspiration of app design, particularly the Graphical User Interface (GUI), as they are carefully selected by app developers to represent the important features of the apps. Recent research mines the app introduction images and has proposed GUI search engines, such as Gallery D.C. [8,15] and GUing [52], to facilitate the search of existing app UI designs using textual queries. Recently, Wei et al. discussed how LLM-Inspiration can be combined with GUI-Mining with the app designer in the loop [49]. ...

Reference:

Getting Inspiration for Feature Elicitation: App Store- vs. LLM-based Approach
GUing: A Mobile GUI Search Engine using a Vision-Language Model
  • Citing Article
  • November 2024

ACM Transactions on Software Engineering and Methodology

... Another guiding principle of our proposed approach is to support the requirements engineering process while maintaining a human in the loop [1]. While leveraging LLMs, we aim to mitigate the risk of fully relying on the inherently imperfect nature of artificial intelligence predictions in the software development process. ...

Design Patterns for Machine Learning Based Systems with Human-in-the-Loop
  • Citing Article
  • January 2023

IEEE Software

... This gets reflected in the diversity of feedback and submitting users. Demographic attributes like primary language, experience, culture, communication channels, and even personality traits [6] might have an impact on the feedback content and linguistic properties as recent research shows [25,20]. Therefore, whenever possible, such metadata about submitting users should be collected or extracted from the text. ...

Exploring the Relationship Between Personality Traits and User Feedback
  • Citing Conference Paper
  • September 2023

... Similarly, researchers [31,32,36,42] also conducted semi-structured interviews for requirements elicitation. Additionally, there are studies [47][48][49][52] concentrated on requirements elicitation that prioritize stakeholders' perspectives and needs. ...

Tailoring Requirements Engineering for Responsible AI
  • Citing Article
  • April 2023

Computer

... Empirical research on software documentation quality is an active field that focuses on various artifacts, like API reference documentation [40] or README files [68], and the perspectives of documentation writers [1]. Studies on the evaluation of AI-generated documentation usually focus on automated metrics like BLEU, ROUGE, and METEOR [29], [64], [69]- [71]. ...

An Exploratory Study of Documentation Strategies for Product Features in Popular GitHub Projects

... For example, Nicholson and Guo [33] encoded both the textual data (title and description, using TF-IDF or fastText embeddings trained on Wikipedia, StackOverflow, or project-specific documents) and the metadata (issue type, reporter identifier and assignee identifier using one-hot encoding). Lüders et al. [20], on the other hand, chose to encode issues only using their title and description (through a general BERT model), considering those are universal features across different issue trackers. Then, a pair of artifacts can be represented as a combination of the features of individual artifacts, with optionally other features on the relations of the two involved artifacts (such as differences in creation time between the two issues). ...

Automated Detection of Typed Links in Issue Trackers
  • Citing Conference Paper
  • August 2022

... into a query or an issue report, to derive an appropriate reaction strategy. A bot would also immediately engage with the user to collect missing details thus improving the feedback quality whenever needed [48,88]. In case of an inquiry, e.g., on how to perform a certain task with the app or how to configure the system for enabling or disabling certain features, the bot can try to answer directly or delegate to the support teams if not possible or not satisfactory. ...

A Chatbot for the Elicitation of Contextual Information from User Feedback
  • Citing Conference Paper
  • August 2022

... Moreover, having a skilled customer or a product owner write good user stories is often far from reality for many software projects. It is thus common to see teams just keeping track of the tasks to be implemented and issues to be resolved [17,18]. The convenience of focusing on recording development and maintenance tasks comes with the risk that the development tasks and the actual needs of the users may drift apart. ...

An alternative issue tracking dataset of public Jira repositories
  • Citing Conference Paper
  • October 2022

... A simple solution is to choose one pair in which the two responses signify a notable disparity in terms of the reward score (Pang et al., 2024). Yet, it cannot bypass the confirmation bias problem (Andersen and Maalej, 2022; Rizve et al., 2021; Wang et al., 2021) in the self-training-like paradigm. ...

Efficient, Uncertainty-based Moderation of Neural Networks Text Classifiers
  • Citing Conference Paper
  • January 2022