Preprint

Getting Inspiration for Feature Elicitation: App Store- vs. LLM-based Approach

Affiliation:
  • EuroMov Digital Health in Motion, IMT Mines Alès, Univ. Montpellier

Abstract

Over the past decade, app store (AppStore)-inspired requirements elicitation has proven highly beneficial: developers often explore competitors' apps to gather inspiration for new features. With the advance of Generative AI, recent studies have demonstrated the potential of large language model (LLM)-inspired requirements elicitation, in which LLMs assist the process by suggesting new feature ideas. While both approaches are gaining popularity in practice, there is little insight into their differences. We report on a comparative study between AppStore- and LLM-based approaches for refining features into sub-features. By manually analyzing 1,200 sub-features recommended by both approaches, we identified their benefits, challenges, and key differences. While both approaches recommend highly relevant sub-features with clear descriptions, LLMs seem more powerful, particularly for novel, unseen app scopes. Moreover, some recommended features are imaginary, with unclear feasibility, which underlines the importance of a human analyst in the elicitation loop.
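To make the LLM-based side of the comparison concrete, here is a minimal sketch of prompting a chat model to refine a feature into sub-features. The prompt wording, model choice, and helper name are illustrative assumptions, not the authors' protocol; it assumes the OpenAI Python client (>= 1.0) with an API key in the environment.

    # Minimal sketch of LLM-based sub-feature elicitation (illustrative, not
    # the authors' exact protocol). Requires: pip install openai
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def suggest_subfeatures(app_description: str, feature: str, n: int = 10) -> str:
        # hypothetical helper: refine one feature into n candidate sub-features
        prompt = (
            f"App: {app_description}\n"
            f"Feature: {feature}\n"
            f"List {n} sub-features that refine this feature. For each, give a "
            "short name and a one-sentence description."
        )
        response = client.chat.completions.create(
            model="gpt-4",  # illustrative model choice
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    print(suggest_subfeatures("A sleep-tracking app", "sleep statistics"))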


Article
Graphical User Interfaces (GUIs) are central to app development projects. App developers may use the GUIs of other apps as a means of requirements refinement and rapid prototyping or as a source of inspiration for designing and improving their own apps. Recent research has thus suggested retrieving relevant GUI designs that match a certain text query from screenshot datasets acquired through crowdsourced or automated exploration of GUIs. However, such text-to-GUI retrieval approaches only leverage the textual information of the GUI elements, neglecting visual information such as icons or background images. In addition, retrieved screenshots are not steered by app developers and lack app features that require particular input data. To overcome these limitations, this paper proposes GUing, a GUI search engine based on a vision-language model called GUIClip, which we trained specifically for the problem of designing app GUIs. For this, we first collected app introduction images from Google Play, which display the most representative screenshots and are often captioned (i.e. labelled) by app vendors. Then, we developed an automated pipeline to classify, crop, and extract the captions from these images. This resulted in a large dataset, which we share with this paper, of 303k app screenshots, 135k of which have captions. We used this dataset to train a novel vision-language model, which is, to the best of our knowledge, the first of its kind for GUI retrieval. We evaluated our approach on various datasets from related work and in a manual experiment. The results demonstrate that our model outperforms previous approaches in text-to-GUI retrieval, achieving a Recall@10 of up to 0.69 and a HIT@10 of 0.91. We also explored the performance of GUIClip for other GUI tasks, including GUI classification and sketch-to-GUI retrieval, with encouraging results.
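The GUIClip weights are specific to the paper, but the retrieval mechanics can be sketched with a generic CLIP checkpoint: embed screenshots and a text query into the same space and rank by cosine similarity. The file names and the clip-ViT-B-32 model choice below are assumptions for illustration, not the paper's setup.

    # Text-to-GUI retrieval sketch with a generic CLIP model (not the paper's
    # GUIClip weights). Requires: pip install sentence-transformers pillow
    from PIL import Image
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("clip-ViT-B-32")

    paths = ["login.png", "checkout.png", "sleep_stats.png"]  # hypothetical screenshots
    image_embs = model.encode([Image.open(p) for p in paths], convert_to_tensor=True)

    query_emb = model.encode("screen showing sleep statistics", convert_to_tensor=True)
    scores = util.cos_sim(query_emb, image_embs)[0]  # cosine similarity per screenshot
    ranked = sorted(zip(paths, scores.tolist()), key=lambda x: -x[1])
    print(ranked[:10])  # top-10 list, the basis of Recall@10 / HIT@10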
Conference Paper
Full-text available
The emergence of large language models and conversational front-ends such as ChatGPT is revolutionizing many software engineering activities. The extent to which such technologies can help with requirements engineering activities, especially the ones surrounding modeling, however, remains to be seen. This paper reports on early experimental results on the potential use of GPT-4 in the latter context, with a focus on the development of goal-oriented models. We first explore GPT-4's current knowledge and mastering of a specific modeling language, namely the Goal-oriented Requirement Language (GRL). We then use four combinations of prompts – with and without a proposed textual syntax, and with and without contextual domain knowledge – to guide the creation of GRL models for two case studies. The first case study focuses on a well-documented topic in the goal modeling community (Kids Help Phone), whereas the second one explores a context for which, to our knowledge, no public goal models currently exist (Social Housing). We explore the interactive construction of a goal model through specific follow-up prompts aimed at fixing model issues and expanding the model content. Our results suggest that GPT-4 possesses considerable knowledge of goal modeling, and although many elements generated by GPT-4 are generic, reflect what is already in the prompt, or are even incorrect, there is value in getting exposed to the generated concepts, many of which are non-obvious to stakeholders outside the domain. Furthermore, aggregating results from multiple runs yields a far better outcome than any individual run.
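The 2x2 prompt design (with/without a textual syntax, with/without domain context) is easy to mechanize. A rough sketch follows; the syntax snippet, context sentence, and prompt wording are invented placeholders, not the paper's prompts.

    # Sketch of the four prompt combinations used to elicit GRL models.
    # Requires: pip install openai (and an OPENAI_API_KEY in the environment)
    from itertools import product
    from openai import OpenAI

    client = OpenAI()
    SYNTAX = "Use this textual GRL syntax: actor <name> { goal <g>; task <t>; }"  # placeholder
    CONTEXT = "Domain context: a social housing agency matching applicants to units."  # placeholder

    for with_syntax, with_context in product([False, True], repeat=2):
        parts = ["Create a goal-oriented (GRL) model for a social housing service."]
        if with_syntax:
            parts.append(SYNTAX)
        if with_context:
            parts.append(CONTEXT)
        reply = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": "\n".join(parts)}],
        )
        print(with_syntax, with_context, reply.choices[0].message.content[:200])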
Conference Paper
Full-text available
Recent work has recognized the importance of developing and deploying software systems that reflect human values and has explored different approaches for eliciting these values from stakeholders. However, prior studies have also shown that it can be challenging for stakeholders to specify a diverse set of product-related human values. In this paper we therefore explore the use of ChatGPT for generating user stories that describe candidate human values. These generated stories provide inspiration for stakeholder discussions and enrich the human-created user stories. We engineer a series of ChatGPT prompts to retrieve a list of common stakeholders and candidate features for a targeted product, and then, for each pairwise combination of role and feature, and for each individual Schwartz value, we issue an additional prompt to generate a candidate user story reflecting that value. We present the candidate user stories to stakeholders and, as part of a creative requirements engineering session, ask them to assess and prioritize the generated user stories, and then use them as inspiration for discussing and specifying their own product-related human values. Through a series of focus groups we compare the human values created by stakeholders with and without the benefit of the ChatGPT examples. Results are evaluated with respect to coverage of values, clarity of expression, internal completeness, and feedback from our participants. Our analysis shows that the ChatGPT-generated user stories provide creativity triggers that help stakeholders specify human values for a product.
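The prompt fan-out over (stakeholder role, feature, Schwartz value) triples can be sketched as below; the role and feature lists, the product, and the story template are illustrative stand-ins for the prompt-engineered ones in the paper.

    # Sketch of generating one user-story prompt per (role, feature, Schwartz
    # value) combination. All lists and wording are illustrative placeholders.
    from itertools import product

    roles = ["parent", "teacher"]                        # from a stakeholder prompt
    features = ["content filtering", "usage reports"]    # from a feature prompt
    schwartz_values = ["security", "benevolence", "self-direction"]

    def story_prompt(role: str, feature: str, value: str) -> str:
        return (
            f"Write one user story for a parental-control app. Role: {role}. "
            f"Feature: {feature}. The story must reflect the Schwartz value "
            f"'{value}'. Format: As a <role>, I want <goal> so that <benefit>."
        )

    prompts = [story_prompt(r, f, v)
               for r, f, v in product(roles, features, schwartz_values)]
    print(len(prompts), "prompts;", prompts[0])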
Article
Full-text available
Understanding users’ needs is crucial to building and maintaining high-quality software. Online software user feedback has been shown to contain large amounts of information useful to requirements engineering (RE). Previous studies have created machine learning classifiers for parsing this feedback for development insight. While these classifiers report generally good performance when evaluated on a test set, questions remain as to how well they extend to unseen data in various forms. This study evaluates machine learning classifiers’ performance on feedback for two common classification tasks (classifying bug reports and feature requests). Using seven datasets from prior research studies, we investigate the performance of classifiers when evaluated on feedback from different apps than those contained in the training set and when evaluated on completely different datasets (coming from different feedback channels and/or labelled by different researchers). We also measure the difference in performance of using channel-specific metadata as a feature in classification. We find that using metadata as features in classifying bug reports and feature requests does not lead to a statistically significant improvement in the majority of datasets tested. We also demonstrate that classification performance is similar on feedback from unseen apps compared to seen apps in the majority of cases tested. However, the classifiers evaluated do not perform well on unseen datasets. We show that multi-dataset training or zero-shot classification approaches can somewhat mitigate this performance decrease. We discuss the implications of these results on developing user feedback classification models to analyse and extract software requirements.
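A minimal sketch of the unseen-app setting the study evaluates: train a bug-report classifier on reviews from some apps and test on reviews from an app absent from training. The tiny toy dataset is illustrative only.

    # Unseen-app evaluation sketch for user-feedback classification.
    # Requires: pip install scikit-learn
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score

    # (review text, is_bug_report, app) -- toy examples
    train = [("app crashes on start", 1, "AppA"), ("love this app", 0, "AppA"),
             ("freezes in settings", 1, "AppB"), ("great design", 0, "AppB")]
    test = [("keeps crashing after update", 1, "AppC"), ("five stars", 0, "AppC")]

    vec = TfidfVectorizer(ngram_range=(1, 2))
    X_train = vec.fit_transform([text for text, _, _ in train])
    X_test = vec.transform([text for text, _, _ in test])
    clf = LogisticRegression().fit(X_train, [y for _, y, _ in train])
    print("F1 on unseen app:", f1_score([y for _, y, _ in test], clf.predict(X_test)))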
Preprint
Full-text available
This work concerns data-driven requirements engineering, and in particular the analysis of users' reviews. These online reviews are a rich source of information for extracting new needs and improvement requests. In this work, we provide an automated analysis using CamemBERT, a state-of-the-art language model for French. We created a multi-label classification dataset of 6000 user reviews from three applications in the Health & Fitness field. The results are encouraging and suggest that it is possible to automatically identify reviews containing requests for new features. The dataset is available at: https://github.com/Jl-wei/APIA2022-French-user-reviews-classification-dataset.
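A minimal sketch of multi-label review classification with CamemBERT via Hugging Face transformers. The label set below is a made-up example and the checkpoint is the untuned base model; the study fine-tunes on its 6000 labelled reviews.

    # Multi-label classification sketch with CamemBERT (untuned base model).
    # Requires: pip install transformers torch
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    labels = ["feature_request", "bug_report", "other"]  # illustrative label set
    tok = AutoTokenizer.from_pretrained("camembert-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "camembert-base",
        num_labels=len(labels),
        problem_type="multi_label_classification",  # sigmoid + BCE loss when training
    )

    review = "Ce serait bien de pouvoir exporter mes données d'entraînement."
    inputs = tok(review, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.sigmoid(model(**inputs).logits)[0]  # one score per label
    print(dict(zip(labels, probs.tolist())))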
Article
Full-text available
Research has repeatedly shown that high-quality requirements are essential for the success of development projects. While the term “quality” is pervasive in the field of requirements engineering and while the body of research on requirements quality is large, there is no meta-study of the field that overviews and compares the concrete quality attributes addressed by the community. To fill this knowledge gap, we conducted a systematic mapping study of the scientific literature. We retrieved 6905 articles from six academic databases, which we filtered down to 105 relevant primary studies. The primary studies use empirical research to explicitly define, improve, or evaluate requirements quality. We found that empirical research on requirements quality focuses on improvement techniques, with very few primary studies addressing evidence-based definitions and evaluations of quality attributes. Among the 12 quality attributes identified, the most prominent in the field are ambiguity, completeness, consistency, and correctness. We identified 111 sub-types of quality attributes such as “template conformance” for consistency or “passive voice” for ambiguity. Ambiguity has the largest share of these sub-types. The artefacts being studied are mostly referred to in the broadest sense as “requirements”, while little research targets quality attributes in specific types of requirements such as use cases or user stories. Our findings highlight the need to conduct more empirically grounded research defining requirements quality, using more varied research methods, and addressing a more diverse set of requirements types.
Article
Full-text available
Given the competitive mobile app market, developers must be fully aware of users’ needs, satisfy users’ requirements, combat apps with similar functionalities (i.e., competing apps), and thus stay ahead of the competition. While it is easy to track the overall user ratings of competing apps, such information fails to provide actionable insights for developers to improve their apps over the competing apps (AlSubaihin et al., IEEE Trans Softw Eng, 1–1, 2019). Thus, developers still need to read reviews from all the competing apps they are interested in and summarize the advantages and disadvantages of each app. Such a manual process can be tedious and even infeasible with thousands of reviews posted daily. To help developers compare users’ opinions among competing apps on high-level features, such as the main functionalities and the main characteristics of an app, we propose a review analysis approach named FeatCompare. FeatCompare can automatically identify high-level features mentioned in user reviews without any manually annotated resource. Then, FeatCompare creates a comparative table that summarizes users’ opinions for each identified feature across competing apps. FeatCompare features a novel neural network-based model named Global-Local sensitive Feature Extractor (GLFE), which extends Attention-based Aspect Extraction (ABAE), a state-of-the-art model for extracting high-level features from reviews. We evaluate the effectiveness of GLFE on 480 manually annotated reviews sampled from five groups of competing apps. Our experiment results show that GLFE achieves a precision of 79%-82% and recall of 74%-77% in identifying the high-level features associated with reviews and outperforms ABAE by 14.7% on average. We also conduct a case study to demonstrate the usage scenarios of FeatCompare. A survey with 107 mobile app developers shows that more than 70% of developers agree that FeatCompare is of great benefit.
Article
Full-text available
Context: Code-free software similarity detection techniques have been used to support different software engineering tasks, including clustering mobile applications (apps). The way similarity is measured may affect both the efficiency and quality of clustering solutions. However, there has been no previous comparative study of feature extraction methods used to guide mobile app clustering. Objective: In this paper, we investigate different techniques to compute the similarity of apps based on their textual descriptions and evaluate their effectiveness using hierarchical agglomerative clustering. Method: To this end we carry out an empirical study comparing five different techniques, based on topic modelling and keyword feature extraction, to cluster 12,664 apps randomly sampled from the Google Play App Store. The comparison is based on three main criteria: silhouette width measure, human judgement, and execution time. Results: The results of our study show that topic modelling, collocation-based, and dependency-based feature extraction perform similarly in detecting app-feature similarity. However, dependency-based feature extraction performs better than any other in finding application domain similarity (ρ = 0.7, p-value < 0.01). Conclusions: The current categorisation in the studied app store does not exhibit a good classification quality in terms of the claimed feature space. However, better quality can be achieved using a good feature extraction technique and a traditional clustering method.
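One of the compared pipelines (keyword features plus hierarchical agglomerative clustering, checked with silhouette width) can be sketched as follows; the four toy descriptions and the cluster count are illustrative, and the metric= argument needs scikit-learn >= 1.2 (older versions use affinity=).

    # Keyword-feature clustering sketch: TF-IDF, agglomerative clustering,
    # silhouette width. Requires: pip install scikit-learn (>= 1.2)
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import silhouette_score

    descriptions = [
        "track your runs and calories",
        "log workouts and fitness goals",
        "scan receipts and track expenses",
        "budget planner with expense reports",
    ]
    X = TfidfVectorizer(stop_words="english").fit_transform(descriptions).toarray()
    labels = AgglomerativeClustering(n_clusters=2, metric="cosine",
                                     linkage="average").fit_predict(X)
    print(labels, silhouette_score(X, labels, metric="cosine"))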
Chapter
Full-text available
[Context & Motivation] App store reviews are a rich source for analysts to elicit requirements from user feedback, for they describe bugs to be fixed, requested features, and possible improvements. Product development teams need new techniques that help them make real-time decisions based on user feedback. [Question/Problem] Researchers have proposed natural language processing (NLP) techniques for extracting and organizing requirements-relevant knowledge from the reviews for one specific app. However, no attention has been paid to studying whether and how requirements can be identified from competing products. [Principal ideas/results] We propose RE-SWOT, a tool-supported method for eliciting requirements from app store reviews through competitor analysis. RE-SWOT combines NLP algorithms with information visualization techniques. We evaluate the usefulness of RE-SWOT with expert product managers from three mobile app companies. [Contribution] Our preliminary results show that competitor analysis is a promising path for research that has direct impact on the requirements engineering practice in modern app development companies.
Article
Full-text available
As the number of Android applications (apps) increases dramatically, users face a serious problem in finding apps relevant to their needs. There is therefore an important demand for app search engines and recommendation services, for which developing an accurate similarity method is a challenging issue. Contrary to malware detection, far fewer efforts have been devoted to the similarity computation of apps. Furthermore, all the existing methods use features obtained only from app stores, such as the description and rating, which can be inaccurate, vary across stores, and be affected by language barriers; they totally neglect useful information that clearly captures the app’s functionalities and behaviors and can be mined from the apps themselves, such as API calls and manifest information. In this paper, we propose an effective method called SimAndro to compute the similarity of apps, which extracts features based only on information obtained from the apps themselves and the Android platform, without using information from third-party sources such as app stores. SimAndro performs both feature extraction and similarity computation, where the API calls, manifest information, package name, and strings are used as features. To compute the similarity score of an app pair, a separate similarity score is computed based on each feature, and a weighted linear combination of these four scores is regarded as the final similarity score, with an automatic weighting scheme based on TreeRankSVM. The results of extensive experiments with three real-world datasets and a dataset constructed by human experts demonstrate the effectiveness of SimAndro.
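The scoring scheme is a weighted linear combination of per-feature similarities. Below is a rough sketch using Jaccard similarity over three of the feature sets; the weights are fixed illustrative values, whereas the paper learns them with TreeRankSVM.

    # SimAndro-style scoring sketch: per-feature Jaccard similarities combined
    # by a weighted linear sum (weights are illustrative, not learned).
    def jaccard(a: set, b: set) -> float:
        return len(a & b) / len(a | b) if a | b else 0.0

    app1 = {"api": {"Camera.open", "Location.get"}, "perm": {"CAMERA", "GPS"},
            "strings": {"photo", "map"}}
    app2 = {"api": {"Camera.open", "Audio.record"}, "perm": {"CAMERA", "MIC"},
            "strings": {"photo", "sound"}}

    weights = {"api": 0.4, "perm": 0.3, "strings": 0.3}
    score = sum(w * jaccard(app1[k], app2[k]) for k, w in weights.items())
    print(f"similarity = {score:.2f}")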
Article
Full-text available
Apps are distributed through centralized sales platforms, where products similar to a new app under development often already exist. The main features of these products are given in their introductions, which provides an important resource for developers to improve the quality of the requirements of their own app. In this paper, we propose an approach to mine and recommend requirements-related information from app descriptions to help developers use this resource efficiently. Firstly, we construct a model by mining domain knowledge from app descriptions with the method proposed in our previous work, and use initial requirements to retrieve related information from the model. Then, we analyze the information and recommend it from three aspects: static information of existing apps, for identifying the priorities of requirements; functional features and non-functional properties of features, for giving a detailed design of the apps in the requirements; and combinations of features, for enriching the requirements. To validate the proposed approach, we conducted experiments and a survey based on data from Google Play. The results show that our approach can reasonably identify existing products matching initial requirements, and also indicate that developers confirm the usefulness of the recommended information in practice.
Article
Full-text available
App features are one of the most important factors that people consider when choosing apps. In order to satisfy users’ needs and attract their attention, deciding what features should be added in the next release becomes very important. Unlike traditional requirements elicitation, app stores provide a new platform for developers to gather requirements and perform market-wide analysis. Considering that the software features provided to users can be found by exploring existing apps, an important way to elicit requirements is to analyze the existing features of products that offer related functions and then promptly identify new trends and fashions. In this context, we propose a data-driven approach for recommending software features of mobile applications based on user interface comparison. Our approach mines similar user interfaces (UIs) from a publicly available online repository. To calculate UI similarity through the best matches of the components of two UIs, text similarity is used to measure the similarity of UI components, and a genetic algorithm is introduced to improve the comparison efficiency. Then, we develop an algorithm to extract features from similar UIs based on a set of identification rules. These features are further clustered with a text-similarity algorithm and finally recommended to developers. The approach is empirically validated with 44 features from 10 UIs. The experiment results indicate that our recommended features are valuable for requirements elicitation.
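The core matching step (pairing the components of two UIs by text similarity) can be sketched without the paper's genetic algorithm: for small UIs, an optimal assignment via the Hungarian algorithm is a simple stand-in. The component labels below are invented.

    # UI-component matching sketch: text similarity + optimal assignment
    # (Hungarian algorithm as a stand-in for the paper's genetic algorithm).
    # Requires: pip install scipy
    from difflib import SequenceMatcher
    from scipy.optimize import linear_sum_assignment

    ui_a = ["Sign in", "Forgot password?", "Create account"]
    ui_b = ["Log in", "Reset password", "Register", "Help"]

    # Cost = 1 - similarity, so minimizing cost maximizes total similarity.
    cost = [[1 - SequenceMatcher(None, a.lower(), b.lower()).ratio() for b in ui_b]
            for a in ui_a]
    rows, cols = linear_sum_assignment(cost)
    for r, c in zip(rows, cols):
        print(f"{ui_a[r]!r} <-> {ui_b[c]!r}  sim={1 - cost[r][c]:.2f}")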
Article
Full-text available
Given the increasing competition in mobile-app ecosystems, improving the user experience has become a major goal for app vendors. App Store 2.0 will exploit crowdsourced information about apps, devices, and users to increase the overall quality of the delivered mobile apps. App Store 2.0 generates different kinds of actionable feedback from the crowd information. This feedback helps developers deal with potential errors that could affect their apps before publication or even when the apps are in the users' hands. The App Store 2.0 vision has been transformed into a concrete implementation for Android devices. This article is part of a special issue on Crowdsourcing for Software Engineering.
Article
Full-text available
App stores like Google Play and the Apple AppStore have over 3 million apps covering nearly every kind of software and service. Billions of users regularly download, use, and review these apps. Recent studies have shown that reviews written by users represent a rich source of information for app vendors and developers, as they include information about bugs, ideas for new features, or documentation of released features. The majority of the reviews, however, are rather non-informative, just praising the app and repeating the star rating in words. This paper introduces several probabilistic techniques to classify app reviews into four types: bug reports, feature requests, user experiences, and text ratings. For this, we use review metadata such as the star rating and the tense, as well as text classification, natural language processing, and sentiment analysis techniques. We conducted a series of experiments to compare the accuracy of the techniques and compared them with simple string matching. We found that metadata alone results in a poor classification accuracy. When combined with simple text classification and natural language preprocessing of the text—particularly with bigrams and lemmatization—the classification precision for all review types reached 88–92% and the recall 90–99%. Multiple binary classifiers outperformed single multiclass classifiers. Our results inspired the design of a review analytics tool, which should help app vendors and developers deal with the large amount of reviews, filter critical reviews, and assign them to the appropriate stakeholders. We describe the tool’s main features and summarize nine interviews with practitioners on how review analytics tools, including ours, could be used in practice.
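The winning configuration (one binary classifier per review type, bigram text features plus review metadata) can be sketched as below with toy data; lemmatization, e.g. via NLTK or spaCy, is omitted for brevity.

    # One-binary-classifier-per-type sketch: bigram counts + star-rating
    # metadata, shown here for the "feature request" type only.
    # Requires: pip install scikit-learn scipy numpy
    import numpy as np
    from scipy.sparse import hstack
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    reviews = ["app crashes constantly", "please add dark mode",
               "love it", "would be great to export data"]
    stars = np.array([[1], [4], [5], [4]])     # review metadata
    is_feature_request = [0, 1, 0, 1]          # binary target for this type

    text = CountVectorizer(ngram_range=(1, 2)).fit_transform(reviews)
    X = hstack([text, stars])                  # combine text and metadata
    clf = MultinomialNB().fit(X, is_feature_request)
    print(clf.predict(X))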
Article
Full-text available
App stores allow users to submit feedback for downloaded apps in the form of star ratings and text reviews. Recent studies analyzed this feedback and found that it includes information useful for app developers, such as user requirements, ideas for improvements, user sentiments about specific features, and descriptions of experiences with these features. However, for many apps, the amount of reviews is too large to be processed manually, and their quality varies largely. Star ratings are given to the whole app, and developers lack a means to analyze the feedback for single features. In this paper we propose an automated approach that helps developers filter, aggregate, and analyze user reviews. We use natural language processing techniques to identify fine-grained app features in the reviews. We then extract the user sentiments about the identified features and give them a general score across all reviews. Finally, we use topic modeling techniques to group fine-grained features into more meaningful high-level features. We evaluated our approach with 7 apps from the Apple App Store and Google Play Store and compared its results with a manual, peer-conducted analysis of the reviews. On average, our approach has a precision of 0.59 and a recall of 0.51. The extracted features were coherent and relevant to requirements evolution tasks. Our approach can help app developers systematically analyze user opinions about single features and filter irrelevant reviews.
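The sentiment-per-feature idea can be sketched with an off-the-shelf sentiment scorer: collect the reviews mentioning a feature phrase and average their polarity. VADER and the hard-coded feature string are stand-ins for the paper's NLP pipeline.

    # Feature-level sentiment sketch: average VADER polarity over the reviews
    # that mention a given feature phrase (feature extraction itself is
    # automated in the paper).
    # Requires: pip install nltk; then nltk.download("vader_lexicon")
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    reviews = [
        "The offline mode is fantastic and saves my data.",
        "Offline mode keeps crashing, really annoying.",
    ]
    sia = SentimentIntensityAnalyzer()
    feature = "offline mode"  # stand-in for an automatically extracted feature
    scores = [sia.polarity_scores(r)["compound"]
              for r in reviews if feature in r.lower()]
    print(feature, sum(scores) / len(scores))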
Conference Paper
Full-text available
Application distribution platforms - or app stores - such as Google Play or the Apple AppStore allow users to submit feedback on downloaded applications in the form of ratings and reviews. In the last few years, these platforms have become very popular with both application developers and users. However, their real potential for and impact on requirements engineering processes are not yet well understood. This paper reports on an exploratory study, which analyzes over one million reviews from the Apple AppStore. We investigated how and when users provide feedback, inspected the feedback content, and analyzed its impact on the user community. We found that most of the feedback is provided shortly after new releases, with a quickly decreasing frequency over time. Reviews typically contain multiple topics, such as user experience, bug reports, and feature requests. The quality and constructiveness vary widely, from helpful advice and innovative ideas to insulting offenses. Feedback content has an impact on download numbers: positive messages usually lead to better ratings and vice versa. Negative feedback such as shortcomings is typically destructive and misses context details and user experience. We discuss our findings and their impact on software and requirements engineering teams.
Conference Paper
Full-text available
Many different types of models are used in various scientific and engineering fields, reflecting the subject matter and the kinds of understanding that are sought in each field. Conceptual modeling techniques in software and information systems engineering have in the past focused mainly on describing and analyzing behaviours and structures that are implementable in software. As software systems become ever more complex and densely intertwined with the human social environment, we need models that reflect the social characteristics of complex systems. This chapter reviews the approach taken by the i* framework, highlights its application in several areas, and outlines some open research issues.
Article
The development and deployment of systems using supervised machine learning (ML) remain challenging, mainly due to the limited reliability of prediction models and the lack of knowledge on how to effectively integrate human intelligence into automated decision-making. Human involvement in the ML process is a promising and powerful paradigm for overcoming the limitations of purely automated predictions and improving the applicability of ML in practice. We compile a catalog of design patterns to guide developers in selecting and implementing suitable human-in-the-loop (HiL) solutions. Our catalog takes into consideration key requirements such as the cost of human involvement and model retraining. It includes four training patterns, four operation patterns, and two orthogonal cooperation patterns.
Article
To attract and retain users, deciding what features should be added in the next release of an app is crucial. Unlike traditional software, app markets offer rich data resources for market-wide analysis. Considering that capturing the key features an app lacks compared with similar products, and making up for them, can enhance its competitiveness, we propose a method to establish feature relationships at the level of UI pages and recommend missing key features for the pages of apps based on these relationships. Firstly, we utilize a UI testing tool to collect UI pages for the apps in the repository and present a method to extract feature information from them. Then, we identify products similar to the analyzed app based on a topic modeling technique. Finally, we establish relationships between features by analyzing the UI pages of the analyzed app as well as its similar products, and identify suitable features to recommend for the UI pages of the analyzed app based on these relationships. An experiment based on Google Play shows that our method can effectively recommend features for apps at the level of UI pages.
Article
With the rapid increase in the number of apps, users' requirements have also become extremely complex. Developers have to continuously acquire innovative requirements that provide guidance for developing more competitive products. However, traditional methods of acquiring requirements are not suitable for app development because they cannot interact with users directly. Besides, methods that use text and data analysis to acquire requirements automatically struggle to inspire innovative products because they are often confined to a specific app or domain. Therefore, to attract more new users, developers try to find new, portable inspiration from other domains to enrich the functions of their app. In this article, we propose a feature extraction method based on the descriptions of apps and use similarity matching to acquire cross-domain requirements. Our experiments verify that the precision, recall, and F-measure of our feature extraction method all reach 80%. Besides, the recommended requirements list also performs well in terms of reusability, with an average Reuse Rank of 59.33% and average Adjusted Functional Points of 7.49; adaptability obtains an average score of 3.3, and operability an average score of 3.
Article
Online communities like Dribbble and GraphicBurger allow GUI designers to share their design artwork and learn from each other. These design sharing platforms are important sources of design inspiration, but our survey with GUI designers suggests additional information needs unmet by existing design sharing platforms. First, designers need to see the practical use of certain GUI designs in real applications, rather than just artworks. Second, designers want to see not only the overall designs but also the detailed design of the GUI components. Third, designers need advanced GUI design search abilities (e.g., multi-facet search) and knowledge discovery support (e.g., demographic investigation, cross-company design comparison). This paper presents Gallery D.C. (http://mui-collection.herokuapp.com/), a gallery of GUI design components that harnesses GUI designs crawled from millions of real-world applications using reverse-engineering and computer vision techniques. Through a process of invisible crowdsourcing, Gallery D.C. supports novel ways for designers to collect, analyze, search, summarize, and compare GUI designs on a massive scale. We quantitatively evaluate the quality of Gallery D.C. and demonstrate that it offers additional support for design sharing and knowledge discovery beyond existing platforms.
Article
The rapidly evolving mobile applications (apps) have brought great demand for developers to identify new features by inspecting the descriptions of similar apps and acquire missing features for their apps. Unfortunately, due to the huge number of apps, this manual process is time-consuming and unscalable. To help developers identify new features, we propose a new approach named SAFER. In this study, we first develop a tool to automatically extract features from app descriptions. Then, given an app, we leverage the topic model to identify its similar apps based on the extracted features and API names of apps. Finally, we design a feature recommendation algorithm to aggregate and recommend the features of identified similar apps to the specified app. Evaluated over a collection of 533 annotated features from 100 apps, SAFER achieves a Hit@15 score of up to 78.68% and outperforms the baseline approach KNN+ by 17.23% on average. In addition, we also compare SAFER against a typical technique of recommending features from user reviews, i.e., CLAP. Experimental results reveal that SAFER is superior to CLAP by 23.54% in terms of Hit@15.
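SAFER is evaluated with Hit@K (e.g., Hit@15): a recommendation list scores a hit if at least one ground-truth feature appears in its top K. A tiny sketch of the metric follows, with made-up feature lists.

    # Hit@K sketch: 1 if any ground-truth feature appears in the top-K
    # recommendations, else 0 (averaged over apps to get the reported score).
    def hit_at_k(recommended: list, ground_truth: set, k: int = 15) -> int:
        return int(any(f in ground_truth for f in recommended[:k]))

    recs = ["video call", "group chat", "stickers", "dark mode"]  # made-up
    truth = {"dark mode", "voice notes"}
    print(hit_at_k(recs, truth))  # -> 1, a hit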
Conference Paper
Play Store and App Store have a large number of apps that are in competition as they share a fair amount of common features. User reviews of apps contain important information such as feature evaluation, bug report and feature request, which is useful input for the improvement of app quality. Automatic extraction and summarization of this information could offer app developers opportunities for understanding the strengths and weaknesses of their app and/or prioritizing the app features for the next release cycle. To support these goals, we developed the tool REVSUM which automatically identifies developer-relevant information from reviews, such as reported bugs or requested features. Then, app features are extracted automatically from these reviews using the recently proposed rule based approach SAFE. Finally, a summary is generated that supports the application of the following three use cases: (1) view users' sentiments about app features in competing apps, (2) detect which summarized app features were mentioned in bug related reviews, and (3) identify new app features requested by users.
Article
App stores are highly competitive markets, sometimes offering dozens of apps for a single use case. Unexpected app changes such as a feature removal might incite even loyal users to explore alternative apps. Sentiment analysis tools can help monitor users’ emotions expressed, e.g., in app reviews or tweets. We found that these emotions include four recurring patterns corresponding to the app releases. Based on these patterns and online reports about popular apps, we derived five release lessons to assist app vendors maintain positive emotions and gain competitive advantages.
Article
Compared with traditional software, the domain analysis of apps is conducted not only in the early stage of software development, to gain knowledge of a particular domain, but also throughout each iteration of an app, to help developers understand the evolution trends of the domain and maintain their competitiveness. In this paper, we propose an approach to automatically analyze app descriptions combined with reviews in app stores and construct a feature-based domain state model (FDSM) in the form of a state machine to support the domain analysis of apps. In FDSM, the domain knowledge accumulated up to a certain moment is defined as a state. The initial state summarizes the high-level knowledge by gaining topics from app descriptions, whereas each transition is generated based on the information gained within one period of time and describes the change from the current state to the next one. Furthermore, user opinions in reviews are introduced into the model to quantify the value of information, helping developers obtain key domain knowledge efficiently. To validate the proposed approach, we conducted a series of experiments based on Google Play. The results show that FDSM can provide valuable information for supporting domain analysis, especially in the evolution process of apps.
Article
The market for mobile apps is getting bigger and bigger, and it is expected to be worth over 100 billion dollars in 2020. To have a chance to succeed in such a competitive environment, developers need to build and maintain high-quality apps, continuously astonishing their users with the coolest new features. Mobile app marketplaces allow users to release reviews. Although reviews are aimed at helping users choose apps, they also contain precious information for developers, reporting bugs and suggesting new features. To exploit such a source of information, developers would have to manually read user reviews, something not doable when hundreds of them are collected per day. To help developers deal with such a task, we developed CLAP (Crowd Listener for releAse Planning), a web application able to (i) categorize user reviews based on the information they carry, (ii) cluster together related reviews, and (iii) prioritize the clusters of reviews to be implemented when planning the subsequent app release. We evaluated all the steps behind CLAP, showing its high accuracy in categorizing and clustering reviews and the meaningfulness of the recommended prioritizations. Also, given the availability of CLAP as a working tool, we assessed its applicability in industrial environments.
Article
Domain analysis aims at gaining knowledge of a particular domain in the early stage of software development. A key challenge in domain analysis is to extract features automatically from related product artifacts. Compared with other kinds of artifacts, a high volume of descriptions can easily be collected from app marketplaces (such as Google Play and the Apple Store) when developing a new mobile application (app), so gaining features and relationships from them using data analysis techniques is essential for the success of domain analysis. In this paper, we propose an approach to mine domain knowledge from app descriptions automatically, where the information on features in a single app description is first extracted and formally described by a Concern-based Description Model (CDM), which is based on pre-defined rules of feature extraction and a modified topic modeling method; the overall knowledge in the domain is then identified by classifying, clustering, and merging the knowledge in the set of CDMs and topics, and the results are formalized by a Data-based Raw Domain Model (DRDM). Furthermore, we propose a quantified evaluation method for prioritizing the knowledge in DRDM. The proposed approach is validated by a series of experiments.
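One building block here is topic modeling over app descriptions. A minimal LDA sketch with scikit-learn on four invented descriptions (the paper uses a modified topic modeling method, not plain LDA):

    # LDA topic-mining sketch over app descriptions.
    # Requires: pip install scikit-learn
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    descriptions = [
        "track runs, calories and heart rate with daily goals",
        "workout plans, exercise library and fitness tracking",
        "scan receipts, track expenses and plan your budget",
        "personal finance manager with spending reports",
    ]
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(descriptions)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    terms = vec.get_feature_names_out()
    for i, topic in enumerate(lda.components_):
        print(f"topic {i}:", [terms[j] for j in topic.argsort()[-5:][::-1]])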
Conference Paper
Continuous Delivery (CD) enables mobile developers to release small, high-quality chunks of working software in a rapid manner. However, faster delivery and higher software quality guarantee neither user satisfaction nor positive business outcomes. Previous work demonstrates that app reviews may contain crucial information that can guide developers' software maintenance efforts toward higher customer satisfaction. However, previous work also highlights the difficulties developers encounter in manually analyzing this rich source of data, namely (i) the huge amount of reviews an app may receive on a daily basis and (ii) the unstructured nature of their content. In this paper, we propose SURF (Summarizer of User Reviews Feedback), a tool able to (i) analyze and classify the information contained in app reviews and (ii) distill actionable change tasks for improving mobile applications. Specifically, SURF performs a systematic summarization of thousands of user reviews through the generation of an interactive, structured, and condensed agenda of recommended software changes. An end-to-end evaluation of SURF, involving 2622 reviews related to 12 different mobile applications, demonstrates the high accuracy of SURF in summarizing user review content. In evaluating our approach we also involve the original developers of some apps, who confirm the practical usefulness of the software change recommendations made by SURF. Demo URL: https://youtu.be/Yf-U5ylJXvo
Article
We have located the results of empirical studies on elicitation techniques and aggregated these results to gather empirically grounded evidence. Our chosen surveying methodology was systematic review, whereas we used an adaptation of comparative analysis for aggregation because meta-analysis techniques could not be applied. The review identified 564 publications from the SCOPUS, IEEEXPLORE, and ACM DL databases, as well as Google. We selected and extracted data from 26 of those publications. The selected publications contain 30 empirical studies. These studies were designed to test 43 elicitation techniques and 50 different response variables. We got 100 separate results from the experiments. The aggregation generated 17 pieces of knowledge about the interviewing, laddering, sorting, and protocol analysis elicitation techniques. We provide a set of guidelines based on the gathered pieces of knowledge.
Conference Paper
If software engineering tools are not "properly integrated", they can reduce engineers' productivity: associating and retrieving information scattered across the tools become unsystematic and inefficient. Our work provides empirical evidence on what constitutes "poor" and "proper" tool integration, focusing on practitioners' perspectives. We interviewed 62 engineers and analyzed the content of their project artifacts. We identified problem situations and practices related to tool integration. Engineers agreed that tool integration approaches must support change, heterogeneity, and automatic linking of change to context. To quantify our results, we conducted a field experiment with 27 subjects and a survey with 782 subjects. We found a strong correlation between change frequency and preferred integration approaches. Particularly in projects with short release cycles, tasks should be used to link information handled by different tools. We also found that half of engineers' work is not defined as tasks. Therefore, a context-based tool integration approach is more effective than a task-based one.
Advancing Requirements Engineering through Generative AI: Assessing the Role of LLMs
  • Chetan Arora
  • John Grundy
  • Mohamed Abdelrazek
Chetan Arora, John Grundy, and Mohamed Abdelrazek. 2023. Advancing Requirements Engineering through Generative AI: Assessing the Role of LLMs. https://doi.org/10.48550/arXiv.2310.13976 arXiv:2310.13976 [cs].
Elicitron: An LLM Agent-Based Simulation Framework for Design Requirements Elicitation
  • Mohammadmehdi Ataei
  • Hyunmin Cheong
  • Daniele Grandi
  • Ye Wang
  • Nigel Morris
Mohammadmehdi Ataei, Hyunmin Cheong, Daniele Grandi, Ye Wang, Nigel Morris, et al. 2024. Elicitron: An LLM Agent-Based Simulation Framework for Design Requirements Elicitation. https://doi.org/10.48550/arXiv.2404.16045 arXiv:2404.16045 [cs].
Exploring Human-AI Collaboration in Agile: Customised LLM Meeting Assistants
  • Beatriz Cabrero-Daniel
  • Tomas Herda
  • Victoria Pichler
  • Martin Eder
Beatriz Cabrero-Daniel, Tomas Herda, Victoria Pichler, and Martin Eder. 2024. Exploring Human-AI Collaboration in Agile: Customised LLM Meeting Assistants. https://doi.org/10.48550/arXiv.2404.14871 arXiv:2404.14871 [cs].
GPT-4 Technical Report
  • OpenAI
  • Josh Achiam
  • Steven Adler
  • Sandhini Agarwal
  • Lama Ahmad
OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, et al. 2024. GPT-4 Technical Report. https://doi.org/10.48550/arXiv.2303.08774 arXiv:2303.08774 [cs].
App competition matters: How to identify your competitor apps
  • Md Kafil Uddin
  • Qiang He
  • Jun Han
  • Caslon Chua
Md Kafil Uddin, Qiang He, Jun Han, and Caslon Chua. 2020. App competition matters: How to identify your competitor apps?. In Proceedings - 2020 IEEE 13th International Conference on Services Computing, SCC 2020. 370-377.