
On user rationale in software engineering

Abstract

Rationale refers to the reasoning and justification behind human decisions, opinions, and beliefs. In software engineering, rationale management focuses on capturing design and requirements decisions and on organizing and reusing project knowledge. This paper takes a different view on rationale written by users in online reviews. We studied 32,414 reviews for 52 software applications in the Amazon Store. Through a grounded theory approach and peer content analysis, we investigated how users argue and justify their decisions, e.g., about upgrading, installing, or switching software applications. We also studied the occurrence frequency of rationale concepts such as issues encountered or alternatives considered in the reviews and found that assessment criteria like performance, compatibility, and usability represent the most pervasive concept. We identified a moderate positive correlation between issues and criteria and furthermore assessed the distribution of rationale concepts with respect to rating and verbosity. We found that issues tend to appear more in reviews with lower star ratings, while criteria, alternatives, and justifications seem to appear more in three-star reviews. Also, reviews reporting alternatives seem to be more verbose than reviews reporting criteria. A follow-up qualitative study of sub-concepts revealed that users also report other alternatives (e.g., alternative software provider), criteria (e.g., cost), and decisions (e.g., on rating software). We then used the truth set of manually labeled review sentences to explore how accurately we can mine rationale concepts from the reviews. We evaluated the classification algorithms Naive Bayes, Support Vector Machine, Logistic Regression, Decision Tree, Gaussian Process, Random Forest, and Multilayer Perceptron Classifier using a baseline and random configuration. Support Vector Classifier, Naive Bayes, and Logistic Regression, trained on the review metadata, syntax tree of the review text, and influential terms, achieved a precision around 80% for predicting sentences with alternatives and decisions, with top recall values of 98%. On the review level, precision was up to 13% higher with recall values reaching 99%. Using only word features, we achieved in most cases the highest precision with the Random Forest algorithm and the highest recall with the Naive Bayes algorithm. We discuss the findings and the importance of rationale for supporting deliberation in user communities and synthesizing the reviews for developers.
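To make the idea of classifying review sentences from word features alone more concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier written with the Python standard library. It is not the authors' implementation: the training sentences, the three labels, and the function names `tokenize`, `train`, and `predict` are all hypothetical and chosen only for illustration, and the study's additional features (review metadata, syntax trees, influential terms) are not modeled.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical, hand-written training sentences (NOT data from the study).
TRAIN = [
    ("the app crashes every time i open a file", "issue"),
    ("it freezes and crashes after the update", "issue"),
    ("i switched to another product instead", "alternative"),
    ("the free competitor is a better alternative", "alternative"),
    ("it is fast and easy to use", "criteria"),
    ("performance and compatibility are great", "criteria"),
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(examples):
    """Count sentences per label and word occurrences per label."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_docs[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_docs, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log posterior (add-one smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # ignore words never seen during training
            count = word_counts[label][tok]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
print(predict(model, "the app crashes after the update"))
print(predict(model, "i switched to a better alternative"))
```

With this toy data the first sentence is assigned to `issue` and the second to `alternative`. A realistic setup would need far more labeled sentences and a held-out evaluation before any precision or recall figure could be reported.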
Requirements Engineering (2018) 23:357–379
https://doi.org/10.1007/s00766-018-0293-2
RE 2017
On user rationale insoftware engineering
ZijadKurtanović1 · WalidMaalej1
Received: 21 October 2017 / Accepted: 23 March 2018 / Published online: 6 April 2018
© Springer-Verlag London Ltd., part of Springer Nature 2018
Keywords App analytics · Rationale · Review mining
1 Introduction
According to Merriam-Webster [40], rationale is “the explanation of controlling principles of opinion, belief, practice, or phenomena, or an underlying reason.” Rationale management focuses on capturing and sharing the reasons and justifications behind decisions. The design rationale software design paradigm aims at capturing design decisions and their justifications [4]. Over the last decades, managing requirements and design rationale has been a major concern in software engineering [5, 12, 23]. Ideally, rationale should be captured in requirements and design artifacts to document why certain project decisions were taken [13]. This includes the questions or issues encountered by designers and analysts, the alternatives explored to solve the issues, and the criteria to evaluate the alternatives [4]. Rationale is also often found in informal artifacts such as team conversations or sketches [28].
With the increasing popularity of social media, user forums, and app stores, software vendors started giving more attention to the input of users when making decisions about software design, development, and evolution
* Zijad Kurtanović
kurtanovic@informatik.uni-hamburg.de
Walid Maalej
maalej@informatik.uni-hamburg.de
1 University ofHamburg, Hamburg, Germany
... To our knowledge, there are no clear thresholds for precision, recall, and F-score values in the existing literature that automatic rationale identification should at least achieve to be effective. Kurtanović and Maalej (2018) filtered all results having precision, recall, or the F1-score below 0.6. Usually, researchers compare their best-achieved precision, recall, and F-scores with related work. ...
... ConDec's automatic text classification is the only one that also classifies commit messages. Kurtanović and Maalej (2018) classify user rationale, whereas the other approaches (including ConDec) classify developer rationale. ...
... All except one approach by Kurtanović and Maalej (2018) perform a binary classification to detect text parts with rationale, decisions, or issues before the more detailed (fine-grained) classification (Table 10.7). The binary classification resulted in best F1-scores between 0.68 to 0.95 (for ConDec 0.81). ...
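The first excerpt above mentions discarding results whose precision, recall, or F1-score falls below 0.6. As a rough illustration of that kind of filter, here is a minimal Python sketch. The confusion counts, the configuration names, and the helper names `precision_recall_f1` and `passes_threshold` are hypothetical and do not come from either publication.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


def passes_threshold(scores, threshold=0.6):
    """Keep a result only if none of its scores is below the threshold."""
    return all(score >= threshold for score in scores)


# Hypothetical (true positives, false positives, false negatives) per run.
RESULTS = {
    "config_a": (80, 20, 2),   # precision 0.80, recall about 0.98
    "config_b": (30, 40, 10),  # precision about 0.43, recall 0.75
}

kept = {}
for name, counts in RESULTS.items():
    scores = precision_recall_f1(*counts)
    if passes_threshold(scores):
        kept[name] = scores

for name, (p, r, f1) in kept.items():
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Only `config_a` survives here, because the precision of `config_b` is below the cutoff even though its recall is not.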
Thesis
Continuous Software Engineering (CSE) is a software life cycle model open to frequent changes in requirements or technology. During CSE, software developers continuously make decisions on the requirements and design of the software or the development process. They establish essential decision knowledge, which they need to document and share so that it supports the evolution and changes of the software. The management of decision knowledge is called rationale management. Rationale management provides an opportunity to support the change process during CSE. However, rationale management is not well integrated into CSE. The overall goal of this dissertation is to provide workflows and tool support for continuous rationale management. The dissertation contributes an interview study with practitioners from the industry, which investigates rationale management problems, current practices, and features to support continuous rationale management beneficial for practitioners. Problems of rationale management in practice are threefold: First, documenting decision knowledge is intrusive in the development process and an additional effort. Second, the high amount of distributed decision knowledge documentation is difficult to access and use. Third, the documented knowledge can be of low quality, e.g., outdated, which impedes its use. The dissertation contributes a systematic mapping study on recommendation and classification approaches to treat the rationale management problems. The major contribution of this dissertation is a validated approach for continuous rationale management consisting of the ConRat life cycle model extension and the comprehensive ConDec tool support. To reduce intrusiveness and additional effort, ConRat integrates rationale management activities into existing workflows, such as requirements elicitation, development, and meetings. ConDec integrates into standard development tools instead of providing a separate tool. 
ConDec enables lightweight capturing and use of decision knowledge from various artifacts and reduces the developers' effort through automatic text classification, recommendation, and nudging mechanisms for rationale management. To enable access and use of distributed decision knowledge documentation, ConRat defines a knowledge model of decision knowledge and other artifacts. ConDec instantiates the model as a knowledge graph and offers interactive knowledge views with useful tailoring, e.g., transitive linking. To operationalize high quality, ConRat introduces the rationale backlog, the definition of done for knowledge documentation, and metrics for intra-rationale completeness and decision coverage of requirements and code. ConDec implements these agile concepts for rationale management and a knowledge dashboard. ConDec also supports consistent changes through change impact analysis. The dissertation shows the feasibility, effectiveness, and user acceptance of ConRat and ConDec in six case study projects in an industrial setting. Besides, it comprehensively analyses the rationale documentation created in the projects. The validation indicates that ConRat and ConDec benefit CSE projects. Based on the dissertation, continuous rationale management should become a standard part of CSE, like automated testing or continuous integration.
... Intent identification involves determining customer intent based on online user reviews (Khattak et al. 2021). Also, previous studies demonstrate that crowd comments from end-users in mobile stores (Sarro et al. 2018), Tweets, and posts (Khan et al. 2020a; Morales-Ramirez et al. 2017) include important and well-informed information regarding the user's interaction (Bakiu and Guzman 2017), requests for new features (Panichella et al. 2015; Dhinakaran et al. 2018), and end-user rationale information (Kurtanović and Maalej 2018; Jarczyk et al. 1992), which is used to make informed requirements decisions (Khan et al. 2019b, 2020b). ...
... Finally, the least prominent rationale element in the data set is the decision, which is 4.7% (563) end-user comments of an overall 11,936 end-user comments annotation sample. With the decision rationale element, software developers and requirements engineers can take pivotal requirements-related decisions to improve the performance of the existing software applications, i.e., why certain crowd-users ended up leaving the software application (Kurtanović and Maalej 2018; Khan et al. 2019b). Additionally, while analyzing the end-user comments in the Amazon store, we found that crowd-users register negative comments against the software application, while its corresponding rating score is either 5 or 4, representing supporting-claim. ...
... In particular, MLP, Voting, LSTM, and RF ML and DL algorithms perform relatively better and predicted higher precision, recall, and F-measure values for the distant rationale elements, i.e., supporting, decision, attacking, neutral, and issues identified by the proposed approach, as shown in Table 4. Based on the experimental results shown in Table 4, we conclude that either MLP, Voting, LSTM, or RF can be selected as the best ML or DL classifier to identify various rationale elements in the crowd-user's comments in the social media platform. Furthermore, the proposed approach outperforms previous similar research approaches (Kurtanović and Maalej 2018; Khan et al. 2020b; Alkadhi et al. 2017a) regarding classification accuracy, precision, recall, and F-measure. As can be seen in Table 4, we achieved higher accuracy, precision, recall, and F-measure values than the previous rationale mining approaches. ...
Article
Rationale refers to making human judgments, sets of reasons, or intentions to explain a particular decision. Nowadays, crowd-users argue and justify their decisions on social media platforms about market-driven software applications, thus generating a software rationale. Such rationale information can be of pivotal importance for the software and requirements engineers to enhance the performance of existing software applications by revealing end-users' tacit knowledge to improve software design and development decision-making. For this purpose, we proposed an automated approach to capture and analyze end-user reviews containing rationale information, focusing on low-rating applications in the Amazon store using Natural Language Processing (NLP) and supervised machine learning (ML) classification methods. In the literature, high-rating applications have been emphasized while low-rating software applications have been ignored, which causes potential bias. Therefore, we examined 59 comparatively low-ranked market-based software applications from the Amazon app store covering various software categories to capture and identify crowd-users' justifications. Next, using a developed grounded theory and content analysis approach, we studied and recorded how crowd-users analyze and explain their rationale based on issues encountered, attacking or supporting arguments registered, and updating or uninstalling software applications. Also, to achieve the best results, an experimental study is conducted by comparing various ML and deep learning (DL) algorithms, i.e., MNB, LR, RF, MLP, KNN, AdaBoost, Voting, LSTM, and BILSTM classifiers on the end-user rationale data set by preprocessing the input data, applying feature engineering, balancing the data set, and then training and testing the ML algorithms with a standard cross-validation approach. We obtained satisfactory results with MLP, voting, LSTM, and RF Classifiers, having 93%, 93%, 91%, and 90% average accuracy, respectively. 
Also, we plot the ROC curves for the high-performing ML and DL classifiers to identify and capture classifiers yielding the best performance with an under-sampling or oversampling balancing approach. The proposed research approach outperforms the existing rationale approaches with better Precision, Recall, and F-measure values. Additionally, the paper discusses various aspects of the proposed approach by extending it in multiple directions to be utilized by software developers and vendors to help improve the performance of existing software applications.
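The abstract above mentions balancing the data set by under-sampling or oversampling before training. One common variant, random oversampling, can be sketched as follows. This is only an assumed illustration: the abstract does not say which balancing procedure was used, and the sample comments, labels, and the function name `random_oversample` are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, seed=0):
    """Duplicate randomly chosen minority-class samples until every
    class has as many samples as the largest class."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        missing = target - len(group)
        balanced.extend(rng.choice(group) for _ in range(missing))
    return balanced


# Hypothetical labeled comments (NOT data from the cited study).
SAMPLES = [
    ("crashes on start", "issue"),
    ("login fails", "issue"),
    ("sync is broken", "issue"),
    ("battery drains fast", "issue"),
    ("i uninstalled it", "decision"),
    ("works as promised", "supporting"),
    ("great value", "supporting"),
]

balanced = random_oversample(SAMPLES)
print(Counter(label for _, label in balanced))
```

Oversampling should be applied to the training folds only, since duplicating samples before splitting would leak copies of the same comment into the test data.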
... According to Statista, app stores such as Google Play and App Store are considered popular platforms among internet users [1,2]. These platforms provide a large volume of user feedback that represents valuable information, which can be analyzed to support decision making in software engineering [27]. The proliferation of available user feedback offers extensive opportunities to use this information in novel ways to support decision-making in different software tasks, ranging from requirements engineering to maintenance and release planning. ...
Article
In the era of Internet access, software is easily available on digital distribution platforms such as app stores. The distribution of software on these platforms makes user feedback more accessible, and this feedback can be used in contexts ranging from requirements engineering to software maintenance. However, such user reviews might contain technical information about the app that can be valuable for developers and software companies. Due to the pervasive use of mobile apps, a large amount of data is created by users on a daily basis. Manual identification and classification of such reviews are time-consuming and laborious tasks. Hence, automating this process is essential for assisting developers in managing these reviews efficiently. Prior studies have focused on classification of these reviews into bug reports, user experience, and feature requests. Nevertheless, to date, very few research papers have extracted Non-Functional Requirements (NFRs) present in these reviews. NFRs are considered the set of quality attributes such as reliability, performance, security and usability of the software. Previous studies have utilized machine learning techniques to classify these reviews into their respective classes. However, it was observed that existing studies treat review classification as a single-label classification problem, and also underestimate the contextual relationship between the words of review statements. To alleviate this limitation, the proposed research work used a transfer learning model to classify multi-label app reviews into four NFRs: Dependability, Performance, Supportability, and Usability. The proposed approach evaluates the performance of the pre-trained language model for multi-label review classification. In this paper, a set of experiments is conducted to compare the performance of the proposed model against baseline machine learning with binary relevance and a keyword-based approach. 
We evaluated our approach over a dataset of 6000 user reviews of 24 iOS apps. Experimental results show that the proposed model outperforms state-of-the-art baseline techniques with respect to precision, recall, and F1-measure.
... [1], [3], [29]. Data mining and analysis for tweets [31], [32], Amazon Product Review [33] and Product reviews and description [34] has focused on requirement engineering too. Stanik et al. [35] worked on classifying multilingual user feedback (with single CNN architecture and single word embedding) and before him Maalej et al. [36] tried to automatically classify app reviews but did not take Neural Network based architecture into account. ...
Preprint
The energy inefficiency of the apps can be a major issue for the app users which is discussed on App Stores extensively. Previous research has shown the importance of investigating the energy related app reviews to identify the major causes or categories of energy related user feedback. However, there is no study that efficiently extracts the energy related app reviews automatically. In this paper, we empirically study different techniques for automatic extraction of the energy related user feedback. We compare the accuracy, F1-score and run time of numerous machine-learning models with relevant feature combinations and relatively modern Neural Network-based models. In total, 60 machine learning models are compared to 30 models that we build using six neural network architectures and three word embedding models. We develop a visualization tool for this study through which a developer can traverse through this large-scale result set. The results show that neural networks outperform the other machine learning techniques and can achieve the highest F1-score of 0.935. To replicate the research results, we have open sourced the interactive visualization tool. After identifying the best results and extracting the energy related reviews, we further compare various techniques to help the developers automatically investigate the emerging issues that might be responsible for energy inefficiency of the apps. We experiment the previously used string matching with results obtained from applying two of the state-of-the-art topic modeling algorithms, OBTM and AOLDA. Finally, we run a qualitative study performed in collaboration with developers and students from different institutions to determine their preferences for identifying necessary topics from previously categorized reviews, which shows OBTM produces the most helpful results.
... To provide insights about how our approach fared against the state-of-the-art, we selected a relatively recent approach by Kurtanovic and Maalej [30] which is described extensively in the original paper and shows Since the original classifier was not publicly available, and the authors could not provide us with the code as a revised version was being developed as part of a research project, we reconstructed it from the details provided in the original publication. We also complemented such information with a code of another classifier related to app reviews [31] that is partially available online and developed by the same research group. ...
Preprint
[Context] Study replication is essential for theory building and empirical validation. [Problem] Despite its empirical vocation, requirements engineering (RE) research has given limited attention to study replication, threatening thereby the ability to verify existing results and use previous research as a baseline. [Solution] In this perspective paper, we -- a group of experts in natural language processing (NLP) for RE -- reflect on the challenges for study replication in NLP for RE. Concretely: (i) we report on hands-on experiences of replication, (ii) we review the state-of-the-art and extract replication-relevant information, and (iii) we identify, through focus groups, challenges across two typical dimensions of replication: data annotation and tool reconstruction. NLP for RE is a research area that is suitable for study replication since it builds on automated tools which can be shared, and quantitative evaluation that enable direct comparisons between results. [Results] Replication is hampered by several factors, including the context specificity of the studies, the heterogeneity of the tasks involving NLP, the tasks' inherent hairiness, and, in turn, the heterogeneous reporting structure. To address these issues, we propose an ID card whose goal is to provide a structured summary of research papers, with an emphasis on replication-relevant information. [Contribution] We contribute in this study with: (i) a set of reflections on replication in NLP for RE, (ii) a set of recommendations for researchers in the field to increase their awareness on the topic, and (iii) an ID card that is intended to primarily foster replication, and can also be used in other contexts, e.g., for educational purposes. Practitioners will also benefit from the results since replications increase confidence on research findings.
Chapter
Improving and updating applications based on user reviews is crucial to the continuous development of modern mobile applications. However, software bug descriptions in user reviews are often written by non-professional users, and contain a lot of irrelevant text, making it challenging to conduct bug localization. The current software bug localization technologies based on user reviews are not able to address these challenges effectively, resulting in suboptimal results. To address this issue, we propose ReviewLocator, which focuses on key phrases and learning from historical bug reports. It first utilizes syntactic analysis or source file parsing to convert each user review or source file into phrase representations. Then it depends on Key Phrase-based Ranking using a newly proposed Bug Report-based Term Weight to map review phrase sets to source file phrase sets. In our experiments on eight applications from the Google Play Store, the results prove our proposal surpasses ChangeAdvisor and Where2Change with an absolute improvement of 0.076 and 0.055 in terms of MAP correspondingly.
Article
Researchers are increasingly recognizing the importance of human aspects in software development. Because qualitative methods are used to explore human behavior in‐depth, we believe that studies using such methods will become more common. Existing qualitative software engineering guidelines do not cover the full breadth of qualitative methods and the knowledge on how to use them like in social sciences. The purpose of this study was to extend the software engineering community's current body of knowledge regarding available qualitative methods and their quality assurance frameworks and to provide recommendations and guidelines for their use. With the support of an epistemological argument and a survey of the literature, we suggest that future research would benefit from (1) utilizing a broader set of research methods, (2) more strongly emphasizing reflexivity, and (3) employing qualitative guidelines and quality criteria. We present an overview of three qualitative methods commonly used in social sciences but rarely seen in software engineering research, namely interpretative phenomenological analysis, narrative analysis, and discourse analysis. Furthermore, we discuss the meaning of reflexivity in relation to the software engineering context and suggest means of fostering it. Our paper will help software engineering researchers better select and then guide the application of a broader set of qualitative research methods.
Article
Context: Classification of software requirements into different categories is a critically important task in requirements engineering (RE). Developing machine learning (ML) approaches for requirements classification has attracted great interest in the RE community since the 2000s. Objective: This paper aims to address two related problems that have been challenging real-world applications of ML approaches: the problems of class imbalance and high dimensionality with low sample size data (HDLSS). These problems can greatly degrade the classification performance of ML methods. Methods: The paper proposes HC4RC, a novel ML approach for multiclass classification of requirements. HC4RC solves the aforementioned problems through semantic-role based feature selection, dataset decomposition and hierarchical classification. We experimentally compare the effectiveness of HC4RC with three closely related approaches, two of which are based on a traditional statistical classification model and one of which uses an advanced deep learning model. Results: Our experiment shows: (1) The class imbalance and HDLSS problems present a challenge to both traditional and advanced ML approaches. (2) The HC4RC approach is simple to use and can effectively address the class imbalance and HDLSS problems compared to similar approaches. Conclusion: This paper makes an important practical contribution to addressing the class imbalance and HDLSS problems in multiclass classification of software requirements.
Article
Newsrooms are still searching for ways to manage user comments because of both a desire for professional distance from their audiences and a lack of analytical tools. This paper presents findings from our exploratory, interdisciplinary study in journalism research and computer science that focuses on the algorithmic classification and clustering of user comments. In contrast to endeavours that aim at filtering out hate speech or spam, we take a more constructive approach and focus on detecting particularly useful or high-quality user contributions that can be leveraged for journalistic purposes. On the basis of a literature review and our own preliminary research on audience participation and user review analytics, we developed a mock-up of a software framework to help journalists systematically analyze user comments to this end. We then surveyed its effectiveness through two group discussions - one with comment moderators and another with editors from different editorial departments of a large German online newsroom. Features that journalists and comment moderators considered useful include the categorization of user comments in pro- and contra-arguments towards a certain topic, the automated assessment of comments' quality as well as the identification of surprising or exceptional comments and those that present new questions, arguments or viewpoints.
Article
Chat messages of development teams play an increasingly significant role in software development, having replaced emails in some cases. Chat messages contain information about discussed issues, considered alternatives and argumentation leading to the decisions made during software development. These elements, defined as rationale, are invaluable during software evolution for documenting and reusing development knowledge. Rationale is also essential for coping with changes and for effective maintenance of the software system. However, exploiting the rationale hidden in the chat messages is challenging due to the high volume of unstructured messages covering a wide range of topics. This work presents the results of an exploratory study examining the frequency of rationale in chat messages, the completeness of the available rationale and the potential of automatic techniques for rationale extraction. For this purpose, we apply content analysis and machine learning techniques on more than 8,700 chat messages from three software development projects. Our results show that chat messages are a rich source of rationale and that machine learning is a promising technique for detecting rationale and identifying different rationale elements.
Conference Paper
Context: Conducting a Grounded Theory study is rigorous, demanding, and challenging. Misperceptions exist within the software engineering community. Objective: The purpose of this paper is to describe one extended participant observation Grounded Theory study for aiding new empirical researchers wanting to run similar research studies. Method: Following Constructivist Grounded Theory, we conducted a two-year five-month participant-observation of eight software development projects at Pivotal, a software development organization; interviewed 33 software engineers, interaction designers, and product managers; and analyzed one year of retrospection topics. We iterated between analysis and theoretical sampling until achieving theoretical saturation, publishing papers on team code ownership, sustainable software development through overlapping code ownership, and software development waste. Results: This paper describes the missteps, challenges, and unique insights that occurred while conducting a Grounded Theory study. Limitations: While the results are highly relevant to the researcher, the outcomes might not apply to other researchers. Conclusion: Conducting my own Grounded Theory research study, attending Glaser’s Seminar, and reading and rereading Charmaz’s and Glaser’s books helped the researcher overcome misperceptions about Grounded Theory research.
Conference Paper
Reviews for software products contain much information about the users' requirements and preferences, which can be very useful to the requirements engineer. However, taking advantage of this information is not easy due to the large and overwhelming number of reviews that is posted in various channels. Machine learning and opinion mining techniques have therefore been used to process the reviews automatically and to generate summaries of the data to the requirements engineer. However, one of the important challenges for these techniques lies in how to automatically assess the relevance of the reviews for the requirements engineer. So far, most techniques use intuition-based criteria for this task. In this work, we collect and present a list of factors that were found to impact the helpfulness of product reviews for customers. We then discuss to what extent these factors are likely to impact the usefulness of reviews for requirements engineering tasks. The factors can be used to support the automated identification of relevant reviews.
Chapter
One of the challenges in using statistical machine learning for text mining is coming up with the right set of text features. We have developed a system that uses genetic algorithms (GAs) to evaluate candidate feature sets to classify sentences in a document. We have applied this tool to find design rationale (the reasons behind design decisions) in two different datasets to evaluate our approach for finding rationale and to see how features might differ for the same classification target in different types of data. We used Chrome bug reports and transcripts of design sessions. We found that we were able to get results with less overfitting by using a smaller set of features common to the set optimized for each document type.