Is It Worth Responding to Reviews? A Case Study
of the Top Free Apps in the Google Play Store
Stuart McIlroy, Weiyi Shang, Nasir Ali, Ahmed E. Hassan
School of Computing, Queen’s University, Canada
Department of Computer Science and Software Engineering, Concordia University, Canada
Email: {mcilroy, nasir, ahmed}@cs.queensu.ca, shang@encs.concordia.ca
Abstract—The value of responding to a user review of a mobile
app has never been explored. Our analysis of app reviews and
responses from 10,713 top apps in the Google Play Store shows
that developers of frequently-reviewed apps never respond to
reviews. However, we observe that there are positive effects to
responding to reviews (users change their ratings 38.7% of the
time following a developer response) with a median increase of
20% in the rating.
I. INTRODUCTION
App stores provide feedback mechanisms for users by
allowing a user to rate an app using a five-star rating system
and to write a short review. Addressing user feedback is an
important part of developing and maintaining popularity in
app stores. A viable mechanism for addressing user feedback
is through personally responding to a particular user review.
Developers are able to respond to a complaint or thank
the user for kind remarks about the app. The response may
motivate the user to change the rating of their app review or
to write a more positive review. However, developers have
limited time. Responding to reviews takes away time that
the developer could use to enhance their app. It is not clear
how often users change their rating after a response, if at all.
Additionally, it would be beneficial to know which types of
reviews, i.e., in terms of content, are most likely to have their
rating updated if the developer were to respond to the review.
Developers should therefore make a strategic choice to respond to the reviews that are most likely to lead to a positive update of their ratings. To the best of our knowledge, there exists no prior research investigating how developers respond to reviews or the value of responding to reviews.
In this paper, we empirically investigate app reviews and the
responses to the reviews from the perspective of developers of
the top apps in the Google Play Store. Through an analysis of the reviews and developer responses for the top 10,713 apps in the Google Play Store over a period of two months, we explore the following research question:
RQ: What is the value of responding to reviews?
Developers of 13.8% of the studied apps responded to
at least one review during the studied time period. The
most-reviewed apps never responded to a review during
our study period. Users change their rating 38.7% of the
time following a developer response. The median rating
change is a one-star increase out of five.
Fig. 1. The review and response process between developers and users.
TABLE I
DATASETS OF PRIOR WORK IN MINING MOBILE REVIEWS
Paper App Store Apps Reviews
Iacob and Harrison [1] Google Play Store 161 3,279
Galvis and Carreno [2] Google Play Store 2 710
Fu et al. [3] Google Play Store 171,493 13,286,706
Chen et al. [4] Google Play Store 4 169,097
Pagano and Maalej [5] Apple App Store 1,100 1,126,453
Takeaway Message: The results from our study suggest that
there is value in responding to user reviews. Users are likely
to update their star ratings upwards.
II. BACKGROUND AND RELATED WORK
A. Rating an App
Once a user downloads an app, the user is able to leave a
rating, a review or both. The rating and review can both be
updated.
Developers are able to respond to a review by any user. The
developer’s response is public for anyone to see (not just that
particular user). The user is then notified that the developer
has left a response. Figure 1 shows the process of rating an
app.
B. Related Work
Previous work confirms that reviews of mobile apps have a
major impact on the success of an app [6–8]. Harman et al.
show a strong correlation between app ratings and the total
downloads of an app [6]. User reviews contain information that
could help developers improve the quality of their apps and increase their revenue. Kim et al. [7] conducted interviews of app buyers and show that reviews are one of the key determinants in a user's purchase of an app. Similarly, Mudambi et al. [8] showed that user reviews have a major impact on the sales of online products.
The importance of user reviews motivates many recent studies on analyzing and summarizing user reviews for mobile apps, as Table I shows. A recent study by Pagano and Maalej analyzed the content of reviews of both free and paid apps in the Apple App Store [5]. Guzman et al. [9] identify app features in the reviews using natural language processing techniques and leverage sentiment analysis to identify whether users like the features. Maalej et al. [10] propose an approach that automatically classifies reviews into four categories: bug reports, feature requests, user experiences, and ratings. Iacob and Harrison [1] built a rule-based automated tool to extract feature requests from user reviews of mobile apps; their approach identifies whether or not a user review contains a feature request. Chandy and Gu identified spam messages in reviews on the Apple App Store [11]. Carreño et al. [2] used opinion mining techniques and topic modelling to extract requirements from user reviews. Fu et al. present an approach that discovers inconsistencies in apps and analyzes the negative reviews of apps using topic analysis [3]. Khalid et al. manually analyzed and categorized one- and two-star reviews [12], identifying the different issues that users complained about in mobile apps. Chen et al. propose the most extensive summarization approach to date [4]. They remove uninformative reviews and prioritize the most informative ones before presenting a visualization of the content of reviews. Our work differs from these studies in that we focus on the value of responding to reviews, which has never been studied before.
C. Mobile App Analytics
Vision Mobile performed a survey of 7,000 developers and
found that 40% of developers make use of user analytics
tools [13] and 18% use crash reporting and bug tracking
tools. Previous studies also highlight that app developers need
analytics tools. For example, Pagano and Bruegge conducted
a study on how feedback occurs after the initial release of a
product [14]. The authors concluded that there is a need to
structure and analyze feedback, particularly when it occurs in
large quantities.
Nowadays, there exist many app analytics companies, e.g.,
App Annie (http://www.appannie.com/app-store-analytics/),
that specialize in giving developers tools to understand how
users interact with the developers' apps, how developers generate revenue (in-app purchases, e-commerce, direct buy), and
the demographics of app users. These app analytics companies
also provide developers with overviews of user feedback and
logged crash reports. Google has promoted their own extensive
analytics tools for Android developers as a key competitive
differentiator relative to other mobile stores. The tools measure
how users are using an app (e.g., identify the locations of users
and how they reached the app). The tools also track sales data
(e.g., tracking how the developer makes money through in-
app purchases and calculating the impact of promotions on
the sales of an app). However, other than crash-reporting tools, most of the analytics tools available today are sales oriented rather than software-quality oriented.
III. EMPIRICAL STUDY DESIGN
In this section, we present the design of our study and the
data collection and processing methods used in our study.
A. Data Selection
We focus on the top apps of the Google Play Store, since top apps often have a large number of reviews and are more interested in maintaining and growing their user base (and their ratings). Our criteria for selecting an app store are its popularity (the Google Play Store is one of the most popular app stores), the ability to respond to reviews (the iOS App Store does not support responding to reviews), and the availability of tools to automatically collect information from the app store. We collected the reviews and responses of 12,000 free-to-download apps from the Google Play Store. Across thirty different categories, e.g., Photography, Sports and Education, we selected the top apps in each category in the USA based on Distimo's ranking of apps, for a total of 12,000 apps (Distimo ranks the top 400 apps for each of the 30 categories). Distimo is an app analytics company (http://www.distimo.com/leaderboards/google-play-store/united-states/top-overall/free). We used Distimo's Spring 2013 top-app list. We chose apps that were popular one year before our study because we are interested in stable, mature apps; recently released apps exhibit the expected frequent burst of reviews that follows the early releases of an app [5].
B. Data Collection
We developed a crawler to extract app information such as the app name, user ratings, and reviews. The crawler simulates a mobile device and interfaces with the Google Play API as a regular device would. We selected the Samsung Galaxy S3
phone as our simulated device since it is one of the most
popular Android devices. We modified the crawler to only
collect the relevant information for our study. We instituted a
timer to pause the crawler to avoid issuing too many requests
and we scaled the crawler over multiple machines to distribute
the load.
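A minimal sketch of the kind of daily, rate-limited crawl loop described above is shown below. The fetch_reviews function is a hypothetical stand-in: the paper's crawler emulates a Samsung Galaxy S3 and talks to the Google Play API, which has no official public client, so this is an illustration rather than the actual implementation.

```python
# Sketch of a daily, rate-limited review crawl (hypothetical client, not the paper's code).
import json
import os
import time
from datetime import date

REQUEST_PAUSE_SECONDS = 2              # pause between requests to limit load on the API
SIMULATED_DEVICE = "Samsung Galaxy S3"

def fetch_reviews(app_id: str, device: str) -> list:
    """Hypothetical stand-in: return the latest (up to 500) reviews of an app."""
    raise NotImplementedError("Replace with a real Google Play API client.")

def crawl_daily(app_ids: list, out_dir: str = "reviews") -> None:
    """Store one JSON file of reviews per app per day."""
    day_dir = os.path.join(out_dir, date.today().isoformat())
    os.makedirs(day_dir, exist_ok=True)
    for app_id in app_ids:
        try:
            reviews = fetch_reviews(app_id, SIMULATED_DEVICE)
        except Exception as err:        # e.g., the app was removed from the store
            print(f"skipping {app_id}: {err}")
            continue
        with open(os.path.join(day_dir, f"{app_id}.json"), "w") as fh:
            json.dump(reviews, fh)
        time.sleep(REQUEST_PAUSE_SECONDS)   # throttle between apps, as described above
```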
We ran the crawler daily over a period of approximately two months, from January 1st, 2014 to March 2nd, 2014. 1,287 of the 12,000 top apps were not accessible during our crawl (e.g., some apps were removed from the store). Hence, we collected data from 10,713 top
apps. 11,047 different releases of apps were collected in the
studied time period. An app can have multiple releases. In our
dataset, an app has 0.86 releases on average, i.e., most apps did not publish a new release during our studied time period.
4,073 apps published at least one release. Apps that published
at least one new release had an average of 2.28 releases.
A recent study by Martin et al. [15] notes that app stores do not provide access to all of their reviews. To ensure that we captured all the available reviews, we collected the reviews on a daily basis. A limitation of the Google Play Store is that only the 500 latest reviews per app are accessible; the crawler is unable to access any older reviews. This means that if an app receives more than 500 reviews within the 24-hour period between runs of our crawler, the crawler misses the reviews beyond the latest 500. As a result, we have a conservative estimate of the number of reviews for the 20 (0.19%) apps that received more than 500 reviews within a 24-hour period.
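To make this limitation concrete, the sketch below merges the daily snapshots produced by a crawler such as the one sketched earlier, de-duplicates reviews by their identifier, and flags apps that hit the 500-review ceiling on some day and may therefore be undercounted. The per-day directory layout and the review_id field name are our assumptions, not the paper's format.

```python
# Merge daily crawls, de-duplicate reviews by ID, and flag possible undercounting.
# The reviews/<day>/<app_id>.json layout and the review_id field are assumptions.
import json
from collections import defaultdict
from pathlib import Path

GOOGLE_PLAY_REVIEW_CEILING = 500      # only the 500 latest reviews per app are accessible

def merge_daily_snapshots(review_dir: str = "reviews"):
    reviews_by_app = defaultdict(dict)        # app_id -> {review_id: latest version seen}
    possibly_undercounted = set()             # apps that hit the ceiling on some day
    for day_dir in sorted(Path(review_dir).iterdir()):
        if not day_dir.is_dir():
            continue
        for path in day_dir.glob("*.json"):
            app_id = path.stem
            daily_reviews = json.loads(path.read_text())
            if len(daily_reviews) >= GOOGLE_PLAY_REVIEW_CEILING:
                possibly_undercounted.add(app_id)
            for review in daily_reviews:
                reviews_by_app[app_id][review["review_id"]] = review
    return reviews_by_app, possibly_undercounted
```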
IV. APPROACH
In this section, we present our approach to answer the
research question.
A. Manual analysis
Pagano and Maalej observed 17 topics of app reviews in the Apple App Store [5]. We manually labelled a statistically representative sample of 384 reviews from the top Android apps using the 17 topics that were observed by Pagano and Maalej. We also manually labelled a statistically representative sample of the responses among the 111,099 reviews with responses that occurred during the studied time period. Since Pagano and Maalej did not study the types of responses, we followed an iterative process to discover the types of responses until we could not find any additional types. The number of manually examined reviews and responses is the number required for a statistical sample with a confidence level of 95% and a confidence interval of 5%. In total, we spent approximately 8 hours manually analyzing and labelling the reviews and responses. The third co-author reviewed the labels for consistency. When the two co-authors disagreed, they came to a consensus, which occurred for very few reviews.
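For reference, a common way to arrive at such a sample size is Cochran's formula with the most conservative proportion p = 0.5; we assume this is how the figure of 384 was obtained:

```python
# Sample size for a 95% confidence level and a +/-5% confidence interval
# (Cochran's formula with the most conservative proportion p = 0.5).
z = 1.96   # z-score for a 95% confidence level
p = 0.5    # assumed proportion (worst case)
e = 0.05   # confidence interval half-width

n = (z ** 2) * p * (1 - p) / (e ** 2)
print(round(n))   # 384, matching the number of manually examined reviews
```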
B. Automated analysis
To complement our manual analysis of responses, we performed an additional automated analysis to calculate the average rating change for reviews with responses, the probability of a rating change, and the magnitude of the rating change.
We then separated the reviews into 25 automatically generated
topics (using LDA [16]) and examined which topics were most
likely to lead to a positive change in rating. Our choice of 25
topics is motivated by a desire for general topics that are broad
and that most developers would face. Moreover, Pagano and
Maalej observed 17 topics for app reviews [5]. We chose 25
to make sure that we observed at least their 17 topics.
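As a rough sketch of this automated step, the snippet below fits a 25-topic LDA model with scikit-learn; the vectorizer settings are illustrative choices of ours, not the paper's.

```python
# Fit a 25-topic LDA model over review texts (illustrative settings, not the paper's).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def fit_review_topics(review_texts, n_topics=25, top_words=10):
    vectorizer = CountVectorizer(stop_words="english", min_df=5)
    doc_term = vectorizer.fit_transform(review_texts)        # bag-of-words matrix
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topics = lda.fit_transform(doc_term)                 # per-review topic weights
    vocab = vectorizer.get_feature_names_out()
    topics = [[vocab[i] for i in component.argsort()[-top_words:][::-1]]
              for component in lda.components_]              # top words per topic
    return doc_topics, topics
```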
Fig. 2. Percentage of responses relative to the total number of reviews for each app, separated by the number of downloads. Apps with zero responses are excluded.

For 20 days, from April 7th to April 27th, we monitored whether reviews changed in their rating, comment, or response. We denote a review and all of its subsequent changes as a review chain. The median review-chain length is 2 (meaning one review and one response), and the maximum review-chain length is 8. We automatically analyzed 15,208 review chains in total.
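A review chain can be reconstructed from the daily snapshots by grouping the successive versions of each review, as sketched below; the review_id, rating, comment, and response field names are our assumptions about the crawled records.

```python
# Build review chains: the successive versions of each review across daily snapshots.
from collections import defaultdict

def build_review_chains(daily_snapshots):
    """daily_snapshots: chronologically ordered (day, reviews) pairs, where each
    review is a dict with review_id, rating, comment and an optional response."""
    chains = defaultdict(list)                    # review_id -> list of versions
    for day, reviews in daily_snapshots:
        for review in reviews:
            version = {"day": day,
                       "rating": review["rating"],
                       "comment": review["comment"],
                       "response": review.get("response")}
            versions = chains[review["review_id"]]
            last = versions[-1] if versions else None
            changed = (last is None or
                       (last["rating"], last["comment"], last["response"]) !=
                       (version["rating"], version["comment"], version["response"]))
            if changed:                           # only record actual changes
                versions.append(version)
    return chains
```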
V. RESULTS
A. Automated analysis
Most apps do not respond to reviews. Only 13.8% of 10,713
apps responded to reviews during the studied time period.
As Figure 2 shows, the apps with the largest numbers of downloads never respond, but some apps in the mid-range of downloads respond often. The apps with high response percentages have a low number of reviews (possibly indicating that apps with a large number of reviews are already overwhelmed by them).
Responses often lead to a positive change in review rating.
Looking at all review chains, we find that 38.7% of the
users changed their rating after a response. We also find that
the median change in rating was a positive increase of one
star (20% increase). This finding demonstrates that developers
can benefit from responding to reviews. However, since only
13.8% of the apps have responded to their reviews, developers
may not realize that responding to reviews has such a large
benefit (a positive median increment of one star). Moreover,
we find that the average star rating for the apps that do not
respond to reviews is only 1.7. Such a low star rating indicates that these apps could indeed benefit from the chance to increase their ratings by responding to reviews. Some users even updated their review
to notify the developer that the response had solved their
problem or that the user was thankful that the developer had
directly responded to them. We also find that most reviews
with responses are low-rated reviews with an average of 2.2
stars. This finding supports prior research, which targeted negative reviews (one- and two-star reviews [17–19]) as the reviews of greatest interest and concern to developers of mobile apps (over reviews with higher star ratings).
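Given review chains like the ones sketched in Section IV, the two headline numbers above reduce to a short computation; the sketch below uses the same assumed field names. On the study's data, the corresponding figures are the 38.7% of changed ratings and the one-star median increase reported above.

```python
# Share of responded-to reviews whose rating changed, and the median change.
from statistics import median

def rating_change_stats(chains):
    deltas, responded = [], 0
    for versions in chains.values():
        # index of the first version that carries a developer response, if any
        first_response = next((i for i, v in enumerate(versions) if v["response"]), None)
        if first_response is None:
            continue
        responded += 1
        delta = versions[-1]["rating"] - versions[first_response]["rating"]
        if delta != 0:
            deltas.append(delta)
    share_changed = len(deltas) / responded if responded else 0.0
    median_change = median(deltas) if deltas else 0
    return share_changed, median_change
```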
The most common review topic that received responses, at 8%, was about crashing. The issue of crashing is a serious one and has a great impact on the user experience. It is understandable that developers focus on these
reviews. We find that the chance of a rating change for
each review topic was distributed between 15% and 42%.
The two topics with highest chance of a rating change were
about notifications and not being able to connect to the app.
TABLE III
TYPES OF DEVELOPER RESPONSES TO REVIEWS IN OUR SAMPLE.

Response Type           | Description                                                                                                              | Amount
Instructions            | A developer provides assistance on how to use the feature or steps to fix the problem the user is having.                | 120
Request Contact         | A developer asks the user to contact the developer through a given email.                                                | 119
Thanks                  | A developer thanks the user for a positive review.                                                                       | 49
Next Release Fix        | A developer states that a fix for the user problem will appear in the next release.                                      | 31
Current Release Fix     | A developer states that the problem is already fixed in the current release. The user is asked to update their software. | 26
Next Release Feature    | A developer states that the requested user feature will appear in the next release.                                      | 12
Question                | A developer asks for clarification.                                                                                      | 9
Other                   | A response is in another language or does not address the user review.                                                   | 6
Rating-review mismatch  | A developer asks why the positive review does not match the negative rating or vice-versa.                               | 6
Current Release Feature | A developer states the current release contains the feature. The user is asked to update their software.                 | 3

TABLE II
TYPES OF REVIEWS THAT ARE WRITTEN BY USERS IN OUR SAMPLE.

Review Type  | Description                                                        | Amount
Praise       | A user states they are pleased with the app.                       | 129
Bug report   | A user states there is an error or unexpected behavior occurring.  | 126
Dispraise    | A user complains about the content or features of the app.         | 120
Request      | A user makes a request for a new feature or addition to the app.   | 89
Update issue | A user preferred the previous release of the app.                  | 36
Other app    | A user mentions a competitor to the app.                           | 10
Other        | Another language.                                                  | 10
These are specific issues that can be addressed by developers.
Users that wrote reviews concerning problems associated with
a specific Samsung phone were the most likely to change
their review following a response. The responses gave specific
advice about the phone.
The most common response topic, at 7%, was concerned
with developers notifying the user that a requested feature
is either in development or is planned. Once again, the
most common response topic is not the one associated with
the greatest rating change. The response topic that had the
highest chance of changing a rating (with 39.1% chance) was
a topic on notifying the user that the issue, about which the
user had originally complained, had been resolved. The users
may not have known that the issue was fixed and being told
personally resulted in a rating change.
B. Manual analysis
We next looked at the review and response types that occur
in our manually labelled data. In our manual analysis, more
than one type can occur in the same review or response. Any
review or response that does not conform to one of the types is
considered as ‘other’. The ‘other’ reviews and responses were
usually written in a language other than English or were not
written in coherent English.
We found only six of the 17 topics that were observed by Pagano and Maalej (shown in Table II). The six observed topics are praise, bug report, dispraise, request, update issue and other apps. We think the reason is that we focus on top free apps from the Google Play Store, while Pagano and Maalej focus on both paid and free apps in the Apple App Store. Some of
the topics, such as dissuasion, would only appear for paid apps
and some topics, such as howto, are more likely to appear for
non-top apps. In addition, we could not clearly differentiate
between a feature request and an improvement request, since
we are not domain experts of the apps. We combined both
topics as one topic, i.e., request. We observed that individual reviews often contained a mixture of praise and dispraise.
The most common issues were praise for the app, followed
by the reporting of bugs and dispraise about the content of
the app.
We also find ten common response types in developer
responses, as Table III shows. The ten types are instructions,
request contact, thanks, next release fix, current release fix,
next release feature, question, rating review mismatch, current
release feature and other. The most common response type is
instructions on how to solve the user problem, the second most
common being a canned request to email the developers. The
other types include reassurances that a problem is already fixed
or that the problem would be fixed in an upcoming release. The
same reassurances are provided to users who complained about
a lack of a feature. The last three types are thanking the user for leaving a kind review, asking why the negative rating of the review does not match the positive review, and asking for further information about the user's complaint.
Finally, we matched the reviews and responses in a table to show which response types occur most often with which review types. As Table IV shows, most responded-to reviews are praise, request, and bug report reviews. The majority of the
responses to these reviews are either direct instructions from
the developer on how to solve the problem or the developer
asking the user to contact them by email. We find that often
a user would leave praise for an app but then either have a
request or a minor problem to which the developer would
respond.
VI. THREATS TO VALIDITY
Some threats could potentially limit the validity of our
results. We now discuss such threats and how we control or
mitigate them.
Construct Validity. We investigate the issues raised in reviews using LDA. LDA may not produce optimal results when applied to short texts, such as tweets and app reviews. However, prior research has shown that LDA can successfully extract topics from tweets [20]. Since we manually labelled our dataset of reviews with the different issue types, some reviews may have been incorrectly labelled. To mitigate this threat, we performed the labelling in an iterative manner and went over each review multiple times to ensure correct labelling.
Internal Validity. Some reviews may be spam reviews. To deter spam in app reviews, Google requires users to log in with their Google ID before reviewing. We believe that the impact of spam reviews is likely minimal; nevertheless, future studies should evaluate the impact of spam reviews on recent research that mines reviews.
External Validity. The selection of the top apps
could bias our results. Given the large number of unsuccessful
and spam apps in the store, we feel that our study of top apps is
warranted instead of blindly studying all apps. We only studied
apps that are free. Paid apps may exhibit different reviewing
and developer response patterns in comparison to free apps.
However, many paid apps have free versions available to
download and there are considerably more free apps than paid
apps in the Google Play Store. Moreover, many free apps
have in-app purchase features. Such apps need to consider the
value of reviews for financial reasons. Future studies should
carefully tag apps based on whether they are truly free or not.
Unfortunately such information is not easily accessible in an
automated fashion.
VII. CONCLUSION
Most top apps do not respond to reviews; however, responding can lead to a positive change in rating. Addressing specific issues and notifying users that requested features are now available are the responses most likely to lead to a change in the review rating.
REFERENCES
[1] C. Iacob and R. Harrison, “Retrieving and analyzing mobile apps feature requests from online reviews,” in Proceedings of the Tenth International Workshop on Mining Software Repositories. IEEE Press, 2013, pp. 41–44.
[2] L. V. Galvis Carreño and K. Winbladh, “Analysis of user comments: an approach for software requirements evolution,” in Proceedings of the 2013 International Conference on Software Engineering, ser. ICSE ’13. Piscataway, NJ, USA: IEEE Press, 2013, pp. 582–591.
[3] B. Fu, J. Lin, L. Li, C. Faloutsos, J. Hong, and N. Sadeh, “Why people hate your app: Making sense of user feedback in a mobile app store,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’13. New York, NY, USA: ACM, 2013, pp. 1276–1284.
[4] N. Chen, J. Lin, S. C. H. Hoi, X. Xiao, and B. Zhang, “AR-Miner: Mining informative reviews for developers from mobile app marketplace,” in Proceedings of the 36th International Conference on Software Engineering, ser. ICSE 2014. New York, NY, USA: ACM, 2014, pp. 767–778.
[5] D. Pagano and W. Maalej, “User feedback in the appstore: An empirical study,” in Proceedings of the 21st IEEE International Requirements Engineering Conference. IEEE, 2013.
[6] M. Harman, Y. Jia, and Y. Zhang, “App store mining and analysis: MSR for app stores,” in Proceedings of the 9th Working Conference on Mining Software Repositories (MSR ’12), Zurich, Switzerland, 2-3 June 2012.
[7] H.-W. Kim, H. L. Lee, and J. E. Son, “An exploratory study on the determinants of smartphone app purchase,” in The 11th International DSI and the 16th APDSI Joint Meeting, Taipei, Taiwan, July 2011.
[8] S. M. Mudambi and D. Schuff, “What makes a helpful online review? A study of customer reviews on amazon.com,” MIS Quarterly, vol. 34, no. 1, pp. 185–200, 2010.
[9] E. Guzman and W. Maalej, “How do users like this feature? A fine grained sentiment analysis of app reviews,” in Proceedings of the 2014 IEEE 22nd International Requirements Engineering Conference (RE), Aug 2014, pp. 153–162.
[10] W. Maalej and H. Nabil, “Bug report, feature request, or simply praise? On automatically classifying app reviews,” in Proceedings of the 23rd IEEE International Requirements Engineering Conference, 2015, to appear. [Online]. Available: https://mobis.informatik.uni-hamburg.de/wp-content/uploads/2015/06/review_classification_preprint.pdf
[11] R. Chandy and H. Gu, “Identifying spam in the iOS App Store,” in Proceedings of the 2nd Joint WICOW/AIRWeb Workshop on Web Quality. ACM, 2012, pp. 56–59.
[12] H. Khalid, E. Shihab, M. Nagappan, and A. E. Hassan, “What do mobile app users complain about? A study on free iOS apps,” IEEE Software, vol. PP, no. 99, pp. 1–1, 2014.
[13] Vision Mobile, “Developer Economics Q1 2014: State of the Developer Nation,” Tech. Rep., May 2014.
[14] D. Pagano and B. Bruegge, “User involvement in software evolution practice: a case study,” in Proceedings of the 2013 International Conference on Software Engineering. IEEE Press, 2013, pp. 953–962.
[15] W. Martin, M. Harman, Y. Jia, F. Sarro, and Y. Zhang, “The app sampling problem for app store mining,” in Proceedings of the 12th Working Conference on Mining Software Repositories (MSR), Florence, Italy, 2015.

TABLE IV
COUPLING OF REVIEW TYPES WITH RESPONSE TYPES.

Review Type  | Instructions | Request Contact | Thanks | New Release Fix | New Release Feature | Current Release Fix | Other | Rating and Review Mismatch | Question | Total Responses
Praise       | 33 | 24 | 45 | 8  | 6  | 4  | 1 | 6 | 3 | 134
Request      | 35 | 31 | 3  | 7  | 10 | 4  | 2 | 0 | 1 | 130
Bug report   | 39 | 60 | 3  | 10 | 4  | 12 | 0 | 0 | 6 | 124
Dispraise    | 52 | 39 | 2  | 16 | 2  | 10 | 1 | 0 | 2 | 93
Update issue | 11 | 15 | 1  | 3  | 1  | 7  | 0 | 0 | 0 | 38
Other        | 3  | 4  | 1  | 0  | 0  | 0  | 2 | 0 | 0 | 10
Other apps   | 3  | 5  | 1  | 0  | 0  | 0  | 0 | 0 | 1 | 10
[16] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
[17] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts, “Learning word vectors for sentiment analysis,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, ser. HLT ’11. Stroudsburg, PA, USA: Association for Computational Linguistics, 2011, pp. 142–150.
[18] B. Pang and L. Lee, “A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts,” in Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2004, p. 271.
[19] H. Khalid, E. Shihab, M. Nagappan, and A. E. Hassan, “What do mobile app users complain about? A study on free iOS apps,” IEEE Software. IEEE Press, 2014.
[20] L. Hong and B. D. Davison, “Empirical study of topic modeling in Twitter,” in Proceedings of the First Workshop on Social Media Analytics, ser. SOMA ’10. New York, NY, USA: ACM, 2010, pp. 80–88.