Appcestry – A Tool for the Study of Mobile Application Similarities

Teng Hei Jason CHAO
chao@jasontc.net | https://jasontc.net
August 2018
This dissertation is submitted in part fulfilment of the requirements of the Centre for
Interdisciplinary Methodologies of the University of Warwick for the degree of Master of Science in
Big Data and Digital Futures.
Abstract
“Appcestry”, a portmanteau of “app” and “ancestry”, is a tool presented in this dissertation
project for the study of similarities between software applications (apps) targeting the Android operating
system. The tool extracts features from Android Application PacKages (APK) and transforms them
into a format which this project names “AppGenes”. AppGenes may be used by Appcestry and
other machine learning tools to discover similar Android applications.
Keywords: mobile, android, application (app), similarity, machine learning, digital method
Contents
Abstract
Background – What and Why
  What is Appcestry?
  Why a new tool for mobile media?
  Why Android applications?
Software, Service and Platform
  Delivery of software
    Desktop applications
    Mobile applications
  Connectivity required
  Platformisation
  Reuse of software
  Previous studies on similarities of Android applications
Android Applications
  What is Android application?
  Types of Android applications, from software developers’ perspective
  Extraction of Features from Objects in Android Applications
    Software code
    Namespaces
    Android permissions
    XML files
    Images
    All other files
    AppGene format
Comparing AppGenes for Similarity
  1 to 1 comparison
  Usage for general users
  Usage for advanced users
  Clustering
Design and Architecture of Appcestry
Potential Use Cases
  Use Case 1: University apps
  Use Case 2: Quit-Pornography apps
Further development
Conclusion
References
Background – What and Why
What is Appcestry?
This dissertation project presents Appcestry, a tool for studying similarities between
Android applications. In a nutshell, Appcestry provides a web-based interface for general users and
an Application Programming Interface (API) for programmers to:
1. Extract features from an Android Application PacKage (APK) and export them in a
format which I name AppGenes;
2. Compare the similarities of AppGenes; and
3. In cooperation with machine learning tools, discover clusters of Android applications
through AppGene similarity.
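The first two steps can be illustrated with a minimal sketch. It is not Appcestry's implementation and does not reproduce the AppGene format. It relies only on the fact that an APK is a ZIP archive, uses the set of SHA-256 digests of the archived files as a stand-in feature set, and compares two such sets with the Jaccard index. The names `extract_file_digests`, `jaccard_similarity` and `make_toy_apk` are hypothetical, and the toy archives stand in for real APK files.

```python
import hashlib
import io
import zipfile


def extract_file_digests(apk_bytes: bytes) -> set[str]:
    """Return the SHA-256 digest of every file stored in an APK (a ZIP archive)."""
    digests = set()
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories carry no content to hash
            digests.add(hashlib.sha256(archive.read(info)).hexdigest())
    return digests


def jaccard_similarity(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union (1.0 for two empty sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def make_toy_apk(files: dict[str, bytes]) -> bytes:
    """Build an in-memory ZIP archive standing in for a real APK (illustration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


if __name__ == "__main__":
    # Two toy "apps" that share an icon and a manifest but differ in code.
    app_a = make_toy_apk({
        "classes.dex": b"code-1",
        "res/icon.png": b"icon",
        "AndroidManifest.xml": b"<a/>",
    })
    app_b = make_toy_apk({
        "classes.dex": b"code-2",
        "res/icon.png": b"icon",
        "AndroidManifest.xml": b"<a/>",
    })
    score = jaccard_similarity(extract_file_digests(app_a), extract_file_digests(app_b))
    print(f"similarity: {score:.2f}")
```

Exact file digests only detect byte-identical files. The third step, clustering, would take pairwise scores of this kind as input to a machine learning tool and is not shown here.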
Due to the length of the source code of Appcestry, the code is not included in the appendix of this
dissertation. The source code is accessible on GitHub at https://github.com/jason-chao/appcestry.
A working service of Appcestry is accessible at http://appcestry.jasontc.net/
Why a new tool for mobile media?
The proliferation of mobile devices has attracted interest from different disciplines in the study of
mobile media. In the social sciences, Campbell (2013) advocated the creation of the field of “mobile
communication studies”. Campbell suggested expanding the scope of study from the
conventional notion of the “cell phone” to all network-capable mobile devices. Campbell
emphasised the effect of the affordability of mobile media, especially in developing countries
where building the infrastructure to run a mobile network service is more affordable than “other
wireless and fixed media”. In Campbell’s conclusion, the study of mobile communication should be
considered a field “integrally connected to the study of media and communication more broadly”.
In the business world, mobile media have received considerable attention for their potential
to improve brand images and customer relationships (Ford, 2017). Businesses, in particular start-up
companies, are adopting the “mobile first strategy” to engage with web users (Wilson, 2018).
Wroblewski (2011) in his book Mobile First advised web designers to be prepared for a transition of
the web from desktop browsers to those of mobile devices. Later, the term “mobile first” entered
the business world to describe businesses, products and services that are designed to primarily
interact with their customers through mobile media devices (Gazdecki, 2018).
Studying the effects of mobile media helps us understand society. Researchers have found
associations between some socio-economic characteristics and the use of mobile media. Chen
(2015) found that mobile media tended to mobilise cultural participation in the United States (U.S.).
When mobile media are taken into account, people from low education backgrounds are more active in
cultural participation than those from high education backgrounds. Chen suggested that mobile media
could be one means of narrowing the urban-rural divide in cultural participation.
Furthermore, Kabali, Irigoyen, Nunez-Davis, Budacki, Mohanty, Leister and Bonner (2015) reported
the phenomenon of early adoption of mobile devices by children from low-income, urban and
minority backgrounds.
To study digital media in relation to society, there is a need to develop new methods and new tools.
Rogers (2013) proposed that researchers should apply the “method of the medium” to study society
in light of social and technological changes. The distinct characteristics of each medium
require us to apply medium-specific methods when conducting research. Rogers also hinted at the
obsolescence of the idea of “the virtual” world. For example, the applicability of “web-native” tools
should not be confined to the study of “online culture” but should extend to the greater social
world. The deep integration of digital mobile media into our lives suggests that tools designed to
study mobile media may also be used to study society.
According to Lury and Wakeford (2012), research methods need to be inventive to become capable
of helping researchers investigate “the open-endedness of the social world” which is “ongoing,
relational, contingent and sensuous”. They invite us to reflect on whether
methods established in the past are able to answer the questions that researchers now face.
They suggested that the design of a research method should be specific to the problem. Under the
slogan “we are all social researchers now”, Marres (2012) discussed the impact of the availability of
“social research techniques” to “a variety of social actors”. She saw that social methods may be
used by diverse actors and may serve a variety of purposes.
Appcestry is not positioned to test a particular hypothesis or answer a specific question; rather, it is
designed as a tool for researchers to “interrogate” the truth. In their discussions of machine learning,
Lury and Wakeford (2012) and Parisi (2012) suggested that knowledge might be built by
formulating and testing speculative hypotheses to approximate the truth step by step, just like an
“interrogation”. Appcestry facilitates knowledge discovery through abductive inference
rather than causal (inductive and deductive) inference. Similarities of mobile applications may help
people understand more about the world through “interrogations”. Although Appcestry focuses on
the technical features of Android applications, I position Appcestry as a tool to create opportunities
for the development of methods to study the social world from data about mobile media.
Why Android applications?
Android is an open source operating system developed mainly by Google (Android Open Source
Project, 2018). Google makes software development kit (SDK) tools, an integrated development
environment (IDE) and documentation freely available to developers building applications for
Android (Google, 2016).
Devices running Android are ubiquitous. Reported figures for Android’s global share of the market for
mobile operating systems vary: 85.9% as of Q1 2018 (Statista, 2018), 77% as of June 2018
(Sterling, 2018) and 77.32% as of July 2018 (Statcounter, 2018). Although the reports are not
consistent, even the lowest figure lets us say with high confidence that Android’s share is at least
three quarters of the global market. The dominance of the Android platform justifies efforts to study
the ecology of the platform.
Software, Service and Platform
Delivery of software
Desktop applications
The development of digital technologies and platforms went hand-in-hand with capitalism (Srnicek,
2017). The notion of software has been changing since the conception of the term in 1958 as a
reference to “mathematical and logical instructions for electronic calculators” (Fuller, 2008). Early
computer manufacturers bundled hardware with software (Fuller, 2008). Separately sold software
did not appear in the market before International Business Machines (IBM) announced that it would
unbundle hardware and software, a result of lessons learnt from an earlier anti-competition
lawsuit brought by the U.S. government over IBM’s monopoly of the computer market (Lowood,
2001). Except for academics, there were no independent software developers before 1969
(Lowood, 2001). IBM’s unbundling led to the growth of the software development industry
(Lowood, 2001). Indeed, software was not considered intellectual property in the U.S. until 1980
(Craig, 2005). In his famous “An Open Letter to Hobbyists”, Microsoft co-founder Gates (1976)
complained that computer hobbyists had been making copies of his software Altair BASIC
without paying royalties to its developers.
The growth of the Internet accelerated the move from the purchase of software to subscription-based
access to software. The idea of software as a standalone commodity has faded since the 2000s
(Kaldrack & Leeker, 2015). The business model known as Software as a Service or “SaaS” enables
users to consume software as a service on smaller, more energy-efficient and sometimes cheaper
devices as long as the devices are Internet-enabled (Kaldrack & Leeker, 2015). Evans, Hagiu and
Schmalensee (2008) discussed the significance of software platforms as the centroid of
economies. Products and services are increasingly dependent on the software platforms that
connect the parties on the multiple sides of economic activities.
Mobile applications
Before Android and Apple’s iOS came into existence, there had been a class of mobile devices
popular during the late 1990s to early 2000s called “personal digital assistant” (PDA) (Martin, 2004).
For the purpose of this text, I consider the PDA a predecessor of the class of devices that we now call
“smartphones”. In today’s context, a PDA can be considered a “smartphone” without “a phone”.
Palm and Windows Mobile were the mainstream mobile operating systems of PDAs (Martin, 2004).
Evans, Hagiu and Schmalensee (2008) pointed out that, like the makers of early electronic computers,
makers of early handheld devices intended to bundle as many features as possible with their devices.
However, they suggested that poorly implemented bundling was one of the causes of failure, in
particular for Apple’s Newton. They found that a key factor in the success of Palm PDAs in the late
1990s and early 2000s was the decoupling of hardware and software. Palm started to license
other device manufacturers to make Palm devices in 1997. Palm also provided developers with free
tools to write applications for Palm devices. Eventually, Palm evolved to become a software
platform for users, application developers and hardware builders (Evans, Hagiu and Schmalensee,
2008).
According to Rhodes and McKeehan (2002) and Ilyas and Ahson (2006), PDAs running Palm and Windows
Mobile provided only essential personal information management (PIM) functionalities, such as a
calendar and an address book. To get more out of the device, users needed to install
applications to expand its capabilities. Users could download files from the web on desktop
computers and then “synchronise” or copy them onto Palm and Windows Mobile devices. The
experience of downloading and installing a mobile application was not that different from installing a
desktop application. First, users had to download an installation file to the desktop computer. Then,
they opened the installation file on the desktop computer or, in the case of Windows Mobile,
transferred the installation file to storage attached to the Windows Mobile device and opened it there.
However, the freedom to install software on mobile devices started to fade. When Apple
launched iOS, the mobile operating system for iPhones, in 2007, it initially did not allow third
party applications to be installed on iOS devices and encouraged third party developers to use web
applications to target iOS users (9to5mac, 2011). Only after protest did Apple officially open an
“App Store” for third party developers to publish their applications in 2008. iOS is a closed
source, proprietary operating system. For iOS devices, the App Store is the only distribution channel
for mobile applications.
On the other hand, Android is open source. Developing and deploying derivatives of Android is
technically possible. “Google Play” (originally named “Android Market”) is Google’s official store for
Android applications, though users can install applications without going through Google Play. Users
can download APK files from any market or website and open them locally on Android devices to install
the applications. However, Android warns that the application is from an “unknown source”
and requires users to explicitly allow the installation of applications not downloaded from Google Play
(Gilbert, Chun, Cox & Jung, 2011).
Android applications may be installed on an Android device in two ways:
1. Download and install from a market
Google Play (formerly called “Android Market”) is the most popular market as it is pre-installed
on devices. Other markets exist, such as F-Droid, a market dedicated to open
source applications. Google Services Framework (GSF) is a proprietary software package
preinstalled on Google-licensed devices. Google Play, Google Maps and Google’s non-free
APIs require GSF. GSF can be seen as what distinguishes paying from non-paying Android
devices. Device makers such as Xiaomi and Amazon do not support Google services but still
rely on the Android operating system (Hill, 2016).
2. Download and install an APK file
Users may also download APKs from the web and install them on their devices without going
through any of the markets. However, by default, Android allows installation only from trusted sources.
Users have to manually set Android to allow the installation of applications from “unknown
sources” (Graziano, 2013).
Now, both the iOS and Android platforms require or prefer users to obtain applications through
channels sanctioned by Apple or Google. The rise in the adoption of devices running iOS and
Android and the popularity of mobile applications gave the parties in control of the distribution
channels, Apple and Google, enormous power to exert influence over consumer
choices. From the early 2010s, the number of mobile applications boomed and, at the same time,
consolidated the influence of Apple’s App Store and Android’s Google Play in mobile media (Zhong &
Michahelles, 2013). The App Store and Google Play reduced the whole experience of obtaining an
application to just a couple of taps. Drawing on Gerlitz’s use of the word “grammar” to describe user
habits shaped by application usage, the experience of downloading and installing a mobile application
in a single tap can be seen as a new “grammar” for obtaining mobile media. The ease of use of the
centralised App Store and Google Play may have boosted the adoption and consumption of mobile apps.
From the discussion above, we see that software for desktop computers and software for mobile devices
have taken parallel turns in their histories. Both desktop and mobile applications
were first unbundled from hardware to become standalone objects. Later, the applications
developed a tendency to be delivered through external services. The delivery of mobile applications
has become increasingly centralised.
Connectivity required
Traditionally, applications for desktop computers are assumed to work offline, without the need for an
active Internet connection. Desktop applications often carry the connotation of a self-contained
programme that should work locally on a computer, although more and more desktop applications
require a connection to an external service to work (Lynch, 2017). Furthermore, the default security
settings of modern operating systems would not allow a desktop application to reach out to an
external server. For example, Windows requires elevated privileges when a user attempts to install
or run a desktop application that reaches out to a server on the Internet (Microsoft, 2004).
By contrast, mobile applications do not carry the offline connotation. It does not seem unusual for
mobile applications to alert users “no internet connection” and refuse to work. Also, Android no
longer prompts a user to allow a mobile application to connect to the Internet (Toombs, 2015).
There has been a decline in desktop applications that require users to install them locally before use.
Since the 2000s, software makers have been moving to web applications that do not require users
to install extra software packages on their local computers. In the early 2000s, the rise of web
browsers and the introduction of web standards that enable rich user interaction suggested a trend
of phasing out desktop applications in favour of web applications (Anderson & Wolff, 2010).
Anderson and Wolff suggested a “post-web” future in light of the emergence of iPhone/iPad devices.
They also presciently predicted that users would eventually choose to move away from openness
in favour of a well-maintained closure. “Much as we love freedom and choice, we also love things
that just work, reliably and seamlessly”, they said.
The introduction of the web technologies of Asynchronous JavaScript And XML (AJAX), CSS and
HTML5 helped close the gap between desktop and web applications in terms of technical capabilities
and user experience. According to Garrett (2005), in the mid-2000s, the emergence of AJAX
transformed the architecture of web applications. Prior to AJAX, most user interactions required the
browser to reload a web page, which involved retrieving data, processing it and generating HTML code.
The introduction of the XMLHttpRequest component in web browsers enabled the JavaScript code
embedded in web pages to send and receive data through HTTP requests. Separating the
presentation logic from the data processing logic not only reduced the need for page reloads but also
was in line with the design principle of “separation of concerns” upheld by experienced software
engineers (Laplante, 2007). A web service (HTTP-based interface) is seen as a preferable means of
communication between clients and servers (World Wide Web Consortium, 2007).
The practice of serving frontends by providing an HTTP-based backend interface carried on to mobile
applications. Cloud service providers took web services a step further by encouraging software
developers to adopt a model in which web applications and mobile applications share the same web
service (Lavigne, 2017). Some service providers even abstracted the essential services needed for
the development of mobile applications, such as push notifications, serverless functions and data
storage, into what is collectively known as “Mobile Backend as a Service” (MBaaS).
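The shared-backend model can be sketched with the Python standard library alone. The example below is an illustration under stated assumptions, not code from any cited provider: a single HTTP endpoint returns data as JSON, and two hypothetical frontends (`render_web` and `render_mobile`) consume the same response while keeping their own presentation logic. The `/profile` path and the sample payload are invented for the demonstration.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class ProfileHandler(BaseHTTPRequestHandler):
    """Backend: returns data only (JSON), with no presentation markup."""

    def do_GET(self):
        body = json.dumps({"name": "Ada", "unread": 3}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_profile(base_url: str) -> dict:
    """Data access shared by both frontends; proxies are disabled for the local demo."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(f"{base_url}/profile") as response:
        return json.load(response)


def render_web(profile: dict) -> str:
    """Presentation logic of a hypothetical web frontend."""
    return f"<p>{profile['name']} ({profile['unread']} unread)</p>"


def render_mobile(profile: dict) -> str:
    """Presentation logic of a hypothetical mobile frontend."""
    return f"{profile['name']}: {profile['unread']} new"


def run_demo() -> tuple[str, str]:
    """Start the backend on a free local port and render its data twice."""
    server = HTTPServer(("127.0.0.1", 0), ProfileHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        base_url = f"http://127.0.0.1:{server.server_port}"
        profile = fetch_profile(base_url)
        return render_web(profile), render_mobile(profile)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    for line in run_demo():
        print(line)
```

Because the backend returns data rather than rendered pages, adding a further client only requires new presentation code. The same property makes every client dependent on the availability of the remote service.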
Mobile applications are more likely not to be “self-contained” but rather to behave much like web-based
applications, given their reliance on web-based backend services and, of course, an internet
connection to work. Indeed, users have to cede control to the operators of the external services
when using these applications.
Platformisation
The way that mobile media work and are delivered to users shows signs of re-centralisation. Such
centralisation mirrors that of the web. Insights from the centralisation of the web might be helpful
in the study of mobile media.
In the mid-2000s, the web was no longer seen as merely a medium to publish information but also a
platform to build applications on (Helmond, 2015). Gerlitz and Helmond (2013) argued that social
media sites transformed themselves into social media platforms when they launched social media
plugins and APIs for integration into the larger web. The social plugins embedded in third party
websites changed the dynamics of the data flow on the web. Now the social media platforms
capture the users’ interactions on the web. Gerlitz and Helmond (2013) called the phenomenon the
recentralisation of the web because the data in the possession of the platforms gives them the power to
wield influence over the political economy. With the emergence of platforms, capitalism restructured
itself around technologies in which “extracting and controlling data” becomes a key business
model (Srnicek, 2017).
Based on the web’s development, scholars also started to investigate the political economy of
mobile media. Gerlitz et al. (2016) studied the relations between mobile applications and social
media platforms. They found that the Application Programming Interfaces (API) exposed by
Facebook and Twitter to mobile applications are forming a “support ecology” for mobile
applications. Helmond et al. (2018a) proposed a pyramid model to study the data infrastructure of
mobile applications to understand the “technological and economic phantasy and to create a critical
stance towards it”. The first (bottom) layer is a list of tracker libraries that an app links to. The
second (middle) layer is the routing information of the data coming in and out of an app. The third
(top) layer is the content and parameters that an app sends to a server. Also, the patterns of
trackers differ across different categories of mobile applications. In addition to the cloud service
providers Amazon, Google and Cloudflare, Facebook, a social media platform, was found to be an
important destination of mobile applications’ data. They identified the “black-boxing” of apps as an
obstacle to the study of mobile applications and the infrastructure beneath them.
Android is a multisided platform that brings users of mobile devices, businesses, software
developers and device manufacturers together (Srnicek, 2017). Businesses and software developers
who use the Android platform to publish software applications to the users are also players in the
ecology. Understanding the parties behind the technical infrastructure supporting a mobile
application would also reveal power relationships within the ecosystem of mobile applications.
Given the high market share of Android, studying the similarity between mobile applications will
enable more investigations into the ecologies of the platforms.
Figure 1. The app pyramid model, with the drawing adapted from the study by Helmond et al.
(2018). I propose that the means of studying mobile applications enabled by Appcestry sits on the first
level of the model.
Reuse of software
Reuse of software, of course, is a contributor to the similarities of applications. Erickson and Kelty
(2015) drew on ideas from evolution to describe the characteristics of software and services. They
suggested that software studies view software from an evolutionary perspective: the ancestry of
software. They proposed the application of evolutionary theory in the analysis of software to study
“the values, ideologies or cultural technologies at work”. They pointed out that the evolution of
software is not just a matter of change but of how “aspects of the past are preserved differentially in
different ecologies”. They borrowed two concepts from developmental-evolution theory to study
software: “generative entrenchment” and “scaffolding”. They saw software stacks in software
development as an inheritance rather than as the “layers” usually illustrated in technical
documents. Erickson and Kelty recognised the lack of vocabulary to describe the diversity
of software. Words usually used to describe software, such as “outdated”, “obsolete”, “cutting
edge” and “new”, favour linear progress. They suggested a new path to study software “beyond old
and new” – identification of “distinct population”. The distances between pieces of software reveal
how “entrenched” and “generative” software applications are in a software ecology. Also, they
offered the insightful suggestion that studying the ancestry of software in an evolutionary sense, in
lieu of mere novelty, would allow us to understand the software’s “real effects”.
In line with Erickson and Kelty’s use of terminology from evolution theory in the study of software
ecologies, the software programme developed for this dissertation project is named
“Appcestry”, a portmanteau of “app” and “ancestry”. Likewise, the features extracted from an
application to be used in the comparison or formation of clusters are named “AppGenes”, a
portmanteau of “app” and “genes”.
The practice of reuse of software code dates back to the dawn of programming (Frakes and Kang,
2005). Researchers approached the matter of code reuse from different perspectives. Code reuse
in general seems to carry a negative connotation in the education of computer programming. Tools
for analysis of code reuse were built around the use case of detecting plagiarism in student
assignments, notably JPlag (Malpohl, 2005), MOSS (Aiken, 2010) and Plaggie (Ahtiainen, Surakka and
Rahikainen, 2006). On the contrary, in the industry and the open source community, the attitude
towards code reuse is much more welcoming. On the management level, Frakes and Terry (1996)
advised companies to systematically apply “the most effective reuse strategies” to “improve
productivity and quality” in software development. On the coding level, Hunt and Thomas (1999)
warned programmers of “the evil of duplication”. They advised programmers to adopt a principle
called “Don’t Repeat Yourself” (DRY). Programmers should be watchful of whether data types and
functions are being repeated in their code. For example, the authors recalled the story of a U.S.
state government’s discovery of more than ten thousand versions of code performing the same
function, the validation of US Social Security numbers, across its computer systems in use. They
saw that such duplication would “lead to maintenance problems”. Hunt and Thomas recommended
that programmers make their code adaptable to change and easy for others to reuse.
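A small sketch may make the DRY principle concrete. It is an illustration written for this text, not code from Hunt and Thomas: the rule is defined once in `has_ssn_layout` and reused by two hypothetical callers, so a change to the rule needs to be made in one place only. The pattern checks only the digit layout “AAA-GG-SSSS”, which is an assumption made for the example and not a complete validity check for Social Security numbers.

```python
import re

# Single authoritative definition of the rule. The pattern only checks the
# "AAA-GG-SSSS" digit layout and is an assumption made for illustration;
# it is not a complete validity check.
_SSN_LAYOUT = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")


def has_ssn_layout(value: str) -> bool:
    """Return True if value consists of ASCII digits laid out as AAA-GG-SSSS."""
    return _SSN_LAYOUT.fullmatch(value) is not None


def register_taxpayer(ssn: str) -> str:
    """Hypothetical caller 1: reuses the shared check instead of copying it."""
    if not has_ssn_layout(ssn):
        raise ValueError("malformed SSN")
    return f"registered {ssn[-4:]}"


def issue_benefit(ssn: str) -> str:
    """Hypothetical caller 2: the same rule, still defined in one place only."""
    if not has_ssn_layout(ssn):
        raise ValueError("malformed SSN")
    return f"benefit issued to {ssn[-4:]}"


if __name__ == "__main__":
    print(register_taxpayer("123-45-6789"))
    print(issue_benefit("123-45-6789"))
```

Had each caller carried its own copy of the pattern, the copies could drift apart over time, which is the maintenance problem the authors describe.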
Louden and Lambert (2011) pointed out that a good implementation of abstractions in coding is
conducive to the reusability of code. They also suggested that making abstract data types and
procedural controls easy for humans to understand and to write is one of the key features of
modern programming languages.
In the world of open source software, Mockus (2007) found that 50% of files in open source projects
had been reused in more than one project. They argued against the belief of “proliferation of code
copying” as “a bad practice”. They saw that the contribution of the practice of code reuse to “the
evolutionary development of the open source software” was significant. In addition, Haefliger, Von
Krogh and Spaeth (2008) recognised that code reuse helps bring down software costs and integrate
functions quickly.
Raymond and Enterprises (1997) said that an important trait of a “great” programmer is
“constructive laziness” – a mindset to deliver results with less effort. They gave Linus Torvalds,
the programmer who started the kernel of the Linux operating system, as an example. Torvalds did
not write Linux from scratch; rather, he reused code from Minix as scaffolding for the Linux project.
Sojer and Henkel (2010) found that rapid development of software is associated with code reuse.
They also found that programmers who loved to “tackle difficult technical challenges” themselves
were less likely to reuse code. However, they suggested that the obsession with technical challenges
would be detrimental to innovation.
It is clear that real-world software developers see the practice of software reuse as desirable. It is
reasonable to expect that code reuse is also widely practised in the development of mobile
applications.
Previous studies on similarities of Android applications
There are a number of studies on applying machine learning methods to the study of the similarity of
Android applications. However, the focuses of the studies were on malware detection, plagiarism
detection and security analysis.
Gonzalez, Stakhanova and Ghorbani (2014) developed a tool called Droidkin for
“assessing the similarity of Android applications” at the code-level for the detection of malware code
within applications. However, I found that Droidkin’s repository on GitHub is effectively empty.
Chen, Hoi, Li and Xiao (2015) and Huang, Zhu, Liu and Wu (2013) developed tools to analyse APKs’ code
similarity to detect plagiarised or repackaged Android applications. Chen, Lin, Hoi, Xiao and Zhang
(2014) used Android applications’ titles, descriptions and screenshots available on application
markets to find applications similar to a given application, as an alternative to the “similar apps” suggested
by the markets.
Shabtai, Fledel and Elovici (2010), W. Zhou, Y. Zhou, Grace, Jiang and Zou (2013) and Lindorfer,
Neugschwandtner and Platzer (2015) developed programmes to extract a range of features from
APKs, including code, XML attributes, and the names and numbers of files, to classify malicious
applications.
Despite the claims in the literature that a number of “tools” have been developed to study Android
application similarity, I failed to find their source code. So, there is no chance for me to repurpose
or fork an existing tool to enable people to compare mobile applications.
Android Applications
What is an Android application?
From a technical perspective, an Android application (APK) is essentially a file in the ZIP format, in
which multiple files can be compressed and archived into a single one (Google, 2018a). APK files can
be recognised correctly as ZIP files by archive programmes after renaming the extension “.apk” to
“.zip”. For the study of the contents in an APK, however, this decompression method would make
some of the files unreadable because of encoding issues (Tumbleson, 2018). To resolve the
encoding issue, we may use APKTool to decompress APK files and decode the files at the same time
(Tumbleson, 2018).
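The claim that an APK is an ordinary ZIP archive can be checked with a few lines of Python. The following is a minimal sketch using only the standard library; the function names are hypothetical and are not part of Appcestry, and the plain-text check is a rough heuristic assumed here for illustration.

```python
import zipfile


def list_apk_entries(apk_path):
    """Return the names of all entries in an APK, read as a ZIP archive."""
    with zipfile.ZipFile(apk_path) as archive:
        return archive.namelist()


def manifest_is_plain_text(apk_path):
    """Rough check: does the raw AndroidManifest.xml start like textual XML?"""
    with zipfile.ZipFile(apk_path) as archive:
        raw = archive.read("AndroidManifest.xml")
    # A plain-text XML file would start with "<"; the encoded manifest
    # inside a packaged APK normally does not.
    return raw.lstrip().startswith(b"<")
```

No renaming to “.zip” is needed for this check, because the ZIP reader looks at the file's contents rather than its extension.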
Figure 2. Demonstration: I used Linux commands to extract contents from an APK after renaming
the APK file to a ZIP file. Some of the files extracted from the ZIP file became unreadable because
of an encoding issue. For example, the content of AndroidManifest.xml (supposedly a text file) could
not be read correctly as plaintext.
Figure 3. Demonstration: I used APKTool to extract and decode files from an APK. The content of
AndroidManifest.xml could now be read in plaintext.
As shown in figure 3, an APK, as extracted using APKTool, contains the following files and directories:
AndroidManifest.xml: This file contains the application’s ID, the application version, a list of
permissions and screen rotation settings defined by Google (2018b).
smali: This directory contains the code (Dalvik bytecode) of the application,
disassembled into a language called Smali.
res and assets: These directories contain multimedia and setting files that the application
may use.
As its name implies, an APK file is a package that contains many objects. To develop a tool to
discover Android application similarities, we have to consider the ontological characteristics of each
type of object in this package. An APK breaks down into a wide range of file types, for example,
Dalvik bytecode, images and XML configurations, which contain programme instructions, pictures and
data attributes respectively. Indeed, an Android application contains a variety of digital objects that
require different ways to identify similarities. Narrowing down to one type of object or blending
them together would incur a loss of knowledge. In light of Rogers’s advice to “follow the medium”,
in the case of APKs, we need to follow the ontology of each type of media in the medium. In other
words, technically speaking, I am breaking down the problem of comparing APKs into the problems
of comparing code similarity, image similarity and the intersections of identical files and data
attributes.
Types of Android applications, from software developers’ perspective
In terms of approaches to developing mobile applications, there are three types of applications that
developers may consider as follows (Craigmile, 2015; Budiu, 2013; Dua, 2018). The choice of
approach is important for developers who wish to publish an application to more than one platform.
Native application: Developers use the tools preferred by Google to write code to invoke
Android APIs directly. For the avoidance of doubt, the word “native” here does not mean
“machine code” as it usually does. It refers to mobile applications developed using
methods recommended by the platforms, such as writing in Objective-C or Swift for iOS
applications (Apple, 2018) and Java for Android applications (Google, 2018e).
“Native” applications have better performance and provide better support for the platform.
However, the disadvantages are a lack of code portability and a steep learning curve.
Applications written as “native” applications are usually exclusive to one platform. If a
developer wishes to make the application work on another platform, for example iOS, he or
she needs to develop an entirely separate iOS application.
Hybrid application: Developers use tools that allow some degree of code reuse between
the iOS and Android versions of the same application. Tools such as Appcelerator
Titanium, Xamarin, Ionic and React Native function as a middleware (or an interface)
between Android and intermediate code (Jscrambler, 2017). The middleware from these
tools works as a runtime on Android and iOS to interpret and execute developers’
non-native code. Writing a hybrid application reduces the time needed for publishing a mobile
application to multiple platforms. However, the overhead of the middleware usually makes hybrid
applications run slower than native applications.
Web-view based application: Developers use tools that provide a web browser component,
usually called “web view”, as a wrapper around web pages (Google, 2018c; 2018o).
Developers can write mobile applications as if they were writing web pages using HTML,
CSS and JavaScript. So, web developers may easily train themselves to become mobile
application developers. Web-view applications provide the best cross-platform compatibility
and maximise the ratio of code sharing, though the performance of this type of application
is usually the worst of the three.
The native, hybrid and web-view types form a spectrum rather than rigid categories (Morony, 2015).
Developers choose the right approach based on considerations of development time, cost,
cross-platform needs and performance expectations (Morony, 2015).
For the purpose of this dissertation project, Appcestry is designed to help discover similar native
applications. Although Appcestry may also enable the detection of similar hybrid and web-view
applications, due to the existence of multiple middleware frameworks with varying complexities, in
the interest of the time allowed for this project, hybrid and web-view applications are not in the
scope of Appcestry at this moment.
Extraction of Features from Objects in Android Applications
Software code
Types of software clones
Programme code, as a medium for instructions to machines, is an indispensable part of a software
application. Methods were developed to detect code similarity, code plagiarism or “software clone”
in general. Roy, Cordy and Koschke (2009) proposed four types of software cloning:
Type-1: Using identical code, but with different amounts of whitespace
Type-2: Using syntactically, or structurally, the same code but with the names for variables
and functions altered
Type-3: Using the same code but with some statements added or removed
Type-4: Rewriting entirely different code to perform the same function
Based on the discussion above on the practice of software reuse in the industry, I assume that
developers would recycle code from previous projects or open source projects. Also, to make the
copied code more understandable in a new project, developers may change the names and values
for variables and functions to adapt to the needs of new requirements. Developers may also make
tiny changes to the copied code to suit the logic of the new project. Therefore, Appcestry should be
able to detect similar programmes of Type-1, Type-2 and Type-3.
For Type-4, the effort to reimplement an existing function would defeat the purpose of software
reuse, which is to save costs. Also, Type-4 clones are syntactically different but functionally identical, which
cannot be told by code analysis alone. Detection of Type-4 may involve the examination of inputs
and outputs of functions allegedly cloned. Therefore, similar applications of Type-4 are not in the
scope of Appcestry.
Android Architecture
Knowledge of Android’s architecture is instrumental in deciding on the best way to extract features
from the code in Android applications. The Android operating system supports two families of
processor architectures: x86 and ARM (previously the acronym for “Advanced RISC Machine”)
(Google, 2018d). x86 is an architecture developed and adopted by Intel and AMD. The processors
in most desktop, laptop and server computers are of the x86 architecture. ARM is an architecture
developed by Arm Holdings and is licensed to a number of processor manufacturers (Stevenson,
2018). Processors based on the ARM architecture are at the core of the majority of mobile devices.
Qualcomm’s Snapdragon series, Samsung’s Exynos and Apple’s A series are based on ARM (Kelion,
2016). Generally speaking, a software programme built for x86 is not compatible with ARM and vice
versa. Usually, if a software developer wants a programme to work on both x86 and ARM
processors, the developer must compile the source code into two versions of executable files which
contain architecture-specific machine instructions called “native code” or “machine code” (Sims,
2014).
However, developers of Android applications are safe to be ignorant of the architecture differences
because Android has adopted the model of “bytecode”. Bytecode is processor architecture-independent.
Source code compiled into bytecode will run regardless of the underlying processor
architecture, provided that a “runtime environment”, also called “virtual machine”, is available for
the operating system and processor architecture in question. For Android, Google implemented a
runtime environment called “Dalvik” to execute “Dalvik bytecode” (Levin, 2015). Developers may
just use the development tool made by Google to compile source code in Java into Dalvik bytecode
to be packaged into an APK (Levin, 2015).
We must also take note of the legal dispute between Google and Oracle over “Java technologies”.
Java is more than a programming language. Java was first conceived as a language capable of “write
once, run anywhere” (Oracle, 2017a). To write a programme in Java, one must learn Java’s class
libraries and run the programme on the Java Virtual Machine (JVM). The JVM is the first and the major
runtime environment. Source code written in Java is compiled into Java bytecode. The Java
bytecode files are executable “everywhere” as long as the JVM supports the operating system and
processor architecture (Oracle, 2017c). After Oracle Corporation’s acquisition of Sun Microsystems,
the Java language, the Java class libraries and the JVM became the property of Oracle. Later, Oracle
partially open-sourced the JVM, but the Java class libraries remain proprietary (Levin, 2015).
Google used the Java language and the structure of Java class libraries for Android’s runtime
environment Dalvik. Google copied the “structure” of Java class libraries but not the actual code of
Oracle’s Java class libraries and the JVM. In Google’s implementation, Android applications written
in Java are compiled into a new set of bytecode, called “Dalvik bytecode”, instead of Java bytecode.
The use of the structure of Java class libraries was the focal point in the lawsuit between Oracle and
Google (Bright, 2016; Mullin, 2016; Farivar, 2018). So, the design of the runtime environment in
Android from the very beginning was heavily shaped by the design of Java technologies.
For Android-specific features, Google implemented the Android Platform API for developers.
Common tasks such as showing a user interface (UI), playing a video and accessing cameras require
invocation of the Android Platform API (Google, 2018f).
The architecture of Android, especially the incorporation of Java technologies in the runtime
environment Dalvik, is important in studying Android applications’ code-level similarity. From the
discussion above, we know that the Java class libraries and Android APIs are the foundation for an
Android application. In other words, an application has to invoke the data types, objects and
functions from Java class libraries and Android API to work. Taking this characteristic into account,
in applications with high code similarity, the order of the invocation of the elements from the Java
class libraries and the Android API should also be similar.
Baker and Manber (1998) first attempted to compare Java bytecode similarities. They proposed
that a direct comparison of compiled bytecode would be ineffective because a minor change in
source code may cause references to line numbers to shift significantly in the process of compilation.
Drawing on Baker and Manber’s experience, Java bytecode needs to be transformed into an
assembly language called “Jasmin”, which can represent machine instructions in human-readable
text for analysis (Meyer, 1996).
Despite the fact that Dalvik bytecode does not run on JVM and vice versa, Dalvik bytecode shares
some similarity with Java bytecode. The assembly language “Smali” was developed on the basis of
Jasmin (Gruver, 2017) to represent Dalvik bytecode in human-readable form. Although Dalvik
bytecode and Java bytecode work differently in handling data (Spreitzenbarth, 2012), I found that
the procedural controls and arithmetic operations in Smali and Jasmin are more or less the
same. I believe that the methods used to compare Jasmin similarity can be easily adapted to
Smali.
Analogous to Baker and Manber’s (1998) study, to analyse Dalvik bytecode, a transformation of
bytecode files into a human-readable format is needed. Fortunately, this transformation is built
into APKTool. APKTool also acts as a “disassembler” that rewrites the machine instructions into
Smali. Indeed, there are also tools to transform Dalvik bytecode to Jasmin. However, Arnatovich,
Wang, Ngo and Soh (2018) suggested that extraction into Smali by APKTool had the highest rate of
code preservation.
Pre-processing of code
According to Roy, Cordy and Koschke (2009), we should not include all parts of code in clone
detection because there are parts of code that may cause “false positives”. We should narrow
down the scope to the parts of the code that would be conducive to effective detection of software
clones. In particular, they advised filtering out generated code.
In light of Roy, Cordy and Koschke’s advice, I decided that Appcestry should extract only the parts of
the code that are most likely to be the work of developers of Android applications. An APK
contains not only the developers’ own code but also automatically generated code from resources
and some code from linked libraries. Although the libraries that applications share are also of
concern to us, we will deal with “common libraries” more effectively in the namespace section, which I
will discuss later. Since the code from linked libraries and resources is neither created nor edited
by the applications’ developers, I made Appcestry filter out the code that is not of interest.
For every Android application to be published on Google Play, Google requires developers to assign
a unique application ID that “looks like a Java package name”. The same application ID must be
maintained in all versions of an Android application (Google, 2018g). Following the explanation of the
naming of Java packages by Oracle (2017b), if, for example, an individual developer or an organisation
whose domain name is “jasontc.net” is developing an application named “myapp”, the ID for that
application should be “net.jasontc.myapp”.
A single Android application may contain a large number of classes. A class in object-oriented
programming languages can be seen as an encapsulation of data types and methods (or functions).
To organise them, programmers usually follow Oracle’s (2017b) Java package name convention to
put the classes inside hierarchical namespaces that are arranged like an inverted form of an Internet
domain name.
Before going into the selection method, I would like to discuss the directory structure for Smali
disassembled by APKTool. I found that APKTool saves disassembled “classes”, or “objects” in an
object-oriented programming language, as individual files under directories prefixed by “smali”.
Under the “smali” directories, APKTool creates a directory for each namespace and stores the files
for the classes in the directories of their respective namespaces. For example, if a Java class is
named “myClass” under namespace “net.jasontc.myapp”, the file “myClass” would be stored under
“/net/jasontc/myapp” where “/” denotes a directory level.
You may notice that in the above samples, the application ID and the namespace happened to be
the same. Indeed, by default, Android Studio creates a namespace identical to the application ID for
developers to put their own code inside (Google, 2018g). Thus, further to the preceding examples,
the Smali files that are found to be under the directory “/net/jasontc/myapp” are very likely to be
written by the developers of “myapp”. I can assume that all code files outside the closest
namespaces are quite likely from linked libraries, such as trackers. Yet, Google (2018g) reminded us
that putting custom code under the same namespace as the application ID is a recommended practice
but not mandatory.
Taking these facts into account, I designed the following rules for Appcestry to select the code of
interest:
1. Appcestry checks the existence of the directory representing the whole application ID. If
the directory is present, Appcestry considers the files inside to be the code of interest.
Otherwise, Appcestry moves a level up the hierarchy of the application ID to check if a
corresponding directory is present, until one is found or the leftmost level of the ID is
reached. In other words, Appcestry only includes the code files that are inside the closest
namespace of the application ID;
2. Appcestry filters out files with names starting with “$” as they are automatically
generated from resources; and
3. To make the later comparison more efficient, Appcestry orders the remaining Smali files by
size from largest to smallest and takes only the first 5 megabytes of files. The larger a file,
the more logic and operations it contains. It is reasonable to believe that longer code may
require more effort from the developers.
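The three rules above can be sketched in Python as follows. This is an illustrative sketch and not Appcestry's actual code: the function names are hypothetical, the search is assumed to include sub-directories of the closest namespace, Smali files are assumed to carry the “.smali” extension, and “5 megabytes” is read here as 5 × 1024 × 1024 bytes.

```python
import os

MAX_TOTAL_BYTES = 5 * 1024 * 1024  # assumption: "5 megabytes" read as 5 MiB


def find_closest_namespace_dir(smali_root, app_id):
    """Rule 1: walk up the application ID until a matching directory exists."""
    parts = app_id.split(".")
    while parts:
        candidate = os.path.join(smali_root, *parts)
        if os.path.isdir(candidate):
            return candidate
        parts.pop()  # move one level up the hierarchy of the ID
    return None


def select_smali_files(smali_root, app_id, max_total=MAX_TOTAL_BYTES):
    """Apply the three selection rules and return the chosen file paths."""
    base = find_closest_namespace_dir(smali_root, app_id)
    if base is None:
        return []
    candidates = []
    for folder, _dirs, files in os.walk(base):
        for name in files:
            # Rule 2: skip files whose names start with "$".
            if name.endswith(".smali") and not name.startswith("$"):
                path = os.path.join(folder, name)
                candidates.append((os.path.getsize(path), path))
    # Rule 3: largest first, stop once the size budget would be exceeded.
    candidates.sort(key=lambda item: (-item[0], item[1]))
    selected, total = [], 0
    for size, path in candidates:
        if total + size > max_total:
            break
        selected.append(path)
        total += size
    return selected
```

For an application ID of “net.jasontc.myapp” with no “myapp” directory present, the sketch falls back to “net/jasontc”, which mirrors the "closest namespace" behaviour described in rule 1.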
Figure 4. Flowchart: Selection of Smali code files of interest
Parsing
I chose Another Tool for Language Recognition (ANTLR) as the library to parse, or to read, the Smali
files. ANTLR is a tool developed by Parr (2013) for computer programmes to work on the semantic
structures of programming languages as well as natural languages. For each language, ANTLR
requires a grammar file with the extension .g4 to generate a “walker” programme. The open source
project dex2jar (Pan, 2018) hosts an ANTLR grammar file for the Smali language. I used the grammar
file for Smali from dex2jar to request ANTLR to generate a “walker” programme in Python for
Appcestry to “navigate” the instructions in Smali files.
Tokenisation
The tokenisation process decides how the original language is encoded to prepare for further
processing by matching and machine learning algorithms. In this process, Appcestry decides what
instructions or what elements in an instruction to preserve, remove and recode.
According to Roy, Cordy and Koschke (2009), approaches to tokenisation of programming languages
for similarity detection are:
Text-based approach: Treating programme code as text fragments with little or no
transformation
Token-based approach: Transforming lines of programme code to a sequence of “tokens”
Tree-based approach: Transforming the structure of programme code to a tree
Metrics-based approach: Obtaining metrics from programme code
Graph-based approach: Transforming the relations between functions by inter-function calls
to a graph (Gabel, Jiang and Su, 2008)
The text-based approach is suited for Type-1 clones (programmes with very high similarity) but not
Type-2 and Type-3 clones (see the section “Types of software clones” for the definitions of the types).
Metrics-based, graph-based and tree-based approaches are structural and statistical methods which
are suited for cases in which all code can be taken into account. They are not applicable to our case
because Appcestry does not seek to compare all parts of the code but endeavours to filter out the
code not written by the application developers.
In light of the methods outlined in the studies by Baker and Manber (1998) and Roy, Cordy and Koschke
(2009), and considering the characteristics of Smali as a low-level language, I decided to adopt the
token-based approach. In the compilation process, each high-level, or more complex, instruction is
translated into a sequence of low-level, or more primitive, instructions. Also, the format of
low-level instructions is concise and rigid, and thus easier to tokenise. Considering low-level instructions
as a sequence of tokens makes sense in the discovery of similarity. Moreover, Roy, Cordy and
Koschke (2009) suggested that the token-based approach would be effective for detecting Type-1,
Type-2 and Type-3 clones.
The format of each Dalvik instruction is an operation code (opcode) followed by parameters (Google,
2018h), like:
Operation code (Opcode) parameter-1 parameter-2 and so on…
In the tokenisation process, I needed to decide which instructions, as determined by the opcode, to
keep or discard. If I chose to keep an instruction, I also needed to decide whether to convert the
values of the parameters into symbols or keep their original values.
For instructions, in the eyes of programmers of high-level languages, low-level operations are
tedious and sometimes redundant, as one high-level operation is always translated to the same set
of low-level operations. In Dalvik bytecode, for example, after declaring a method, we must follow
it with instructions to load the parameters into registers. So, the load operations here (Google,
2018i), for the purpose of comparing programme similarity, can be seen as redundant and thus may be
removed. I decided that Appcestry keeps the instructions concerning the declaration of classes and
methods, method invocation, arithmetic operations and procedural control.
For parameters, these may be data types and methods that are defined in the Java class
libraries or the Android API, or created by the programmers. The names for types and methods in the Java
class libraries and the Android API are stable because they are in the foundation of the Android
operating system. In contrast, the data types and methods written by programmers, which I call
“custom types and methods”, may be arbitrarily named and easily changed. So, the names for
custom types and methods should be “tokenised” or generalised as symbols. Simply put, I decided
to keep the original values of parameters that refer to the Java class libraries and the Android API and
rewrite all references to custom types and methods to tokens. To be precise, based on the list of
namespaces for the Android API (Google, 2018j), all references starting with “java.” and “android.” are
from the Android operating system. References matching this criterion are left untouched. All other
references are tokenised.
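The keep-or-discard and keep-or-generalise decisions described above can be illustrated with a short Python sketch. It is a simplification under stated assumptions and not Appcestry's implementation: it uses regular expressions rather than the ANTLR walker, the list of kept opcode prefixes is only indicative, and the token names CUSTOM_TYPE and CUSTOM_METHOD are hypothetical. In Smali text, class references appear in descriptor form such as `Ljava/lang/String;`, so the “java.” and “android.” prefixes are matched here as `Ljava/` and `Landroid/`.

```python
import re

# Indicative prefixes only: class/method declarations, invocations,
# procedural control and a few arithmetic operations.
KEPT_PREFIXES = (".class", ".method", "invoke-", "if-", "goto", "return",
                 "add-", "sub-", "mul-", "div-", "rem-")
PLATFORM_PREFIXES = ("Ljava/", "Landroid/")
# A class descriptor, optionally followed by "->methodName".
REFERENCE = re.compile(r"(L[\w/$]+;)(?:->([\w$<>]+))?")


def tokenise_line(line):
    """Return a token string for a kept instruction, or None if discarded."""
    stripped = line.strip()
    if not stripped:
        return None
    opcode = stripped.split()[0]
    if not opcode.startswith(KEPT_PREFIXES):
        return None  # e.g. register loads and constants are dropped
    tokens = [opcode]
    for owner, method in REFERENCE.findall(stripped):
        platform = owner.startswith(PLATFORM_PREFIXES)
        token = owner if platform else "CUSTOM_TYPE"
        if method:
            token += "->" + (method if platform else "CUSTOM_METHOD")
        tokens.append(token)
    return " ".join(tokens)


def tokenise_smali(text):
    """Tokenise every line of a Smali file, skipping discarded lines."""
    tokens = (tokenise_line(line) for line in text.splitlines())
    return [token for token in tokens if token is not None]
```

Because custom names are replaced by fixed symbols, two files that differ only in the names of their own classes and methods produce the same token sequence, which is what makes Type-2 clones detectable.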
Figure 5. Flowchart: Tokenisation of Smali code
Namespaces
As discussed in the section on Android architecture, classes are arranged under namespaces.
Adding a third-party library to an Android application also brings the namespaces of that library to
the code. By scanning the directory structure containing the Smali code, we can obtain a list of
namespaces that exist in an application. Namespaces give a good picture of common libraries
amongst applications.
The intersection of namespaces reveals the extent of how Android applications are similar in terms
of the underlying technologies and services. Therefore, I designed Appcestry to extract a list of
namespaces by scanning the directories prefixed by “smali”.
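A minimal sketch of such a directory scan is given below. The function name is hypothetical, and the sketch assumes that only directories which directly contain “.smali” files count as namespaces; whether Appcestry applies the same rule is not stated here.

```python
import os


def list_namespaces(decoded_apk_dir):
    """Collect dotted namespaces from every directory prefixed by "smali"."""
    namespaces = set()
    for entry in os.listdir(decoded_apk_dir):
        root = os.path.join(decoded_apk_dir, entry)
        if not (entry.startswith("smali") and os.path.isdir(root)):
            continue
        for folder, _dirs, files in os.walk(root):
            # Assumption: count only directories that hold class files.
            if folder != root and any(f.endswith(".smali") for f in files):
                relative = os.path.relpath(folder, root)
                namespaces.add(relative.replace(os.sep, "."))
    return sorted(namespaces)
```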
Android permissions
Permissions govern an application’s level of access to the users’ data, devices’ sensors and
system functions. In the file AndroidManifest.xml, there is a section for the declaration of a list of
permissions that developers would like to request from the users and Android. There are two types
of permissions: Android system permissions and custom permissions (Google, 2018k).
Android system permissions are linked to Android API and are classified into several categories. For
example, if an application would like to invoke the Android API to capture an image using the
device’s camera, the application needs to obtain the Android system permission called
“android.permission.CAMERA” from the user. All Android system permissions are prefixed with
“android” (Google, 2018l). Google also allows programmers to add custom permissions for use by
third party libraries. Due to the arbitrariness of custom permissions, custom permissions should be
handled separately from Android system permissions.
Shared permissions may reveal how applications are similar in terms of control of the users’ device.
Therefore, I designed Appcestry to extract both system permissions and custom permissions but
store them under different names.
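The split between system and custom permissions can be sketched with Python's standard XML parser, assuming a manifest that has already been decoded to plain text (for instance by APKTool). The function name and the keys of the returned dictionary are hypothetical, and only `uses-permission` elements are read in this sketch.

```python
import xml.etree.ElementTree as ET

# XML namespace used for "android:" attributes in manifests.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"


def extract_permissions(manifest_xml):
    """Split requested permissions into system and custom lists."""
    root = ET.fromstring(manifest_xml)
    system, custom = [], []
    for element in root.iter("uses-permission"):
        name = element.get(ANDROID_NS + "name")
        if name is None:
            continue
        # Permissions prefixed with "android." are treated as system ones.
        (system if name.startswith("android.") else custom).append(name)
    return {"system": sorted(system), "custom": sorted(custom)}
```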
XML files
Extensible Markup Language (XML) is a format for exchange of data that provides for a structure and
strongly typed formats to store data. XML elements are wrapped in “<” and “>” signs (World Wide
Web Consortium, 2006). Multilingual expressions on user interfaces, parameters and settings are
usually stored in XML files to be read by applications (Google, 2018m).
The contents of the XML files may reveal how applications are similar in terms of shared
configurations. Therefore, I designed Appcestry to extract the names and values for the attributes
from all XML files.
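As an illustration of this extraction, the sketch below collects the attribute names and values from one XML document with the standard library. The function name is hypothetical, and keeping names and values as two separate, de-duplicated lists is an assumption made for the example.

```python
import xml.etree.ElementTree as ET


def extract_xml_attributes(xml_text):
    """Return sorted, de-duplicated attribute names and attribute values."""
    root = ET.fromstring(xml_text)
    names, values = set(), set()
    for element in root.iter():
        for name, value in element.attrib.items():
            names.add(name)
            values.add(value)
    return sorted(names), sorted(values)
```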
Images
Icons, user interface elements and contents are saved as image files in an application.
Developers may reuse these resources across applications. The sharing of similar images may reveal
how resources are recycled in mobile applications.
However, image files allowed in an APK may be in various formats (Google, 2018n). Images that are
perceived by humans as similar may be very different on the level of encoded data in an image
format. Files of the same picture in slightly different sizes may result in a dramatically different
representation in the encoded data. Drawing on the techniques used in industry for copyright
enforcement, I apply the method of perceptual hash (p-Hash) for discovering similar images
(Zauner, 2010). Generally speaking, hash algorithms are seen as a way to “scramble” data into
irreversible data, usually of a fixed length (Gibbs, 2016). Hash functions are widely used in error
detection. Perceptual hash is an algorithm by which similar images are transformed into
similar data. Near-duplicate images that may be stored in different formats, in different sizes or in
different rotations would be hashed into the same or very similar values (Klinger, 2010).
For the detection of near image duplicates in Android applications, I designed Appcestry to gather a
list of p-Hashes from image files found in an APK.
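The idea behind perceptual hashing can be shown with a much simpler relative of p-Hash, the average hash. The sketch below is not the p-Hash library that Appcestry uses; it assumes the image has already been reduced to a small grid of greyscale values, and both function names are hypothetical. Similar images yield hashes with a small Hamming distance.

```python
def average_hash(pixels):
    """Hash a small greyscale grid: 1 where a pixel is above the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming_distance(hash_a, hash_b):
    """Count the positions at which two equal-length hashes differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(hash_a, hash_b))


# A tiny 2x2 "image", a uniformly brighter copy and an inverted layout.
original = [[10, 200], [220, 30]]
brighter = [[30, 220], [240, 50]]
inverted = [[200, 10], [30, 220]]
```

Here `original` and `brighter` hash to the same bits even though every pixel value differs, whereas `inverted` differs in every bit. A cryptographic hash such as SHA256 would treat all three as unrelated.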
Figure 6. Demonstration: I used the online demo of the p-Hash library used by Appcestry to
compare images with rotational and colour changes. The p-Hash algorithm here suggests the
images are similar.
All other files
An APK also contains files in a variety of formats. Although there are specific ways to detect
similarity for each type of file, the types not discussed above are optional in an application and
their types are non-exhaustive. I decided to adopt a general method to detect their similarity: the
hashing used in duplicate detection and file integrity checks. Because a change in a
single bit of input data leads to a radical change in the output of a hash function, hashing has
been considered a robust way to discover exact duplicate files and to verify transmitted data.
The sharing of identical files may reveal how resources are reused in the development of mobile
applications. So, for all other files (non-Dalvik bytecode, non-XML and non-images) in an APK,
Appcestry uses hash function SHA256 (NIST, 2015) to extract a list of hash values for identification of
identical files.
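Computing such hash values takes only the standard library in Python. The sketch below is illustrative; the function names and the chunk size are assumptions, and reading in chunks is simply a way to avoid loading large files into memory at once.

```python
import hashlib


def sha256_of_bytes(data):
    """Return the SHA256 hex digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA256 hex digest of a file, read chunk by chunk."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files are then considered identical when their digests are equal, so the comparison between applications reduces to comparing lists of short strings.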
Figure 7. Demonstration: I used the command-line Python interpreter to show how a change of a
single character in input data may result in a radically different output of hash value. The very
same hash function is used by Appcestry. In this example, the hyphen “-” in the first string was
changed to a colon “:” as in the second string. The hash values beneath the strings were very
different.
AppGene format
Appcestry obtains tokenised Smali code, a list of namespaces, lists of permissions, lists of values
and names from XML attributes, a list of p-Hashes of images and a list of SHA256 hash values of files
from an APK. These
features are saved in a format which I name “AppGene”. AppGene is a representation of the
features extracted from Android applications using the methods described in this chapter.
Appcestry converts each APK to an AppGene file. The AppGene format is based on the specification
of JavaScript Object Notation or JSON (ECMA, 2017) which is an open format widely used in the
exchange of data. The AppGene files generated by Appcestry are interoperable with other
programming languages and machine learning tools.
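To make the idea of a JSON-based feature file concrete, the sketch below writes and reads a small record. Every key name in it is hypothetical: the actual AppGene schema is not specified in this section, so the layout should be read as an assumption and not as the format Appcestry produces.

```python
import json


def build_appgene(app_id, smali_tokens, namespaces, system_permissions,
                  custom_permissions, xml_names, xml_values,
                  image_hashes, file_hashes):
    """Bundle extracted features into one dictionary (hypothetical keys)."""
    return {
        "appID": app_id,
        "smali": smali_tokens,
        "namespaces": namespaces,
        "systemPermissions": system_permissions,
        "customPermissions": custom_permissions,
        "xmlNames": xml_names,
        "xmlValues": xml_values,
        "imageHashes": image_hashes,
        "fileHashes": file_hashes,
    }


def save_appgene(appgene, path):
    """Write the feature dictionary to disk as JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(appgene, handle, indent=2, sort_keys=True)


def load_appgene(path):
    """Read a feature dictionary back from a JSON file."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)
```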
Comparing AppGenes for Similarity
1 to 1 comparison
Appcestry has the feature to allow for 1 to 1 comparison of AppGenes.
Drawing on Baker and Manber’s (1998) method of comparing disassembled and tokenised Java
bytecode, Appcestry invokes the diff programme on the Linux system to compare code difference.
The diff programme provides the similarity of documents as a percentage (IEEE & The Open Group,
2018).
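For readers without access to the Linux diff programme, a comparable percentage can be sketched with Python's standard `difflib` module. This is a swapped-in technique for illustration and not what Appcestry invokes; `SequenceMatcher` uses its own matching algorithm, so its figures need not agree with those derived from diff. The function name is hypothetical.

```python
import difflib


def code_similarity_percent(tokens_a, tokens_b):
    """Similarity of two token sequences as a percentage (0 to 100)."""
    matcher = difflib.SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    # ratio() is 2 * matches / (len(a) + len(b)).
    return 100.0 * matcher.ratio()
```

For example, two four-token sequences that share three tokens in the same order score 2 × 3 / 8, that is 75 per cent.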
For all other features (namespaces, XML attribute names and values, permissions, perceptual hashes
and SHA256 hashes), Appcestry uses Scikit-learn (Pedregosa et al., 2011), a popular machine
learning library for Python, to calculate the Jaccard similarity (the intersection over union) of the
features between applications.
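The Jaccard similarity itself is simple enough to state in a few lines. The sketch below uses plain Python sets rather than Scikit-learn, so it illustrates the measure and not Appcestry's actual code; the permission names are example inputs, and treating two empty sets as identical is a convention assumed here.

```python
def jaccard_similarity(features_a, features_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of features."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        # Assumed convention: two empty feature sets count as identical.
        return 1.0
    return len(a & b) / len(union)


# Example inputs: permissions requested by two hypothetical applications.
permissions_a = {"INTERNET", "CAMERA", "ACCESS_FINE_LOCATION"}
permissions_b = {"INTERNET", "CAMERA", "VIBRATE"}

# Two shared permissions out of four distinct ones gives 0.5.
print(jaccard_similarity(permissions_a, permissions_b))
```

A value of 1 means the two applications share every feature of that type, and 0 means they share none.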
Users need to obtain the APK files of Android applications to be studied before using Appcestry. I
do not elaborate on the possible ways to retrieve APK files because the facilitation of downloading
Android applications is not in the scope of this dissertation project.
Usage for general users
Figure 8. Usage for general users step 1: Drop APK files to the web-based interface of Appcestry
for AppGene conversion
Figure 9. Usage for general users step 2: Click the “Download” link on the right to download all
AppGenes in a single ZIP file. Or, click the individual links on the left to download AppGene files
separately
Figure 10. Usage for general users step 3: Drop zipped AppGene files or individual AppGene files
to the web-based interface of Appcestry for AppGene comparison
Figure 11. Usage for general users step 4: Read or download the result of the comparison
Usage for advanced users
If automated processing is needed, users may programmatically send APKs to Appcestry’s web
service in batch. The following example uses Appcestry through the command “curl”, a tool
widely used by network technicians and programmers to transfer data.
Figure 12. Usage for advanced users step 1: Use curl to send an APK to Appcestry’s AppGene
conversion service. Then, download the AppGene file after successful conversion.
Figure 13. Usage for advanced users step 2: Use curl to send AppGene files to Appcestry’s
AppGene comparison service. Then, download the result file after a successful comparison.
Clustering
Appcestry does not provide a built-in programme for clustering. Since data science operations
usually involve experimentation, I consider clustering an advanced feature, which was not
implemented for this dissertation project in the interest of time. To demonstrate AppGene’s
potential use in data mining tasks, I used Scikit-learn here to discover clusters of AppGenes.
For the avoidance of doubt, data mining is a discipline concerned with discovering knowledge from
data. There are two main types of machine learning algorithms: supervised learning, which requires
labelled training data to build models before use, and unsupervised learning, which does not need
labelled training data (Han, Pei & Kamber, 2011). Here, I adopt unsupervised learning to find
groups of applications by the similarity of their features.
For the choice of algorithm, the well-known k-means algorithm is not recommended because it
requires the number of groups to be specified in advance, which is unknown here. I chose the
DBSCAN algorithm because it requires no predefined number of clusters. However, the use of DBSCAN
requires setting parameters to specify the “closeness” of data points in a cluster.
Figure 14. Clustering: Use the unsupervised learning algorithm DBSCAN in Scikit-learn to discover
clusters of AppGenes
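How density-based clustering groups applications without a predefined number of clusters can be sketched as follows. This is a dependency-free stand-in written for illustration, not the Scikit-learn code shown in Figure 14: it is a minimal DBSCAN over Jaccard distances, with parameter names eps and min_samples chosen to mirror Scikit-learn's DBSCAN. The applications, their permissions and the parameter values are all invented.

```python
def jaccard_distance(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B| (0.0 for two empty sets, by assumption)."""
    a, b = set(a), set(b)
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


def dbscan(dist, eps, min_samples):
    """Minimal DBSCAN over a precomputed distance matrix.

    Returns one label per point: 0, 1, ... for clusters and -1 for noise.
    A point is a core point if at least min_samples points (itself
    included) lie within distance eps of it.
    """
    n = len(dist)
    labels = [None] * n  # None means "not visited yet"
    cluster = -1

    def neighbours(i):
        return [j for j in range(n) if dist[i][j] <= eps]

    for i in range(n):
        if labels[i] is not None:
            continue
        nearby = neighbours(i)
        if len(nearby) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nearby if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nearby_j = neighbours(j)
            if len(nearby_j) >= min_samples:
                queue.extend(nearby_j)  # j is a core point: keep expanding
    return labels


# Invented feature sets standing in for one feature type of four AppGenes.
apps = {
    "uni_a": {"INTERNET", "CAMERA", "ACCESS_FINE_LOCATION"},
    "uni_b": {"INTERNET", "CAMERA", "ACCESS_FINE_LOCATION", "VIBRATE"},
    "uni_c": {"INTERNET", "CAMERA", "ACCESS_FINE_LOCATION"},
    "game": {"BLUETOOTH", "RECORD_AUDIO"},
}
names = list(apps)
matrix = [[jaccard_distance(apps[p], apps[q]) for q in names] for p in names]

# Illustrative parameter values; in practice they need experimentation.
labels = dbscan(matrix, eps=0.3, min_samples=2)
print(dict(zip(names, labels)))
```

With these values the three similar applications fall into one cluster and the unrelated one is marked as noise (-1). Changing eps changes how "close" two applications must be to count as neighbours.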
Design and Architecture of Appcestry
In line with Marres’ (2012) observation that social methods can be multifarious, Appcestry is
designed to be usable by both ordinary users and technical people. Therefore, Appcestry provides
two interfaces: a web-based user interface and an API. As discussed in the previous section,
ordinary users may access Appcestry using a web browser, while developers may interact with
Appcestry using custom programmes or commands in batch.
Appcestry is written in Python, a programming language and environment known for its popularity
in the science and machine learning community (Patel, 2018).
In the era of Big Data, technologies are being developed to handle massive datasets. Google’s
MapReduce and BigQuery are examples of distributing the workload of data processing to
a cluster of computers made of commodity hardware (Tigani and Naidu, 2014). Containerisation is
also one of the trends in developing scalable and highly available services (Merkel, 2014).
Drawing on the design philosophy of distributed computing, Appcestry is designed to work not only
on a single computer but also on multiple computers in parallel. Following the container paradigm,
Appcestry is split into containers by functionality. Containerisation is likewise a trend in data
analytics operations: popular tools such as Spark and Hadoop may be run in containers
(Shivaprasad & Muralidharan, 2018).
Figure 15. The architecture of Appcestry in container services
The logos of NGINX, Flask, Redis, Redis Queue and DASK shown in the figure are for indicative
purposes only. They belong to their respective owners.
Potential Use Cases
The following samples demonstrate what knowledge researchers can learn about mobile media
ecology with the help of Appcestry.
Use Case 1: University apps
It is observed that many universities publish mobile applications that look alike and provide
similar features. Are they in fact developed by the same entity, which has recycled the code?
Through Appcestry, I found that, despite the taboo on reuse in academia, the spirit of originality
is not upheld in the universities’ practice of publishing mobile applications under their own names.
Figure 16. Mobile applications published by major UK universities are technically similar.
Figure 17. The prefix in their application IDs suggests that the development of these university
applications was outsourced to the same developer.
Use Case 2: Quit-Pornography apps
Helmond et al. (2018b) at the 2018 Digital Methods Summer School of the University of Amsterdam
wanted to investigate the ecology of “abortion apps” in reaction to reports that Google
discriminates against applications that offer information on abortion in favour of those
supporting the pro-life cause. Although Appcestry cannot provide a straightforward answer to
questions around Google Play queries, the use of Appcestry helped extract more knowledge from the
same set of data.
Using the same list of applications that the researchers used in their study, I downloaded the APKs of
all these applications and used Appcestry to discover similarities. It revealed relations that could
not have been found without Appcestry. With the help of Appcestry and Scikit-learn, I found that some
applications, despite being published under different names, were very likely published by the same
developer. For example, I made the interesting discovery that the same developer may have
recycled the same Android application into a diverse range of genres on Google Play, from advice on
quitting pornography to the Holy Bible to erectile dysfunction.
Figure 18. Technically similar applications in interesting genres by the same developer
These samples also show that Appcestry cannot answer a specific social question or test a
hypothesis; rather, it allows people to interrogate the truth of the mobile ecology. It is designed
with a view to exploring knowledge and testing a variety of theories that may be unknown to us at
the moment. The methods enabled by Appcestry should fall within the scope of abductive inference.
Further development
If resources permit, Appcestry shall in future incorporate data from the application markets,
namely titles, descriptions and screenshots, to facilitate clustering. Furthermore, the
discovery of groups in conjunction with machine learning algorithms appears to be an important
feature, but at this stage its use requires programming work. Ways to open the clustering
capabilities to ordinary users shall be researched.
Conclusion
This dissertation presents the tool Appcestry, which enables new methods to be developed for
studying the similarities of Android applications, leading to knowledge about the ecology of mobile
media. Appcestry responds to the problem of comparing APKs by breaking it down into comparisons of
the objects inside an APK, using methods suited to the ontological characteristics of their types.
Appcestry is designed to be usable by ordinary users and programmers alike, in the hope that tools
for social methods are multipurpose and multifarious.
References
9to5Mac. (2011). Jobs’ original vision for the iPhone: No third-party native apps. Retrieved from
https://9to5mac.com/2011/10/21/jobs-original-vision-for-the-iphone-no-third-party-native-
apps/
Ahtiainen, A., Surakka, S., & Rahikainen, M. (2006). Plaggie: GNU-licensed source code plagiarism
detection engine for Java exercises. In Proceedings of the 6th Baltic Sea conference on
Computing education research: Koli Calling 2006 (pp. 141-142). ACM.
Aiken, A. (2010). A System for Detecting Software Similarity [Computer software]. Retrieved from
https://theory.stanford.edu/~aiken/moss/
Anderson, C., & Wolff, M. (2010). The Web is dead. Long live the Internet. Wired Magazine, 18.
Android Open Source Project. (2018). About the Android Open Source Project. Retrieved from
https://source.android.com/
Apple. (2018). Swift 4. Retrieved from https://developer.apple.com/swift/
Arnatovich, Y. L., Wang, L., Ngo, N. M., & Soh, C. (2018). A Comparison of Android Reverse
Engineering Tools via Program Behaviors Validation Based on Intermediate Languages
Transformation. IEEE Access, 6, 12382-12394.
Baker, B. S., & Manber, U. (1998). Deducing Similarities in Java Sources from Bytecodes. In USENIX
Annual Technical Conference (pp. 179-190).
Bright, P. (2016). The Google/Oracle decision was bad for copyright and bad for software.
ArsTechnica. Retrieved from https://arstechnica.com/information-technology/2016/06/the-
googleoracle-decision-was-bad-for-copyright-and-bad-for-software/
Budiu, R. (2013). Mobile: Native Apps, Web Apps, and Hybrid Apps. Retrieved from
https://www.nngroup.com/articles/mobile-native-apps/
Campbell, S. W. (2013). Mobile media and communication: A new field, or just a new journal?
Mobile Media & Communication, 1(1), 8-13.
Chen, N., Hoi, S. C., Li, S., & Xiao, X. (2015). SimApp: A framework for detecting similar mobile
applications by online kernel learning. In Proceedings of the Eighth ACM International
Conference on Web Search and Data Mining (pp. 305-314). ACM.
Chen, N., Lin, J., Hoi, S. C., Xiao, X., & Zhang, B. (2014). AR-miner: mining informative reviews for
developers from mobile app marketplace. In Proceedings of the 36th International Conference
on Software Engineering (pp. 767-778). ACM.
Chen, W. (2015). A moveable feast: Do mobile media technologies mobilize or normalize cultural
participation? Human Communication Research, 41(1), 82-101.
Craig, P. (2005). Software piracy exposed. Syngress Publishing.
Craigmile, N. (2015). Mobile App Platforms: Hybrid, Native, Mobile Web. APP DEVELOPMENT
INDUSTRY INSIGHT. Clutch. Retrieved from https://clutch.co/app-
developers/resources/mobile-app-platforms-hybrid-native-mobile-web
Dua, K. (2018). A Guide to Mobile App Development: Web vs. Native vs. Hybrid. Retrieved from
https://clearbridgemobile.com/mobile-app-development-native-vs-web-vs-hybrid/
ECMA. (2017). The JSON Data Interchange Syntax. Standard ECMA-404. Retrieved from
http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf
Erickson, S., & Kelty, C. M. (2015). The Durability of Software. There is no Software, there are just
Services, 39.
Evans, D. S., Hagiu, A., & Schmalensee, R. (2008). Invisible engines: how software platforms drive
innovation and transform industries. MIT press.
Farivar, C. (2018). “Google’s use of the Java API packages was not fair,” appeals court rules.
ArsTechnica. Retrieved from https://arstechnica.com/tech-policy/2018/03/googles-use-of-the-
java-api-packages-was-not-fair-appeals-court-rules/
Ford, J. B. (2017). What Do We Know About Mobile Media and Marketing? Journal of Advertising
Research Sep 2017, 57 (3) 237-238; DOI: 10.2501/JAR-2017-032
Frakes, W. B., & Kang, K. (2005). Software reuse research: Status and future. IEEE transactions on
Software Engineering, 31(7), 529-536.
Frakes, W., & Terry, C. (1996). Software reuse: metrics and models. ACM Computing Surveys (CSUR),
28(2), 415-435.
Fuller, M. (2008). Introduction, the Stuff of Software. Software studies: A lexicon. MIT Press.
Gabel, M., Jiang, L., & Su, Z. (2008, May). Scalable detection of semantic clones. In Proceedings of
the 30th international conference on Software engineering (pp. 321-330). ACM.
Garrett, J. (2005). Ajax: A New Approach to Web Applications. Retrieved from
http://adaptivepath.org/ideas/ajax-new-approach-web-applications/
Gates, W. H. (1976). An Open Letter to Hobbyists.
Gazdecki, A. (2018). How To Become A Mobile-First Agency. Forbes. Retrieved from
https://www.forbes.com/sites/forbestechcouncil/2018/03/29/how-to-become-a-mobile-first-
agency/
Gerlitz, C. et al. (2016). App support ecologies: An empirical investigation of app-platform relations.
Infrastructures of Publics – Publics of Infrastructures, First Annual Conference 2016 of the DFG
Collaborative
Gerlitz, C., & Helmond, A. (2013). The like economy: Social buttons and the data-intensive web. New
Media & Society, 15(8), 1348-1365.
Gibbs, S. (2016). Passwords and hacking: the jargon of hashing, salting and SHA-2 explained. The
Guardian. Retrieved from https://www.theguardian.com/technology/2016/dec/15/passwords-
hacking-hashing-salting-sha-2
Gilbert, P., Chun, B. G., Cox, L. P., & Jung, J. (2011, June). Vision: automated security validation of
mobile apps at app markets. In Proceedings of the second international workshop on Mobile
cloud computing and services (pp. 21-26). ACM.
Gonzalez, H., Stakhanova, N., & Ghorbani, A. A. (2014). Droidkin: Lightweight detection of android
apps similarity. In International Conference on Security and Privacy in Communication Systems
(pp. 436-453). Springer, Cham.
Google. (2016). Android Software Development Kit License Agreement. Retrieved from
https://developer.android.com/studio/terms
Google. (2018a). Analyze your build with APK Analyzer. Android Developers. Retrieved from
https://developer.android.com/studio/build/apk-analyzer
Google. (2018b). Documentation: <manifest>. Android Developers. Retrieved from
https://developer.android.com/guide/topics/manifest/manifest-element
Google. (2018c). WebView for Android. Retrieved from
https://developer.chrome.com/multidevice/webview/overview.
Google. (2018d). CPUs and Architectures. Retrieved from
https://developer.android.com/ndk/guides/arch
Google. (2018e). Android Studio release notes. Retrieved from
https://developer.android.com/studio/releases/
Google. (2018f). Developer Guides. Retrieved from https://developer.android.com/guide/
Google. (2018g). Set the application ID. Retrieved from
https://developer.android.com/studio/build/application-id
Google. (2018h). Dalvik Executable instruction formats. Retrieved from
https://source.android.com/devices/tech/dalvik/instruction-formats
Google. (2018i). Dalvik bytecode. Retrieved from
https://source.android.com/devices/tech/dalvik/dalvik-bytecode
Google. (2018j). Package Index. Retrieved from https://developer.android.com/reference/packages
Google. (2018k). Permissions overview. Retrieved from
https://developer.android.com/guide/topics/permissions/overview
Google. (2018l). Manifest.permission. Retrieved from
https://developer.android.com/reference/android/Manifest.permission
Google. (2018m). Support different languages and cultures. Retrieved from
https://developer.android.com/training/basics/supporting-devices/languages
Google. (2018n). Reducing image download sizes. Retrieved from
https://developer.android.com/topic/performance/network-xfer
Google. (2018o). Getting Started: WebView-based Applications for Web Developers. Retrieved from
https://developer.chrome.com/multidevice/webview/gettingstarted
Graziano, D. (2013). How to install apps outside of Google Play. Retrieved from
https://www.cnet.com/how-to/how-to-install-apps-outside-of-google-play/
Gruver, B. (2017). About. Retrieved from
https://github.com/JesusFreke/smali/blob/master/README.md
Haefliger, S., Von Krogh, G., & Spaeth, S. (2008). Code reuse in open source software. Management
science, 54(1), 180-193.
Han, J., Pei, J., & Kamber, M. (2011). Data mining: concepts and techniques. Elsevier.
Helmond, A. (2015). The platformization of the web: Making web data platform ready. Social Media+
Society, 1(2), 2056305115603080.
Helmond, A. et al. (2018a). Mapping Data-Intensive App Infrastructures. Digital Methods Initiative
Winter School 2018. Retrieved from
https://wiki.digitalmethods.net/Dmi/WinterSchool2018MappingDataIntensiveAppInfrastructur
es
Helmond, A. et al. (2018b). App Stores and Their Bias: Repurposing ‘App Relatedness’? Retrieved
from
https://wiki.digitalmethods.net/Dmi/SummerSchool2018AppStoresBias#Research_Questions
Hill, S. (2016). Tired of Google Play? Check out these alternative Android app stores. Retrieved from
https://www.digitaltrends.com/mobile/android-app-stores/
Huang, H., Zhu, S., Liu, P., & Wu, D. (2013). A framework for evaluating mobile app repackaging
detection algorithms. In International Conference on Trust and Trustworthy Computing (pp.
169-186). Springer, Berlin, Heidelberg.
Hunt, A., & Thomas, D. (1999). The Evils of Duplication. The Pragmatic Programmer.
IEEE and the Open Group. (2018). IEEE Std 1003.1-2017 (Revision of IEEE Std 1003.1-2008). The
Open Group Base Specifications Issue 7, 2018 edition. Retrieved from
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/diff.html
Ilyas, M. & Ahson, S. (2006). Smartphones: Research Reports. International Engineering
Consortium.
Jscrambler. (2017). 12 Frameworks for Mobile Hybrid Apps. Retrieved from
https://blog.jscrambler.com/10-frameworks-for-mobile-hybrid-apps/
Kabali, H. K., Irigoyen, M. M., Nunez-Davis, R., Budacki, J. G., Mohanty, S. H., Leister, K. P., & Bonner,
R. L. (2015). Exposure and use of mobile media devices by young children
Kaldrack, I., & Leeker, M. (2015). There Is No Software, there Are Just Services: Introduction. There Is
No Software, There Are Just Services, 9-20.
Kelion, L. (2016). What is ARM and why is it worth £24bn? Technology. BBC News. Retrieved from
https://www.bbc.co.uk/news/technology-36826095
Klinger, E. (2010). pHash: The open source perceptual hash library [computer software]. Retrieved
from https://www.phash.org
Laplante, P. A. (2007). What every engineer should know about software engineering. CRC Press.
Lavigne, F. (2017). Secure your mobile serverless backend with App ID. IBM Cloud Blog. Retrieved
from https://www.ibm.com/blogs/bluemix/2017/12/secure-mobile-serverless-backend-app-id/
Levin, J. (2015). Android Internals: a Confectioner's Cookbook: Volume 1: the Power Users' View.
Technologeeks.com.
Louden & Lambert. (2011). Programming languages principles and practices. Cengage Learning.
Lowood, H. (2001). The hard work of software history. RBM: A Journal of Rare Books, Manuscripts,
and Cultural Heritage, 2(2), 141-160.
Lury, C., & Wakeford, N. (Eds.). (2012). Inventive methods: The happening of the social. Routledge.
Lynch, A. (2017). Beyond The Browser: From Web Apps To Desktop Apps. Retrieved from
https://www.smashingmagazine.com/2017/03/beyond-browser-web-desktop-apps/
Malpohl, G. (2005). JPlag: detecting software plagiarism [Computer software]. Retrieved from
http://www.ipd.uka.de.
Marres, N. (2012). Experiment: The experiment in living. Inventive methods: The happening of the
social, 76.
Martin, J. A. (2004). Mobile Computing: PDAs vs. Smart Phones. PC World. Retrieved from
https://www.pcworld.com/article/117211/article.html
Merkel, D. (2014). Docker: Lightweight Linux Containers for Consistent Development and
Deployment. Linux Journal. Retrieved from https://www.linuxjournal.com/content/docker-
lightweight-linux-containers-consistent-development-and-deployment
Meyer, J. (1996). JASMIN USER GUIDE. Retrieved from http://jasmin.sourceforge.net/guide.html
Microsoft. (2004). How to Configure Windows Firewall on a Single Computer. MSDN Library.
Retrieved from https://msdn.microsoft.com/en-us/library/cc875811.aspx
Mockus, A. (2007). Large-scale code reuse in open source software. In Emerging Trends in FLOSS
Research and Development, 2007. FLOSS'07. First International Workshop on (pp. 7-7). IEEE.
Morony, J. (2015). What’s the Difference between Native, Hybrid and Web Mobile App
Development? Retrieved from https://www.joshmorony.com/whats-the-difference-between-
native-hybrid-and-web-mobile-app-development/
Mullin, J. (2016). Second Oracle v. Google trial could lead to huge headaches for developers.
ArsTechnica. Retrieved from https://arstechnica.com/tech-policy/2016/05/round-2-of-oracle-v-
google-is-an-unpredictable-trial-over-api-fair-use/
NIST. (2015). Secure Hash Standard (SHS). FIPS PUB 180-4. Retrieved from
https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf
Oracle. (2017c). About the Java Technology. The Java™ Tutorials. Retrieved from
https://docs.oracle.com/javase/tutorial/getStarted/intro/definition.html
Oracle. (2017a). How Will Java Technology Change My Life? The Java™ Tutorials. Retrieved from
https://docs.oracle.com/javase/tutorial/getStarted/intro/changemylife.html
Oracle. (2017b). Naming a Package. The Java™ Tutorials. Retrieved from
https://docs.oracle.com/javase/tutorial/java/package/namingpkgs.html
Pan, B. (2018). dex2jar: Tools to work with android .dex and java .class files [computer software].
URL https://github.com/pxb1988/dex2jar
Parisi, L. (2012). Speculation: A method for the unattainable. Inventive methods: The happening of
the social, 232.
Parr, T. (2013). The definitive ANTLR 4 reference. Pragmatic Bookshelf.
Patel, P. (2018). Why Python is the most popular language used for Machine Learning. Retrieved
from https://medium.com/@UdacityINDIA/why-use-python-for-machine-learning-
e4b0b4457a77
Pedregosa, F. et al. (2011). Scikit-learn: Machine learning in Python. Journal of machine learning
research, 12(Oct), 2825-2830.
Raymond, E. S., & Enterprises, T. (1997). The cathedral and the bazaar.
Research Centre 1187 ‘Media of Cooperation,’ Artur-Woll-Haus, University of Siegen, Germany,
December 810.
Rhodes, N., & McKeehan, J. (2002). Palm OS programming: the developer's guide. O'Reilly Media,
Inc.
Rogers, R. (2013). Digital methods. MIT press.
Roy, C. K., Cordy, J. R., & Koschke, R. (2009). Comparison and evaluation of code clone detection
techniques and tools: A qualitative approach. Science of computer programming, 74(7), 470-
495.
Shabtai, A., Fledel, Y., & Elovici, Y. (2010). Automated static code analysis for classifying android
applications using machine learning. In 2010 International Conference on Computational
Intelligence and Security (pp. 329-333). IEEE.
Shivaprasad, S. & Muralidharan, M. (2018). Containerized Apache Spark on YARN in Apache Hadoop
3.1. Retrieved from https://hortonworks.com/blog/containerized-apache-spark-yarn-apache-
hadoop-3-1/
Sims, G. (2014). ARM vs X86 Key differences explained! Android Authority. Retrieved from
https://www.androidauthority.com/arm-vs-x86-key-differences-explained-568718/
Sojer, M., & Henkel, J. (2010). Code reuse in open source software development: Quantitative
evidence, drivers, and impediments.
Spreitzenbarth, M. (2012). Comparison of Dalvik and Java Bytecode. Forensic Blog. Retrieved from
https://forensics.spreitzenbarth.de/2012/08/27/comparison-of-dalvik-and-java-bytecode/
Srnicek, N. (2017). Platform capitalism. John Wiley & Sons.
StatCounter. (2018). Mobile Operating System Market Share Worldwide. Retrieved from
http://gs.statcounter.com/os-market-share/mobile/worldwide
Statista. (2018). Global mobile OS market share in sales to end users from 1st quarter 2009 to 1st
quarter 2018. Retrieved from https://www.statista.com/statistics/266136/global-market-
share-held-by-smartphone-operating-systems/
Sterling, G. (2018). US market becoming a smartphone duopoly. Marketing Land. Retrieved from
https://marketingland.com/us-market-becoming-a-smartphone-duopoly-244779
Stevenson, D. (2018). AMD VS Intel: What's The Best Processor? Retrieved from
https://www.techadvisor.co.uk/feature/pc-components/amd-vs-intel-3528212/
Tigani, J., & Naidu, S. (2014). Google BigQuery Analytics. John Wiley & Sons.
Toombs, C. (2015). Android M Will Never Ask Users For Permission To Use The Internet. Android
Police. Retrieved from https://www.androidpolice.com/2015/06/06/android-m-will-never-ask-
users-for-permission-to-use-the-internet-and-thats-probably-okay/
Tumbleson, C. (2018). APKTool: Introduction.
https://ibotpeaches.github.io/Apktool/documentation/
Wilson, B. (2018). Six Reasons Why More Startups Are Going Mobile First. Retrieved from
https://medium.com/swlh/six-reasons-why-more-startups-are-going-mobile-first-
7b466b3a7f4a
World Wide Web Consortium. (2007). W3C TAG Presentation to Workshop on Web of Services for
Enterprise Computing. Retrieved from
https://www.w3.org/2001/tag/2007/03/WSPresentation/contents.html
World Wide Web Consortium. (2006). Extensible markup language (XML) 1.1.
Wroblewski, L. (2011). Mobile First. Retrieved from
https://www.lukew.com/resources/mobile_first.asp
Zauner, C. (2010). Implementation and benchmarking of perceptual image hash functions.
Zhong, N., & Michahelles, F. (2013, March). Google play is not a long tail market: an empirical
analysis of app adoption on the Google play app market. In Proceedings of the 28th Annual
ACM Symposium on Applied Computing (pp. 499-504). ACM.
... Re-situating apps as software also makes them available in addressing the range of enquiries found in critical code and software studies (Fuller, 2008;Montfort et al., 2012). One can study the code up close or parse it through other diagnostic tools enabling comparisons across apps or sets of apps, such as with Appcestry (Chao, 2018). Data sourced from files can also be used to complement other methods. ...
Article
Full-text available
This article discusses methodological approaches to app studies, focusing on their embeddedness and situatedness within multiple infrastructural settings. Our approach involves close attention to the multivalent affordances of apps as software packages, particularly their capacity to enter into diverse groupings and relations depending on different infrastructural situations. The changing situations they evoke and participate in, accordingly, make apps visible and accountable in a variety of unique ways. Therefore, engaging with and even staging these situations allows for political-economic, social, and cultural dynamics associated with apps and their infrastructures to be investigated through a style of research we describe as multi-situated app studies. This article offers an overview of four different entry points of enquiry that are exemplary of this multi-situated approach, focusing on app stores, app interfaces, app packages, and app connections. We conclude with nine propositions that develop out of these studies as prompts for further research.
... Re-situating apps as software also makes them available in addressing the range of enquiries found in critical code and software studies (Fuller, 2008;Montfort et al., 2012). One can study the code up close or parse it through other diagnostic tools enabling comparisons across apps or sets of apps, such as with Appcestry (Chao, 2018). Data sourced from files can also be used to complement other methods. ...
Conference Paper
The panel engages with conceptual and methodological challenges within a specific area of 'internet rules', namely the space of mobile apps. Whereas the web was set out to function as a 'generative' and open technology facilitating the production of unanticipated services and applications, the growing popularity of social media platforms, and mobile apps is characterised by proprietary services that facilitate accessibility but obstruct transparency, tinkering, adjustment, and repurposing. This broader development from 'generative' technologies to 'tethered' devices and services has been referred to as 'appliancization' by Jonathan Zittrain (2008). In addition to Zittrain's focus on the proliferation of proprietary technologies, we suggest that platform infrastructures create specific conditions for the emergence of app ecologies and that apps and platforms are mutually dependent on a technological and economic level. From this perspective, the panel explores a number of novel methodologies for app studies. So far, methodological approaches for studying apps have focused on end-user interfaces and how users interpret app affordances (McVeigh-Schultz and Baym 2015), qualitative analyses of their political economies and the politics of location (Dyer-Witheford 2014; Wilken and Bayliss 2015), their social norms of use (Humphreys 2007) or their affective capacities (Matviyenko et al. 2015). The empirical investigation of apps and their ecologies currently faces multiple challenges: First, in contrast to most data collected from web sites and platforms, user activities can neither be simply observed or scraped from front-end interfaces nor easily be collected via APIs. In order to access app data, researchers may need to participate in using the app, which only affords a partial view (e.g. in the case of Tinder, Snapchat, and messaging apps) thereby opening up a number of ethical concerns. Second, method development has to respond to apps' fast update cultures. 
Like other internet-enabled technologies, apps are considered as services rather than products and have frequent development cycles, including design and features changes, which do not only require researchers to constantly adjust their tools and approaches, but which also make it particularly difficult to reconstruct the history of an app or its features. This panel responds to these methodological challenges by advancing methodological approaches that all share a common device or medium-specific perspective, departing
Article
Full-text available
In Android, performing a program analysis directly on an executable source is usually inconvenient. Therefore, a reverse engineering technique has been adapted to enable a user to perform a program analysis on a textual form of the executable source which is represented by an intermediate language (IL). For Android, Smali, Jasmin, and Jimple ILs have been introduced to represent applications executable Dalvik bytecode in a human-readable form. To use these ILs, we downloaded three of the most popular Android reversing tools including Apktool, dex2jar, and Soot, which perform transformation of the executable source into Smali, Jasmin, and Jimple ILs, respectively. However, the main concern here is that inaccurate transformation of the executable source may severely degrade the program analysis performance, and obscure the results. To the best of our knowledge, it is still unknown which tool most accurately performs a transformation of the executable source so that the re-assembled Android applications can be executed, and their original behaviours remain intact. Therefore, in this paper, we conduct an experiment to identify the tool which most accurately performs the transformation. We designed a statistical event-based comparative scheme, and conducted a comprehensive empirical study on a set of 1,300 Android applications. Using the designed scheme, we compare Apktool, dex2jar, and Soot via random-event-based and statistical tests to determine the tool which allows the re-assembled applications to be executed, and evaluate how closely they preserve their original behaviours. Our experimental results show that Apktool, using Smali IL, performs the most accurate transformation of the executable source since the applications, which are assembled from Smali, exhibit their behaviours closest to the original ones.
Poster
Full-text available
This project sets out to advance the study of mobile apps at the intersection with platform studies and explores what both fields of study may learn from each other. A novel empirical methodology is developed to explore the intricate relations between mobile apps and social media platforms. Our findings suggest to think of apps as relational software entities, simultaneously situated and distributed. Apps exist as part of wider ecologies made up of programmable infrastructures and controlled data flows. Furthermore, this empirical investigation interfaces apps with platform studies. First, it contributes to the study of mobile apps by providing a novel empirical methodology for mapping app–platform relations and thereby providing an account of apps as software entities that are both situated (existing “in context”) and distributed (both shaped by and shaping relations to platforms and diverse stakeholders). Second, it also contributes to the study of platforms by offering insights into stakeholder politics and practices, which we argue are crucial to understanding the defining features of platforms: their programmability, distinct affordances, multiplicitous stakeholders, and strategies for negotiating openness / closure. Download or share: http://bit.ly/app-support-ecologies
Conference Paper
Because it is not hard to reverse engineer the Dalvik bytecode used in the Dalvik virtual machine, Android application repackaging has become a serious problem. With repackaging, a plagiarist can simply steal others' code, violating the intellectual property of the developers. More seriously, after repackaging, popular apps can become the carriers of malware, adware or spyware and spread them widely. To maintain a healthy app market, several detection algorithms have been proposed recently, which can catch some types of repackaged apps in various markets efficiently. However, they generally lack valid analysis of their effectiveness. After analyzing these approaches, we find simple obfuscation techniques can potentially cause false negatives, because they change the main characteristics or features of the apps that are used for similarity detection. In practice, more sophisticated obfuscation techniques can be adopted (or have already been performed) in the context of mobile apps. We envision this obfuscation-based repackaging will become a phenomenon due to the arms race between repackaging and its detection. To this end, we propose a framework to evaluate the obfuscation resilience of repackaging detection algorithms comprehensively. Our evaluation framework is able to perform a set of obfuscation algorithms in various forms on the Dalvik bytecode. Our results provide insights to help gauge both the breadth and depth of algorithms' obfuscation resilience. We applied our framework to conduct a comprehensive case study on AndroGuard, an Android repackaging detector proposed in Blackhat 2011. Our experimental results have demonstrated the effectiveness and stability of our framework.
Article
In this article, I inquire into Facebook’s development as a platform by situating it within the transformation of social network sites into social media platforms. I explore this shift with a historical perspective on, what I refer to as, platformization, or the rise of the platform as the dominant infrastructural and economic model of the social web and its consequences. Platformization entails the extension of social media platforms into the rest of the web and their drive to make external web data “platform ready.” The specific technological architecture and ontological distinctiveness of platforms will be examined by taking their programmability into account. I position platformization as a form of platform critique that inquires into the dynamics of the decentralization of platform features and the recentralization of “platform ready” data as a way to examine the consequences of the programmability of social media platforms for the web.
Article
This book discusses the transformation of firms into platforms (companies providing software and hardware products to others) that has occurred in many economic sectors. This massive transformation resulted from capitalism turning to data as a source of economic growth and resilience. Changes in digital technologies have reshaped the relationships between companies and their workers, clients, and other capitalists, who increasingly began to rely on data. Dr. Nick Srnicek critically reviews "platform capitalism", putting new forms of the business model into the context of economic history, tracing their evolution from the long downturn of the 1970s to the economic boom of the 1990s and to the consequences of the 2008 financial crisis. The author demonstrates that the global economy was re-divided among a few monopolistic platforms and shows how these platforms set up new internal trends for the development of capitalism. © 2019 National Research University Higher School of Economics. All Rights Reserved.
Book
Do you… Use a computer to perform analysis or simulations in your daily work? Write short scripts or record macros to perform repetitive tasks? Need to integrate off-the-shelf software into your systems or require multiple applications to work together? Find yourself spending too much time working the kinks out of your code? Work with software engineers on a regular basis but have difficulty communicating or collaborating? If any of these sound familiar, then you may need a quick primer in the principles of software engineering. Nearly every engineer, regardless of field, will need to develop some form of software during their career. Without exposure to the challenges, processes, and limitations of software engineering, developing software can be a burdensome and inefficient chore. In What Every Engineer Should Know about Software Engineering, Phillip Laplante introduces the profession of software engineering along with a practical approach to understanding, designing, and building sound software based on solid principles. Using a unique question-and-answer format, this book addresses the issues and misperceptions that engineers need to understand in order to successfully work with software engineers, develop specifications for quality software, and learn the basics of the most common programming languages, development approaches, and paradigms.
Article
Background and objectives: Research on children's use of mobile media devices lags behind its adoption. The objective of this study was to examine young children's exposure to and use of mobile media devices. Methods: Cross-sectional study of 350 children aged 6 months to 4 years seen October to November 2014 at a pediatric clinic in an urban, low-income, minority community. The survey was adapted from Common Sense Media's 2013 nationwide survey. Results: Most households had television (97%), tablets (83%), and smartphones (77%). At age 4, half the children had their own television and three-fourths their own mobile device. Almost all children (96.6%) used mobile devices, and most started using before age 1. Parents gave children devices when doing house chores (70%), to keep them calm (65%), and at bedtime (29%). At age 2, most children used a device daily and spent comparable screen time on television and mobile devices. Most 3- and 4-year-olds used devices without help, and one-third engaged in media multitasking. Content delivery applications such as YouTube and Netflix were popular. Child ownership of device, age at first use, and daily use were not associated with ethnicity or parent education. Conclusions: Young children in an urban, low-income, minority community had almost universal exposure to mobile devices, and most had their own device by age 4. The patterns of use suggest early adoption, frequent and independent use, and media multitasking. Studies are urgently needed to update recommendations for families and providers on the use of mobile media by young children.