Detecting Privacy Requirements from User Stories with NLP Transfer
Learning Models
Francesco Casillo, Vincenzo Deufemia and Carmine Gravino
Department of Computer Science, University of Salerno, Via Giovanni Paolo II, 132, Fisciano (SA), 84084, Italy
ARTICLE INFO
Keywords:
User Stories
Natural Language Processing
Deep Learning
Transfer Learning
ABSTRACT
Context: To provide privacy-aware software systems, it is crucial to consider privacy from the very beginning of development. However, developers often lack the expertise and the knowledge required to embed the legal and social requirements for data protection into software systems.
Objective: We present an approach to decrease privacy risks during agile software development by automatically detecting privacy-related information in user story requirements, a prominent notation in agile Requirements Engineering (RE).
Methods: The proposed approach combines Natural Language Processing (NLP) and linguistic resources with deep learning algorithms to identify privacy aspects in user stories. NLP technologies are used to extract information on the semantic and syntactic structure of the text. This information is then processed by a pre-trained convolutional neural network, which paved the way for the implementation of a Transfer Learning technique. We evaluate the proposed approach through an empirical study with a dataset of 1680 user stories.
Results: The experimental results show that deep learning algorithms obtain better predictions than conventional (shallow) machine learning methods. Moreover, the application of Transfer Learning considerably improves the accuracy of the predictions, by ca. 10%.
Conclusions: Our study encourages software engineering researchers to consider the opportunities to automate privacy detection in the early phases of design, also by exploiting transfer learning models.
1. Introduction
Requirements engineering (RE) is one of the most complex activities of software engineering. Misunderstandings and imperfections in requirement documents can easily lead to design flaws and cause several problems [1,2]. Agile RE is based on face-to-face collaboration between customers and developers, which helps to address several RE problems, but does not exclude the presence of others.
Among them, the detection of non-functional requirements
(NFRs) by stakeholders is often a difficult activity due to sev-
eral reasons [3]. To alleviate this problem, several solutions
for the automatic detection of NFRs from text documents
have been proposed [4,5,6,7]. For instance, Slankas et
al. have proposed a tool-based approach, named NFR Loca-
tor, to extract sentences in unconstrained natural language
documents, which are classified into one of the 14 defined
NFR categories [7]. In general, these NFR detection tools
provide only an overview of the identified NFRs. However,
since stakeholders usually have expertise in a few specific areas, they might have difficulties in defining all the features of a software application, increasing the risk of neglecting some of them [3].

This work has been partially supported by the Italian Ministry of Education, University and Research (MIUR) under grant PRIN 2017 "EMPATHY: Empowering People in deAling with internet of THings ecosYstems" (Progetti di Rilevante Interesse Nazionale - Bando 2017, Grant 2017MX9T7H).
Corresponding author
fcasillo@unisa.it (F. Casillo); deufemia@unisa.it (V. Deufemia); gravino@unisa.it (C. Gravino)
ORCID(s): 0000-0003-4869-8068 (F. Casillo); 0000-0002-6711-3590 (V. Deufemia); 0000-0002-4394-9035 (C. Gravino)
Privacy is an essential NFR that needs special attention, as business needs require data protection and safeguarding [8]. Even if privacy requirements frequently appear in software documentation, most of the time stakeholders ignore them. The difficulty of privacy requirement identification mainly depends on the quality of requirement specifications, as shown in several studies (e.g., [9,10]).
In this paper we propose a deep learning approach to
identify possible privacy requirements within User Stories
(USs). The proposed solution aims to support practitioners with little privacy expertise in the identification of NFRs related to privacy. Although a lot has been done in the field
of privacy detection, to the best of our knowledge no study
deals with the analysis of USs. Thus, we verify whether it
is possible to exploit knowledge and tools proposed to ad-
dress similar problems. With respect to conventional ma-
chine learning methods, the deep ones have unique advan-
tages in feature extraction and semantic mining [11], and
have achieved excellent results in text classification tasks
[12,13,14,15,16,17]. Thus, from the analysis of user sto-
ries the deep learning models can infer individual privacy in-
formation and privacy rules, which can be used to recognize
privacy-related entities for individual user stories. Then, the
users can be reminded of the possibility of privacy leakage,
based on the defined privacy rules.
The proposed approach combines the use of linguistic re-
sources and Natural Language Processing (NLP) techniques
to extract features useful not only to capture the semantic
meaning and the syntactic structure of the text, but also to
determine the presence or absence of privacy-related words.
Francesco Casillo et al.: Preprint submitted to Elsevier Page 1 of 15
arXiv:2202.01035v1 [cs.SE] 2 Feb 2022
A further peculiarity of our approach is the use of Transfer Learning (TL), an emergent strategy where a model developed for one task is reused as the starting point for a model on a different but related task [18,19,20]. Specifically, we use a pre-trained convolutional neural network (CNN) designed to identify personal,
private disclosures from short texts [12] to extract features
from user stories, which are combined with features obtained
from a privacy dictionary to construct a US-privacy classi-
fier.
To show the effectiveness of our approach, we present
the results of an empirical study carried out by exploiting
a dataset of 1680 user stories taken from [21]. In particu-
lar, we present a type of sanity check by formulating two
research questions with the aim of verifying whether a deep learning method (CNN) performs at least as well as conventional (shallow) machine learning methods when exploiting NLP-based features (RQ1) or privacy word (PW) features (RQ2). The sanity check allows us to verify whether the further effort needed to apply a CNN is paid back by an improvement in prediction accuracy, and to assess the possible contribution of PW features when applying shallow and deep learning methods.
The comparison between shallow and deep learning meth-
ods is often performed when evaluating text classification
tools (e.g., [13,14,15]), mainly due to the possible noise
in the data that can lead to substantial changes in the accu-
racy of decisions [13]. In particular, in some studies, shallow
learning methods outperformed the deep ones in text classi-
fication tasks [15]. In our study, no clear result is obtained
in the comparison when exploiting PW features (RQ2), thus
confirming the importance of performing this kind of check.
In contrast, the results for RQ1 show that the deep learning method performs significantly better than the conventional machine learning methods when exploiting NLP-based features.
After performing the two sanity checks, we investigate
the proposed NLP-based Transfer Learning method by formulating a third research question (RQ3) that aims to compare its performance with that achieved using deep learning methods based on NLP-based features or privacy word
(PW) features. The experimental results for RQ3 reveal an
improvement of more than 10% (in terms of both Accuracy
and F1-score [22]) compared to the individual CNNs.
Organization of the paper. Section 2 presents the research background on agile requirement engineering and how privacy is typically analyzed in this context. Section 3 describes the approach designed to identify privacy aspects in agile requirement specifications. Section 4 describes the design of the empirical study carried out to evaluate the approach. Section 5 reports on quantitative results and discusses the main findings. Section 6 concludes the paper and presents future research directions.
2. Related work
This section analyzes the different NLP techniques proposed in the literature for US analysis. USs typically follow a structured format characterized by the who, the what, and the why of a requirement, which has become a de facto standard [23]. An example of a user story defined using Cohn's model [24] is:
As a site member, I want to access to the Facebook profiles
of other members so that I can share my experiences with
them
Several frameworks and methodologies have been pro-
posed for analyzing the quality of USs through their syn-
tactic analysis, with the aim of making them more accurate
and clear for customer’s requirement definition [25,26,27].
Other relevant investigations concern the transformation of USs into models and components useful for the next
stages of the software development processes. In particu-
lar, software diagrams can be automatically generated from
USs in order to provide a visual representation for project
stakeholders, to identify conceptual entities, or to highlight
potential problems in US definition [28,29,30,31]. These
activities open the door to further automated analysis able to
generate conceptual models [32], Use Case scenarios [33],
and even Backlog Items [34]. USs can also be analyzed to
automatically generate test cases [35] and create behavioral
models that help testers who might be non-expert.
Other studies on USs focus on extracting attributes that
can guide architecture design without relying on systemati-
cally and formally defined knowledge. For example, Gilson
et al. show that USs might have a great impact on early stage
decisions (because they might implicitly refer to quality at-
tributes) allowing software architects to have an idea of the
consequences of the possible design decisions [36]. In par-
ticular, they use machine learning (ML) techniques to clas-
sify if the USs refer to quality attributes and, if so, which
ones they refer to.
Approaches dealing with other types of attributes are able
to provide more detailed information, which improves the
definition of customer requirements and facilitates decisions
made during the software development process [3]. For ex-
ample, Villamizar et al. define an approach for reviewing
security-related aspects in agile requirement specifications
with a focus on web applications [37]. Their results indicate
significant differences when comparing the performance achieved
by experts using their approach against other defect-based
techniques. Similarly, Riaz et al. propose a ML-based tool
that takes in input a set of natural language artifacts and auto-
matically identifies (and classifies) security-relevant phrases
according to predefined security objectives [38]. However,
stakeholders may not be able to assess and define all aspects
of a software application together with customers, increas-
ing the risk of leaving out even high priority ones [3], as in
the case of data privacy.
Many efforts have been devoted to privacy disclosure in recent years, both to facilitate the work of analysts and developers [39,40] and to define a linguistic taxonomy of privacy for content analysis [41,42]. Many of the privacy detection approaches focus on the automatic recognition of sensitive personal information in unstructured text [16,43,12], which has enabled the development of several interesting tools, such
as TABOO [17] and PrivacyBot [44]. On the other hand,
many companies have particular needs with respect to per-
sonal data processing, and in the software design phase these
needs may be set aside to make space for more functional
requirements [45]. Therefore, the identification of privacy
content can be considered crucial when building the architecture of a software system. However, to the best of our knowledge, no approach has been proposed in the literature for the automatic identification of privacy content in the early stages of agile software development. This work aims to fill this gap by providing and evaluating an approach for detecting privacy information from USs.
3. A Methodology for Privacy Disclosure
Detection within User Stories
The proposed technique aims at identifying privacy-related threats in agile requirement specifications. The approach takes USs and linguistic resources as input, and exploits NLP techniques to determine the presence or absence of privacy-related words in the USs. A US is structured as follows [37]:

As a [role], I want to [feature], so that [reason].

Although this structure simplifies the comprehension of USs, the detection of privacy disclosures may be ineffective due to the wide variety of possible terms in USs. Therefore, more advanced approaches are needed to improve its effectiveness.
The proposed method leverages convolutional deep neu-
ral networks to identify short texts of USs having private
disclosures. In particular, we first adopt a lexicon-based ap-
proach to identify the words having entity-level privacy dis-
closures, by using the matches between USs and a privacy
dictionary as machine learning features. This method can
give high precision, but low recall since it relies only on the
count of sensitive words in a document, without considering
the context in which these words are used. To improve re-
call, we also exploit NLP tools to derive linguistic features,
such as syntactic dependencies and entity relations, which
keep the sentence level context into consideration.
The proposed deep neural network model combines multiple channels to perform the disclosure/non-disclosure classification task. Each channel refers to a different representation of the same candidate user story.
To deal with the paucity of curated data in the field, we propose the use of transfer learning, which allows knowledge acquired for one task to be reused to solve related ones. In particular, we exploit a pre-trained CNN that uses NLP-based features to detect privacy disclosures in Reddit users' posts and comments. This neural network is trained on 10K disclosure and non-disclosure sentences.
In what follows we provide details of the features used for
privacy disclosure detection (Sections 3.1 and 3.2), and the
architectures of the considered deep neural network models
(Sections 3.3 and 3.4).
3.1. Lexicon-based privacy disclosure features
These features are extracted from the text of the USs by
using linguistic resources, i.e., dictionaries, containing in-
dividual words or phrases that are assigned to one or more
linguistic categories. By using a privacy dictionary it is pos-
sible to count the occurrences of each dictionary word within
a US text, incrementing the relevant categories to which the
words belong [41,42]. The final result consists of values for
each linguistic privacy category, represented as a percentage
of the total words in the text.
We use the privacy dictionary proposed by Vasalou et al. [42], who constructed and validated eight dictionary categories on empirical material from a wide range of privacy-sensitive contexts. Experimental results have shown that the identified categories allow privacy language patterns to be effectively detected within a given text. Figure 1 highlights the two US words contained in the privacy dictionary defined in [41,42], whereas Table 1 reports the information on the privacy category OpenVisible they belong to.
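The dictionary-matching step described above can be sketched as follows. The category-to-words mapping below is illustrative: it contains only the OpenVisible examples of Table 1 plus the two matched US words, not the full privacy dictionary of Vasalou et al. [42].

```python
import re

# Sketch of the lexicon-based feature extraction (Section 3.1).
# PRIVACY_DICT is a toy stand-in for the full privacy dictionary of [42]:
# only the OpenVisible words of Table 1 and the two matched words are listed.
PRIVACY_DICT = {
    "OpenVisible": {"port", "display", "accessible", "access", "share"},
}

def lexicon_features(user_story: str) -> dict:
    """For each privacy category, return the fraction of US tokens
    that match one of the category's dictionary words."""
    tokens = re.findall(r"[a-z]+", user_story.lower())
    return {
        category: sum(t in words for t in tokens) / len(tokens)
        for category, words in PRIVACY_DICT.items()
    }

us = ("As a site member, I want to access to the Facebook profiles "
      "of other members so that I can share my experiences with them")
print(lexicon_features(us))  # 2 matches ('access', 'share') out of 24 tokens
```

For the US of Figure 1, the two matched words yield a non-zero OpenVisible value, which becomes one input feature of the lexicon-based channel.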
3.2. NLP-based features for privacy disclosure
These linguistic features are obtained from the text of
the USs by extracting the entities, the parts of speech, and
the dependencies between them, since the aim is to under-
stand the text from its meaning and to capture those features
that may affect the classification of USs as related to privacy
disclosure.
In what follows, we describe how these NLP-based features have been obtained for the US shown in Figure 1 by using the spaCy NLP toolkit1. First, the text is pre-processed by removing punctuation and insignificant words, leaving only lexical items (tokenization). The result of this process for the considered US is: ['As', 'a', 'site', 'member', 'I', 'want', 'to', 'access', 'to', 'the', 'Facebook', 'profiles', 'of', 'other', 'members', 'so', 'that', 'I', 'can', 'share', 'my', 'experiences', 'with', 'them'].
The Dependency Parser (DP) toolkit2 from spaCy has been used to extract information on syntactic relations and parts of speech (POS), which enables the data to be enriched with information on the syntactic and semantic structure. Table 2 reports the POS tags and dependencies extracted from the user story of Figure 1. These features help the model to understand the common sequence of tokens and the occurrence of dependency tags [46].
The Named Entity Recognizer (NER)3 of spaCy has been used to assign labels to contiguous tokens. The default model
provided by the library identifies various entities, such as
companies, locations, organizations, and products, and new
entities can be added to the system by updating the model
with new data.
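The resulting feature channels line up token-by-token, as the following sketch shows for the US of Figure 1. The (token, POS, dependency) triples are copied from Table 2; in the actual pipeline they are produced by spaCy's dependency parser rather than hard-coded.

```python
# Sketch of the parallel NLP feature channels (Section 3.2) for the US of
# Figure 1. The triples below are copied from Table 2; in the real pipeline
# they come from spaCy (token.text, token.pos_, token.dep_).
PARSED = [
    ("As", "SCONJ", "prep"), ("a", "DET", "det"), ("site", "NOUN", "compound"),
    ("member", "NOUN", "pobj"), ("I", "PRON", "nsubj"), ("want", "VERB", "ROOT"),
    ("to", "PART", "aux"), ("access", "VERB", "xcomp"), ("to", "ADP", "prep"),
    ("the", "DET", "det"), ("Facebook", "PROPN", "compound"),
    ("profiles", "NOUN", "pobj"), ("of", "ADP", "prep"), ("other", "ADJ", "amod"),
    ("members", "NOUN", "pobj"), ("so", "SCONJ", "mark"), ("that", "SCONJ", "mark"),
    ("I", "PRON", "nsubj"), ("can", "VERB", "aux"), ("share", "VERB", "advcl"),
    ("my", "DET", "poss"), ("experiences", "NOUN", "dobj"),
    ("with", "ADP", "prep"), ("them", "PRON", "pobj"),
]

# Split the triples into three aligned channels.
tokens, pos_tags, dep_tags = map(list, zip(*PARSED))

# Each channel is integer-encoded against its own vocabulary before being
# fed to a separate input of the network (encoding shown for POS only).
pos_vocab = {tag: i for i, tag in enumerate(sorted(set(pos_tags)))}
encoded_pos = [pos_vocab[t] for t in pos_tags]
```

Each encoded channel then feeds one input of the multi-input networks described next.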
3.3. Deep Neural Network Models
After doing all the necessary pre-processing steps, the
data is then fed into a multi-input deep neural network to
1https://spacy.io/
2https://spacy.io/usage/linguistic-features#dependency- parse
3https://spacy.io/usage/linguistic-features#named- entities
As a site member, I want to access to the Facebook profiles of other members so that I can share my experiences with them

Figure 1: The US words highlighted in red ('access' and 'share') are contained in the privacy dictionary defined in [42].
Table 1
The privacy category of the words 'access' and 'share' [41].

Category name (number of words) | Description | Example dictionary words
OpenVisible (2) | open and public access to people | port, display, accessible
Table 2
Parts-of-speech and dependencies extracted from a user story.
Text Part of speech Dependency
As SCONJ prep
a DET det
site NOUN compound
member NOUN pobj
I PRON nsubj
want VERB ROOT
to PART aux
access VERB xcomp
to ADP prep
the DET det
Facebook PROPN compound
profiles NOUN pobj
of ADP prep
other ADJ amod
members NOUN pobj
so SCONJ mark
that SCONJ mark
I PRON nsubj
can VERB aux
share VERB advcl
my DET poss
experiences NOUN dobj
with ADP prep
them PRON pobj
learn the hidden patterns and features needed to distinguish between texts with disclosure and non-disclosure occurrences. In particular, we constructed two deep convolutional neural networks, one based on the NLP-based features introduced in Section 3.2 (see Figure 2), the other on the lexicon-based features of Section 3.1 (see Figure 3). The first takes lexical features (word tokens) through one input and syntactic features (dependency parse tree information) through another, and then merges the resulting feature vectors. These vectors are further merged with supplemental (auxiliary) inputs before going through a multi-layer perceptron stage. At the end of the network, a single neuron provides the probability of each of the above-mentioned classes. The second network performs similar operations on the features obtained from the privacy dictionary.
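A minimal Keras sketch of such a multi-input network follows. All sizes (vocabularies, sequence length, filter counts, auxiliary-feature width) are illustrative assumptions, not the hyperparameters of the paper's networks.

```python
# Sketch of a multi-input CNN in the spirit of Section 3.3.
# All sizes below are illustrative, not the paper's actual hyperparameters.
from tensorflow.keras import layers, Model

SEQ_LEN = 40
tok_in = layers.Input(shape=(SEQ_LEN,), name="word_tokens")
dep_in = layers.Input(shape=(SEQ_LEN,), name="dependency_tags")
aux_in = layers.Input(shape=(8,), name="auxiliary")  # e.g. entity counts

def channel(inp, vocab_size):
    """One convolutional channel: embed, convolve, pool to a fixed vector."""
    x = layers.Embedding(vocab_size, 32)(inp)
    x = layers.Conv1D(64, 3, activation="relu")(x)
    return layers.GlobalMaxPooling1D()(x)

# Merge the per-channel vectors with the auxiliary inputs, then apply an MLP.
merged = layers.concatenate([channel(tok_in, 5000), channel(dep_in, 50), aux_in])
x = layers.Dense(64, activation="relu")(merged)
out = layers.Dense(1, activation="sigmoid")(x)  # disclosure probability
model = Model(inputs=[tok_in, dep_in, aux_in], outputs=out)
```

The single sigmoid neuron at the end matches the description above: one probability for the disclosure class.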
3.4. A Transfer Learning Methodology for
Privacy Disclosure Detection
The deep neural networks described above require a specific dataset on which to train the model from scratch for the classification task at hand. Unfortunately, the datasets of user stories available in the literature contain only a few hundred examples. For this reason, the models may not be able to adequately learn how to classify a US. To deal with this problem we introduce a neural network model that exploits transfer learning for the disclosure/non-disclosure classification task.
Transfer learning is an approach in which the knowledge
learned from a large-scale dataset to solve a particular task
is reused (transferred) and applied to solve a different but
related task [18]. In particular, transfer learning allows pre-trained shallow/deep learning models to be used by fine-tuning them on a relatively small labeled dataset from the downstream task.
In the proposed model, the NLP-based features described
in Section 3.2 are processed by a pre-trained convolutional
neural network whose aim is to identify short texts that have
personal, private disclosures [12]. In particular, the neural
network identifies whether the unstructured text given as in-
put contains private disclosures by analyzing the semantic
and syntactic structure of the text through the extraction of
the characteristics described above, i.e., entities, dependen-
cies, and parts of speech. This network has been trained
for privacy disclosure classification on ten thousand Reddit
users’ posts and comments.
Figure 4 shows the architecture of the deep neural network, named PD_TL, obtained by applying transfer learning. Taking advantage of the flexibility of the tools provided by Keras1, the pre-trained neural network proposed in [12] has
been truncated after the Flatten layer. The latter is concate-
nated with the Flatten layer of another neural network that
processes the lexicon-based privacy features. As a conse-
quence, the resulting neural network processes the informa-
tion concerning the semantic and syntactic structure, enrich-
ing this analysis with the information derived from the pri-
vacy dictionary.
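The truncate-and-concatenate surgery can be sketched in Keras as follows. The pre-trained disclosure CNN of [12] is replaced here by a small stand-in built inline, and all layer sizes are illustrative assumptions.

```python
# Sketch of the transfer-learning surgery of Section 3.4: cut the pre-trained
# network after its Flatten layer, freeze it, and concatenate it with a small
# branch over the lexicon-based privacy features. The 'pretrained' model below
# is a stand-in; in practice it would be loaded with keras.models.load_model.
from tensorflow.keras import Input, Model, layers

inp = Input(shape=(40,), name="nlp_features")
x = layers.Embedding(5000, 32)(inp)
x = layers.Conv1D(64, 3, activation="relu")(x)
flat = layers.Flatten(name="flatten")(x)
pretrained = Model(inp, layers.Dense(1, activation="sigmoid")(flat))

# Truncate after the Flatten layer and freeze the transferred weights.
trunk = Model(pretrained.input, pretrained.get_layer("flatten").output)
trunk.trainable = False

# Second branch over the lexicon-based (privacy-dictionary) features.
pw_in = Input(shape=(8,), name="privacy_dictionary")
pw = layers.Dense(16, activation="relu")(pw_in)

# Concatenate the two flattened representations and classify.
merged = layers.concatenate([trunk.output, pw])
out = layers.Dense(1, activation="sigmoid")(merged)
pdtl = Model([trunk.input, pw_in], out)
```

Freezing the truncated trunk keeps the knowledge transferred from the disclosure task intact while only the new layers are fit on the small US dataset.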
4. Empirical study design
In this section we present the design of the empirical
study we have performed. In particular, we first provide the
research questions and the motivations behind their formu-
1https://keras.io/
Figure 2: The CNN_NLP architecture.
Figure 3: The CNN_PW architecture.
lation. Then, the data employed for the analysis is described, followed by the presentation of the validation methods. The last part of the section presents the evaluation criteria adopted for assessing the predictions achieved with the built machine learning models and discusses threats to validity. The data and scripts to train the models and reproduce the results are available online at https://tinyurl.com/US-privacy.
4.1. Research questions
The aim of our investigation is to assess the application
of advanced methods and technologies to detect privacy con-
tent from USs. As mentioned in the introduction, we have
first performed a sort of sanity check to verify:

a) whether a deep learning method (CNN_NLP) performs at least as well as shallow machine learning methods when NLP-based features are exploited;

b) whether a deep learning method (CNN_PW) performs at least as well as shallow machine learning methods when privacy word (PW) features are exploited.
Then, starting from the consideration that user story datasets are difficult to obtain, especially those containing sensitive information, we have investigated the use of Transfer Learning (TL), which allows developers to analyze the similarities
between different tasks and to exploit a neural network used
for one task in a given domain and apply it to another do-
main.
To conduct this research study, we have formulated three
research questions:
RQ1 Is CNN_NLP at least as accurate as conventional machine learning methods at detecting privacy content when using NLP-based features?

RQ2 Is CNN_PW at least as accurate as conventional machine learning methods at detecting privacy content when using PW features?

RQ3 Are the predictions obtained with PD_TL better than those achieved with CNN_NLP and CNN_PW?
To answer RQ1 we have considered a convolutional net-
work that is trained on different features extracted through
different NLP techniques to predict if USs contain privacy
information. The considered neural networks are powerful
and flexible models that have the ability to detect complex
patterns even with limited training data. These models have
Figure 4: The PD_TL architecture.
shown high performance in several domains, including natural language processing [47]. It is therefore reasonable to assume that they are also effective in this context.
As for conventional machine learning methods, we have
considered: Logistic Regression (LR), Support Vector Ma-
chine (SVM), Gaussian Naive Bayes (GNB), k-Nearest Neigh-
bors (kNN), Random Forest (RF), and Decision Tree (DT).
In the following, we name the models using NLP-based features as: LR_NLP, SVM_NLP, GNB_NLP, kNN_NLP, RF_NLP, and DT_NLP. The choice of approaches like LR and SVM is not accidental: they are often used in the literature for solving relevant problems in software engineering. Moreover, they are particularly suitable for binary classification tasks.
Similarly, for addressing RQ2 we built and compared models obtained with CNN, LR, SVM, GNB, kNN, RF, and DT using PWs as features. In the following, we name these models as CNN_PW, LR_PW, SVM_PW, GNB_PW, kNN_PW, RF_PW, and DT_PW.
To address RQ3, we have considered the CNN, named PD_TL, defined in Section 3.4. In particular, the expectation is that the model resulting from transfer learning can provide better predictions than CNN_NLP and CNN_PW, as CNN_NLP is trained on little data containing privacy information and does not exploit PW features, while CNN_PW analyzes USs with a smaller set of features than PD_TL.
4.2. Data Collection
The proposed model for the detection of privacy disclosures in USs requires data on which to be trained in order to make predictions. In particular, this data should consist of a set of USs, each enriched by a label indicating
Table 3
Properties of the datasets used for the research.
Dataset | Description | Size | Privacy Terms | %PW&Di | %PW | %Di | %None
1 | Online platform for delivering transparent information on US governmental spending | 98 | 118 | 0.224 | 0.194 | 0.388 | 0.194
2 | Electronic land management system for the Loudoun County, Virginia | 58 | 107 | 0.328 | 0.000 | 0.638 | 0.340
3 | An online platform to support waste recycling | 51 | 86 | 0.176 | 0.137 | 0.137 | 0.549
4 | Website to create a transparent overview of governmental expenses | 53 | 85 | 0.566 | 0.151 | 0.170 | 0.113
5 | Platform for obtaining insights from data | 66 | 69 | 0.742 | 0.091 | 0.106 | 0.061
6 | First version of the Scrum Alliance Website | 97 | 115 | 0.175 | 0.031 | 0.670 | 0.124
7 | New version of the NSF website: redesign and content discovery | 73 | 115 | 0.041 | 0.000 | 0.740 | 0.219
8 | App for camp administrators and parents | 55 | 56 | 0.273 | 0.182 | 0.164 | 0.382
9 | First version of the PlanningPoker.com website | 53 | 53 | 0.170 | 0.057 | 0.623 | 0.151
10 | Platform to find, share and publish data online | 67 | 63 | 0.552 | 0.134 | 0.104 | 0.209
11 | Management information system for Duke University | 68 | 132 | 0.206 | 0.191 | 0.206 | 0.397
12 | Simplified toolbox to enable fast and easy development with Hadoop | 64 | 67 | 0.109 | 0.219 | 0.219 | 0.453
13 | Research data management portal for the universities of Oxford, Reading and Southampton | 102 | 119 | 0.186 | 0.186 | 0.245 | 0.382
14 | Personal interactive assistant for independent living and active aging | 138 | 126 | 0.036 | 0.065 | 0.413 | 0.486
15 | Conference registration and management platform | 69 | 106 | 0.116 | 0.430 | 0.739 | 0.101
16 | Software for machine-actionable data management plans | 83 | 115 | 0.578 | 0.181 | 0.229 | 0.012
17 | Web-based archiving information system | 57 | 72 | 0.123 | 0.070 | 0.211 | 0.592
18 | Institutional data repository for the University of Bath | 53 | 89 | 0.660 | 0.038 | 0.226 | 0.075
19 | Repository for different types of digital content | 100 | 88 | 0.050 | 0.120 | 0.220 | 0.610
20 | Software for archivists | 100 | 117 | 0.250 | 0.130 | 0.430 | 0.190
21 | Digital content management system for Cornell University | 115 | 173 | 0.252 | 0.157 | 0.391 | 0.200
22 | Citizen science platform that allows anyone to help in research tasks | 60 | 82 | 0.050 | 0.067 | 0.400 | 0.483
whether that US has privacy disclosures, and by as many features as possible that contribute to identifying privacy relations. Datasets of this type, or similar ones, were not found either on the Web or in the literature. Therefore, we needed to build such a dataset, starting from a set of USs from which to extract the characteristics that the model needs to make reliable predictions. To this end, a search was carried out to identify a large set of USs: this led to the discovery of 22 publicly available datasets, each containing more than 50 USs [21]. The method used to obtain these datasets is described in detail in [48].
Table 3 reports details and statistics about the considered datasets. Each row provides a brief description of the project, the number of USs, the number of privacy terms contained in the USs, and statistics about the NLP features used by the proposed approaches. In particular, each US was processed through the different NLP techniques in order to extract the features useful to the subsequently defined models. The last four columns of the table indicate the percentages of USs containing: both Privacy Words and Disclosures (PW&Di), only Privacy Words (PW), only Disclosures (Di), and none of the above (None). Note that the first author of the paper was in charge of manually classifying the privacy information, while the other two cross-checked the data. Table 4 shows four USs together with the extracted NLP features.
This dataset was manually analyzed to verify whether it was heterogeneous enough, i.e., whether it included enough instances of each type of US. The types of USs result from the assumption explained in the previous section. In particular, the types identified are: USs containing both privacy words and disclosures, USs containing only privacy words, USs containing only disclosures, and USs containing neither privacy words nor disclosures. Figure 5 shows the percentage of USs of each type in the considered dataset.

Figure 5: Partitions of the dataset for each type of US.

Special attention was paid to the types that contain only one of the two properties implying the presence of privacy content. If these types did not have a large number of instances, there was a risk that the model would fail to differentiate correctly between the various types of USs, thus compromising the validity of the prediction.
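The four partition types above follow mechanically from the two extracted signals. As an illustrative sketch (not the authors' code; the function and argument names are hypothetical), the labeling can be expressed as:

```python
# Illustrative sketch: labeling a user story with one of the four
# partition types from Figure 5, given the two privacy signals
# extracted by the NLP pipeline.

def partition_type(privacy_words, has_disclosure):
    """Map the two privacy signals of a US to its partition type."""
    has_pw = len(privacy_words) > 0
    if has_pw and has_disclosure:
        return "PW&Di"   # both privacy words and disclosures
    if has_pw:
        return "PW"      # only privacy words
    if has_disclosure:
        return "Di"      # only disclosures
    return "None"        # neither
```

For example, the US in row 0 of Table 4 carries the privacy word "data" but no disclosure, so it would be labeled "PW".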
The independent variables are the features extracted through NLP, namely entities, dependencies, parts of speech, privacy words, and privacy categories, while the dependent variables are Accuracy and F1-score. The choice of the latter is explained in the following section.
Francesco Casillo et al.: Preprint submitted to Elsevier Page 7 of 15
Detecting Privacy Requirements from User Stories
Table 4
Overview of the dataset used for the empirical study.

US 0: "As a Data user, I want to have the 12-19-2017 deletions processed."
  Entities: ['As', 'a', 'Data', 'user', 'PERSON', 'want', 'to', 'have', 'the', '12', '19', '2017', 'deletions', 'processed']
  Dependencies: ['prep', 'det', 'compound', 'pobj', 'nsubj', 'ROOT', 'aux', 'xcomp', 'det', 'nummod', 'nummod', 'nummod', 'dobj', 'acl']
  Parts of Speech: ['SCONJ', 'DET', 'PROPN', 'NOUN', 'PRON', 'VERB', 'PART', 'AUX', 'DET', 'NUM', 'NUM', 'NUM', 'NOUN', 'VERB']
  Privacy Categories: [['PrivateSecret', 1]]; Privacy Words: ['data']; Disclosure? 0

US 1: "As a UI designer, I want to redesign the Resources page, so that it matches the new Broker design styles."
  Entities: ['As', 'a', 'HEALTH', 'HEALTH', 'PERSON', 'want', 'to', 'redesign', 'the', 'Resources', 'page', 'so', 'that', 'it', 'matches', 'the', 'new', 'PRODUCT', 'design', 'styles']
  Dependencies: ['prep', 'det', 'compound', 'pobj', 'nsubj', 'ROOT', 'aux', 'xcomp', 'det', 'compound', 'dobj', 'mark', 'mark', 'nsubj', 'advcl', 'det', 'amod', 'compound', 'compound', 'dobj']
  Parts of Speech: ['SCONJ', 'DET', 'PROPN', 'NOUN', 'PRON', 'VERB', 'PART', 'VERB', 'DET', 'PROPN', 'NOUN', 'SCONJ', 'SCONJ', 'PRON', 'VERB', 'DET', 'ADJ', 'PROPN', 'NOUN', 'NOUN']
  Privacy Categories: none; Privacy Words: none; Disclosure? 0

US 2: "As a UI designer, I want to report to the Agencies about user testing, so that they are aware of their contributions to making Broker a better UX."
  Entities: ['As', 'a', 'HEALTH', 'HEALTH', 'PERSON', 'want', 'to', 'report', 'to', 'the', 'ORG', 'about', 'user', 'testing', 'so', 'that', 'PERSON', 'are', 'aware', 'of', 'their', 'contributions', 'to', 'making', 'PRODUCT', 'a', 'better', 'UX']
  Dependencies: ['prep', 'det', 'compound', 'pobj', 'nsubj', 'ROOT', 'aux', 'xcomp', 'prep', 'det', 'pobj', 'prep', 'compound', 'pobj', 'mark', 'mark', 'nsubj', 'advcl', 'acomp', 'prep', 'poss', 'pobj', 'prep', 'pcomp', 'nsubj', 'det', 'amod', 'ccomp']
  Parts of Speech: ['SCONJ', 'DET', 'PROPN', 'NOUN', 'PRON', 'VERB', 'PART', 'VERB', 'ADP', 'DET', 'PROPN', 'ADP', 'NOUN', 'NOUN', 'SCONJ', 'SCONJ', 'PRON', 'AUX', 'ADJ', 'ADP', 'DET', 'NOUN', 'ADP', 'VERB', 'PROPN', 'DET', 'ADJ', 'PROPN']
  Privacy Categories: [['OpenVisible', 1]]; Privacy Words: ['report']; Disclosure? 1

US 3: "As a UI designer, I want to move on to round 2 of DABS or FABS landing page edits, so that I can get approvals from leadership."
  Entities: ['As', 'a', 'HEALTH', 'HEALTH', 'PERSON', 'want', 'to', 'move', 'on', 'to', 'round', 'CARDINAL', 'of', 'DABS', 'or', 'FABS', 'landing', 'page', 'edits', 'so', 'that', 'PERSON', 'can', 'get', 'approvals', 'from', 'leadership']
  Dependencies: ['prep', 'det', 'compound', 'pobj', 'nsubj', 'ROOT', 'aux', 'xcomp', 'prt', 'aux', 'advcl', 'nummod', 'prep', 'pobj', 'cc', 'compound', 'compound', 'nsubj', 'conj', 'mark', 'mark', 'nsubj', 'aux', 'advcl', 'dobj', 'prep', 'pobj']
  Parts of Speech: ['SCONJ', 'DET', 'PROPN', 'NOUN', 'PRON', 'VERB', 'PART', 'VERB', 'ADV', 'PART', 'VERB', 'NUM', 'ADP', 'NOUN', 'CCONJ', 'NOUN', 'NOUN', 'NOUN', 'NOUN', 'SCONJ', 'SCONJ', 'PRON', 'VERB', 'AUX', 'NOUN', 'ADP', 'NOUN']
  Privacy Categories: none; Privacy Words: none; Disclosure? 1
4.3. Evaluation criteria
To evaluate the accuracy of the predictions, we used four popular evaluation metrics for classification tasks [49]: Accuracy, Precision, Recall, and F1-score. Accuracy is the most intuitive performance measure: it is the ratio of correctly predicted observations, i.e., true positives + true negatives, to the total observations. Precision is calculated as true positives / (true positives + false positives) and indicates the correctness of the responses provided by a technique. Recall measures the completeness of the responses and is calculated as true positives / (true positives + false negatives). F1-score is defined as the harmonic mean of Precision and Recall and indicates the balance between the two.
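The four definitions above can be written out directly from the confusion-matrix counts. The following is a minimal sketch (illustrative only; the study used the equivalent Scikit-learn implementations):

```python
# Compute Accuracy, Precision, Recall, and F1-score from raw
# true/predicted binary labels, exactly as defined in the text.

def classification_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```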
These metrics were chosen first because, in a binary classification task, accuracy, precision, and recall are equally important. Furthermore, they allow a comparison between the models implemented in this work and the pre-trained model, which was evaluated on the same metrics. The objective is to observe how precise the models are in identifying aspects of privacy, and to understand the limits of their ability to extract as many such aspects as possible from the test dataset.
Furthermore, we verified whether the predictions obtained with the different models came from the same population, in order to assess whether the differences observed with the chosen evaluation criteria (i.e., Accuracy and F1-score) were legitimate or due to coincidence [50]. Note that non-parametric techniques are usually preferred [51] to parametric methods when comparing machine learning and deep learning models, mainly because they make fewer assumptions about the data. Thus, we decided to employ the McNemar test to compare the performance of two models [52, 53]. In particular, given the predictions of two models, A and B, and the truth labels, a contingency table is calculated that counts the instances in which: i) both classifiers were correct; ii) both classifiers were incorrect; iii) A was correct and B was incorrect; iv) B was correct and A was incorrect. This makes it possible to estimate the probability that A is better than B at least as many times as observed in the experiment [52]. For comparing the performance of multiple machine learning and deep learning classifiers for the research questions in this work, the following null hypothesis was made:
Hn0: All models are equally accurate in identifying aspects of privacy.
McNemar's test allowed us to test the null hypothesis by comparing each pair of models under it. As usual, we considered a p-value of 0.05 as the significance threshold, i.e., p-values lower than 0.05 are assumed to be significant, implying that the results obtained are hardly due to chance and allowing the null hypothesis to be rejected [52]. Thus, for the comparisons in which the null hypothesis was rejected, it was determined whether one classifier was significantly better than the other classifiers.
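The test only depends on the two discordant cells of the contingency table described above (iii and iv). As a hedged sketch of the exact (binomial) form of McNemar's test (the paper does not specify which implementation was used; function names here are illustrative):

```python
# Exact two-sided McNemar test from the discordant counts:
#   b = instances where A was correct and B incorrect,
#   c = instances where B was correct and A incorrect.
from math import comb

def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value from the discordant counts."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Under H0 the discordant outcomes follow Binomial(n, 0.5).
    p = 2 * sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n
    return min(p, 1.0)
```

For instance, with b = 8 and c = 2 the p-value is about 0.109, so at the 0.05 threshold used in this study the null hypothesis would not be rejected.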
4.4. Validation Method
To determine the accuracy or effectiveness of a machine learning model, one or more evaluations are carried out on the errors obtained in the predictions. After training, an error estimate is made for the model, called the residual evaluation. However, this estimate only gives an idea of how well the model does on the data used to train it, as the model may underfit or overfit those data. The problem with this evaluation technique is thus that it gives no indication of how well the learning model will generalize to an independent or unseen dataset, i.e., data it has not already seen. To this end, we applied k-fold cross-validation, dividing the original dataset into training and validation sets k times, with k = 5. Furthermore, we ran this 5-fold cross-validation 40 times. First, the cardinality of the sets defined by the assumption that determines whether a US is related to privacy content was analyzed; these cardinalities are shown in Figure 5.
Using cross-validation, the sets defined for training consisted of 664 instances, of which 50% are USs containing both privacy words and disclosures, with the remaining 50% divided among the other three types as described below. The test set consisted of 166 instances, of which 83 are USs containing both disclosures and privacy words.
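The splitting scheme above (5-fold cross-validation repeated 40 times with reshuffling) can be sketched as follows. This is illustrative only, under the assumption of plain index-based folds; the study's actual splitting code, including its per-type balancing, is not shown here:

```python
# Repeated k-fold splitter: shuffle the indices, carve them into
# k folds, use each fold once as the test set, and repeat the
# whole procedure with a fresh shuffle.
import random

def repeated_kfold(n_items, k=5, repeats=40, seed=0):
    """Yield (train_idx, test_idx) pairs for each fold of each repeat."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_items))
        rng.shuffle(idx)
        for f in range(k):
            test = idx[f::k]          # every k-th shuffled index
            test_set = set(test)
            train = [i for i in idx if i not in test_set]
            yield train, test
```

With k = 5 and 40 repeats this yields 200 train/test splits in total, matching the number of runs reported in the result figures.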
4.5. Threats to validity
This part discusses the main threats to validity, explaining their possible effects and how they have been mitigated. The threats to the validity of this work derive mainly from the correctness of the tools used, the assumption regarding privacy content, and the generalizability and repeatability of the presented results.
Construct Validity This concerns ensuring that the measurement method corresponds to the construct being measured, and the adequacy of the observations and inferences drawn from the measurements performed during the study. In the context of using deep learning techniques for privacy content detection, methods offered by the Scikit-learn library have been used; in particular, its f1_score method (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) for measuring F1-score and its accuracy_score method (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) for measuring Accuracy. Relying on results from a single tool can pose a threat to validity, especially in the case of deep learning. However, Accuracy and F1-score were chosen because they were used in previous studies investigating approaches to detect privacy disclosures, which allowed the results obtained here to be compared with those of those studies. In addition, since the aim is to compare these deep learning techniques with classical machine learning models, the choice of these methods was almost obligatory, as the evaluation methods offered by Keras are not compatible with the Scikit-learn models. On the other hand, the metrics offered by the
Internal Validity This refers to the validity of the research results, and is mainly concerned with controlling extraneous variables and external influences that may impact the result. In exploring the applicability of transfer learning for the detection of privacy aspects, it was assumed that the models used are compatible with each other, as both are produced using the same technology, i.e., Keras. It would be interesting to observe how two models or neural networks developed through different APIs (e.g., Keras and PyTorch, https://pytorch.org/) can be combined in a transfer-learning experiment. Another threat to internal validity could be causality. However, the necessary conditions for causality are assumed to be largely fulfilled, as statistically significant correlations have been found between measures obtained via different methods, reinforcing the idea that these correlations derive from fairly robust causal relationships.
External Validity This relates to the generalizability and repeatability of the produced results. The approach proposed in this work is based on Python. However, the statistical models used are replicable in other programming languages, so the method is assumed to be programming-language agnostic and can be repeated in any other programming language, given the availability of suitable frameworks. To promote the replication and extension of this work, as noted above, we made all tools, scripts, and data available.
Conclusion Validity This is a measure of how reasonable a research or experimental conclusion is. Although the number of observations made in the statistical tests is not large, all the required hypotheses have been verified; therefore, the relationship between the data and the results is considered reasonable.
5. Results and discussion
In this section we present and discuss the results of the
empirical study for each research question addressed.
5.1. RQ1: Is CNN_NLP at least as accurate as conventional machine learning methods in detecting privacy content when using NLP-based features?
We present the results achieved with the models built to verify whether a deep learning method (CNN) exploiting NLP-based features performs at least as well as conventional (shallow) machine learning methods (our first sanity check).
As described in Section 4.2, the models are trained and tested with the same number of positive and negative samples. In particular, the positive samples are those with both Disclosures and Privacy Words, so for each fold 332 positive and 332 negative samples are selected for the training phase, while 83 positive and 83 negative samples are selected for the testing phase.
Table 5
Results achieved with each model to answer RQ1, in terms of Accuracy and F1-score (best result in bold in the original).

Model     Accuracy  F1-score
CNN_NLP   0.720     0.713
LR_NLP    0.617     0.605
SVM_NLP   0.519     0.084
GNB_NLP   0.510     0.612
kNN_NLP   0.557     0.519
RFC_NLP   0.662     0.669
DT_NLP    0.609     0.611
Each of the folds identified for training was used to build CNN_NLP (i.e., the model obtained with the NLP-based CNN) and LR_NLP, SVM_NLP, GNB_NLP, kNN_NLP, RFC_NLP, and DT_NLP (i.e., the models obtained with conventional machine learning methods by exploiting Scikit-learn).
The aggregated results in terms of the employed evaluation criteria are reported in Table 5, while Figures 6 and 7 show the results of all runs graphically. By analyzing the Accuracy and F1-score values reported in Table 5, we can observe that the results of CNN_NLP are better than those obtained by the other methods. Indeed, the Accuracy and F1-score values for CNN_NLP are 0.720 and 0.713, respectively, while the other machine learning methods obtained values below 0.7. The worst results were obtained with SVM_NLP.
From Figure 6 we can observe that CNN_NLP achieves better Accuracy values in all runs except four. It is also interesting to note that all methods show a regular trend in the Accuracy values across the runs, with a variation of up to 10%; in a few cases the variation is around 20% for CNN_NLP. This is probably due to the syntactic structure of the USs selected for the training phase: a greater number of positive and negative samples with a similar syntactic structure hampers the model's ability to learn the presence of privacy aspects.
From Figure 7 we can observe that LR_NLP, kNN_NLP, DT_NLP, and RFC_NLP show a regular trend in the F1-score values across the runs (with a variation of up to 10%). Differently, for CNN_NLP, GNB_NLP, and SVM_NLP some runs show a variation of around 20%. Only in seven runs is CNN_NLP characterized by F1-score values below those of the other approaches.
As designed, we also verified whether the differences in performance are statistically significant. To this end, we performed the McNemar test on the null hypothesis: "there are no differences in the accuracy of the models being compared". In particular, we compared the predictions achieved with CNN_NLP with those achieved with each shallow machine learning based model (i.e., those obtained with LR, SVM, GNB, kNN, RFC, and DT). For all the performed comparisons, we obtained a p-value < 0.001, allowing the rejection of the null hypothesis: there are significant differences between the predictions achieved with CNN_NLP and those achieved with the employed shallow machine learning based models. We can also conclude that the further effort needed to apply the CNN is paid back by a significant improvement in prediction accuracy.
Thus, we can positively answer our first research question, because a deep learning method (CNN) provided better predictions than conventional (shallow) machine learning methods.
5.2. RQ2: Is CNN_PW at least as accurate as conventional machine learning methods in detecting privacy content when using PW features?
This section presents the results achieved by the models built to answer our second research question, RQ2, i.e., whether a deep learning method (CNN) performs at least as well as shallow machine learning methods when PW features are exploited (our second sanity check).
Similarly to the RQ1 analysis, the models are trained and tested with the same number of positive and negative samples (see above). Thus, each of the folds identified for training was used for the models built with CNN and with LR, SVM, GNB, kNN, RF, and DT, exploiting Scikit-learn and PW features.
The aggregated results in terms of the employed evaluation criteria are reported in Table 6, while Figures 8 and 9 show the results of all runs graphically. By analyzing the Accuracy and F1-score values shown in Table 6, we can observe that the values range from 0.80 to 0.85, which can be considered good results, except for GNB_PW, which shows worse performance than the other employed machine learning methods.
From Figures 8 and 9 we can observe that all methods show a regular trend across the runs (with a variation of up to 10%), except for CNN_PW, where in three runs the F1-score values are about 20% lower than in the other runs, and in one run the Accuracy value is about 14% lower. The best result in terms of Accuracy and F1-score was obtained with RF_PW; it is reported in bold in Table 6. SVM_PW and kNN_PW also provided better predictions than CNN_PW.
As designed, we also verified whether the differences in performance are statistically significant by performing the McNemar test. For all the performed comparisons (i.e., CNN_PW vs LR_PW, CNN_PW vs SVM_PW, CNN_PW vs GNB_PW, CNN_PW vs kNN_PW, CNN_PW vs RF_PW, and CNN_PW vs DT_PW), we obtained a p-value < 0.001, allowing the rejection of the null hypothesis: there are significant differences between the predictions achieved with the two compared models. Thus, CNN_PW performs better than three conventional machine learning methods (i.e., LR_PW, DT_PW, and GNB_PW) and worse than the other three (i.e., SVM_PW, kNN_PW, and RF_PW) when PW features are exploited.
Thus, we cannot positively answer our second research question: the deep learning method is not at least as accurate as all the considered conventional machine learning methods in detecting privacy content when using PW features.
We can conclude that the second sanity check was particularly useful because it highlighted something unexpected, i.e., that a deep learning method is not always at least as accurate as a conventional machine learning method. This can happen, as shown in previous similar works (e.g., [15]).
For completeness, we observe that the prediction models built with the shallow machine learning methods exploiting PW features are better than those obtained with the same shallow machine learning methods exploiting NLP-based features (see Tables 5 and 6). The results of the McNemar test also revealed that these differences are statistically significant. Thus, the shallow machine learning methods improved their performance when trained with a smaller set of features.
Table 6
Results achieved with each model to answer RQ2, in terms of Accuracy and F1-score (best result in bold in the original).

Model    Accuracy  F1-score
CNN_PW   0.805     0.823
LR_PW    0.801     0.819
SVM_PW   0.828     0.848
GNB_PW   0.584     0.343
kNN_PW   0.810     0.825
RF_PW    0.829     0.851
DT_PW    0.805     0.819
Figure 8: Accuracy values of all the runs (to answer RQ2).
Figure 9: F1-score values of all the runs (to answer RQ2).
5.3. RQ3: Are predictions obtained with PD_TL better than those achieved with CNN_NLP and CNN_PW?
The main goal of our investigation is the application of the Transfer Learning technique, which consists in using the knowledge of a model trained for a specific task and combining it with another model for solving a different task, expanding the set of features used for prediction.
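The feature-expansion idea behind this setup can be illustrated with a toy sketch: the output of a model pre-trained on a related task is appended to the new task's feature vector. The names and the heuristic "pre-trained" scorer below are hypothetical stand-ins, not the actual pre-trained disclosure model used in this work:

```python
# Toy feature-level transfer: append a "pre-trained" model's output
# to the downstream task's feature vector.

DISCLOSURE_VERBS = {"report", "share", "publish", "send"}  # assumed list

def pretrained_disclosure_score(tokens):
    """Stand-in for the pre-trained model's disclosure prediction."""
    hits = sum(1 for t in tokens if t.lower() in DISCLOSURE_VERBS)
    return hits / max(len(tokens), 1)

def transfer_features(task_features, tokens):
    """Expanded feature vector: task features + transferred knowledge."""
    return list(task_features) + [pretrained_disclosure_score(tokens)]
```

In the actual study the transferred component is a Keras network, but the principle is the same: the downstream classifier learns from both the original NLP features and the knowledge encoded in the pre-trained model.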
Similarly to the RQ1 and RQ2 analyses, the comparisons among PD_TL, CNN_NLP, and CNN_PW were performed in terms of Accuracy and F1-score, whereas the McNemar statistical test was used to verify the significance of the achieved results.
Figure 10: Accuracy values of all the runs (to answer RQ3).
The aggregated results in terms of the employed evaluation criteria are reported in Table 7, while Figures 10 and 11 show the results of all runs graphically. We can note that the model resulting from the application of Transfer Learning (PD_TL) provided better Accuracy and F1-score values (i.e., greater than 0.90) than the models based on deep learning analyzed previously (i.e., CNN_NLP and CNN_PW). In particular, PD_TL surpasses CNN_PW and CNN_NLP by more than 10% in terms of both Accuracy and F1-score. Furthermore, the results of the McNemar test revealed that the differences are statistically significant (p-value < 0.001 for both comparisons). Finally, as clearly shown in Figures 10 and 11, PD_TL provided better results in all runs except one, and the distribution of its values shows less variation than those of CNN_NLP and CNN_PW.
Table 7
Results achieved with each model to answer RQ3, in terms of Accuracy and F1-score (best result in bold in the original).

Model    Accuracy  F1-score
CNN_NLP  0.720     0.713
CNN_PW   0.805     0.823
PD_TL    0.937     0.937
Based on the obtained results, we can state not only that Transfer Learning is feasible, but also that it is better than using deep learning models alone for privacy content analysis.
Thus, we can positively answer our third research question: predictions obtained with PD_TL are better than those achieved with CNN_NLP and CNN_PW.
Figure 11: F1-score values of all the runs (to answer RQ3).
5.4. Findings and suggestions for researchers and
practitioners
The analysis carried out to answer our research questions
allows us to highlight implications for researchers and prac-
titioners about the applicability of our findings. We organize
the discussion according to the achieved contributions.
On the use of a tool to predict privacy content. We have provided an approach and a tool to automatically predict privacy content from user stories (a problem never addressed before), which exploit a combination of NLP and transfer learning strategies. This should encourage software engineering researchers, and in particular practitioners, to consider the opportunities of automating privacy content detection.
Implication 1. Practitioners can exploit an approach and tool that reduce the effort (and cost) of identifying privacy requirements in the early phases of design. User studies involving practitioners should be performed with the aim of promoting the suggested approach and tool.
On the use of deep learning methods. As expected, the experimental results show that the use of NLP-based CNNs can improve predictions about privacy requirements with respect to conventional (shallow) machine learning methods. However, the analysis also revealed that the strategy used to train the models is crucial. In particular, the RQ2 analysis did not highlight a clear advantage in using deep learning methods over conventional (shallow) machine learning methods. Other studies achieved similar results (e.g., [15]).
Implication 2. Researchers should invest effort in conducting empirical studies on different datasets, aiming at identifying strategies for training NLP-based prediction models for privacy requirement detection in the agile context.
On the use of privacy words. Our analysis clearly showed that the use of privacy words significantly improved the predictions of some of the employed shallow machine learning methods (comparing the RQ2 results against the RQ1 results). In particular, RF_PW, SVM_PW, and kNN_PW also provided better predictions than CNN_NLP. Thus, even cheaper methods can provide good predictions when exploiting data of the specific domain under investigation.
Implication 3. The research community should invest effort in investigating the impact of domain-specific data on the use of cheaper methods, aiming at verifying their effectiveness with respect to more expensive methods.
On the use of Transfer Learning. The main result of our analysis concerns the use of Transfer Learning, which allowed us to improve the performance of the built NLP-based CNN prediction models by about 10% in terms of Accuracy and F1-score. This is a further confirmation of the benefit of this emergent strategy, which allows the reuse of a system developed for one task to build a model for a different but related task [18, 19, 20].
Implication 4. Researchers should apply Transfer Learning when training NLP-based prediction models, aiming at improving their effectiveness in detecting privacy as well as security requirements in the agile context.
6. Conclusions and Future work
Interest in machine learning techniques based on natural language processing has been growing in recent years, including in the field of software engineering. Most existing attempts focus on the generation of models and components useful in the different phases of software engineering from customer-specified requirements. On the other hand, few attempts to capture non-functional requirements have been documented in the literature, even though they contribute considerably to the evaluation of software quality.
The results of our empirical study revealed that deep learning methods can be used for the detection of non-functional requirements from customer requirements. In particular, we found that deep learning models can be used for the identification of privacy disclosures in user stories, even with near-optimal performance. Furthermore, the review of recent deep learning techniques led to the exploration of Transfer Learning, whose applicability in this context was evaluated. The experiment on the application of Transfer Learning demonstrated the feasibility of this technique in the context of privacy content detection in user stories.
As for future research directions, there are reasons to extend this work to a broader scope, including other NFRs, or to experiment with such techniques on other similar tasks. Future developments could also focus on improving the strategies employed in this work: for instance, updating the privacy dictionary used or producing newer, more elaborate taxonomies could help in this regard. Further research might involve the adoption of other NLP techniques for feature extraction, expanding the set of features on which the various models are trained and tested. Eventually, it would be interesting to analyze the application of Transfer Learning between models of a different nature, both technological and methodological, in order to better understand in which contexts and circumstances this technique leads to significant improvements.
CRediT authorship contribution statement
Francesco Casillo: Methodology, Software, Validation,
Writing - Original Draft. Vincenzo Deufemia: Conceptu-
alization, Formal analysis, Writing - Review & Editing, Su-
pervision. Carmine Gravino: Conceptualization, Formal
analysis, Writing - Review & Editing, Supervision.
References
[1] I. Sommerville, P. Sawyer, Requirements Engineering: A Good Prac-
tice Guide, Wiley, New York, NY, USA, 1997.
[2] K. Pohl, Requirements Engineering: Fundamentals, Principles, and
Techniques, 1st ed., Springer Publishing Company, Incorporated,
2010.
[3] D. M. Fernández, S. Wagner, M. Kalinowski, M. Felderer, P. Mafra,
A. Vetrò, T. Conte, M.-T. Christiansson, D. Greer, C. Lassenius,
T. Männistö, M. Nayabi, M. Oivo, B. Penzenstadler, D. Pfahl, R. Prik-
ladnicki, G. Ruhe, A. Schekelmann, S. Sen, R. Spinola, A. Tuzcu, J.L.
de la Vara, R. Wieringa, Naming the pain in requirements engineering
- Contemporary problems, causes, and effects in practice, Empirical
software engineering (2017) 2298–2338.
[4] F. Paetsch, A. Eberlein, F. Maurer, Requirements engineering and
agile software development, in: Proceedings of 12th IEEE Interna-
tional Workshops on Enabling Technologies (WETICE 2003), Infras-
tructure for Collaborative Enterprises, 9-11 June 2003, Linz, Austria,
IEEE Computer Society, 2003, pp. 308–313. doi:10.1109/ENABL.2003.
1231428.
[5] Z. Kurtanović, W. Maalej, Automatically classifying functional and
non-functional requirements using supervised machine learning, in:
Proceedings of IEEE 25th International Requirements Engineering
Conference (RE), 2017, pp. 490–495. doi:10.1109/RE.2017.82.
[6] Q. L. Nguyen, Non-functional requirements analysis modeling for
software product lines, in: Proceedings of ICSE Workshop on Mod-
eling in Software Engineering, MiSE 2009, Vancouver, BC, Canada,
May 17-18, 2009, IEEE Computer Society, 2009, pp. 56–61. doi:10.
1109/MISE.2009.5069898.
[7] J. Slankas, L. Williams, Automated extraction of non-functional re-
quirements in available documentation, in: Proceedings of 1st Inter-
national Workshop on Natural Language Analysis in Software Engi-
neering (NaturaLiSE), 2013, pp. 9–16. doi:10.1109/NAturaLiSE.2013.
6611715.
[8] P. Anthonysamy, A. Rashid, R. Chitchyan, Privacy requirements:
Present future, in: Proceedings of IEEE/ACM 39th International Con-
ference on Software Engineering: Software Engineering in Society
Track (ICSE-SEIS), 2017, pp. 13–22. doi:10.1109/ICSE-SEIS.2017.3.
[9] L. Cao, B. Ramesh, Agile requirements engineering practices: An
empirical study, IEEE Software (2008) 60–67.
[10] F. Paetsch, A. Eberlein, F. Maurer, Requirements engineering and
agile software development, in: Proceedings of IEEE International
Workshops on Enabling Technologies: Infrastructure for Collabora-
tive Enterprises, 2003, pp. 308–313. doi:10.1109/ENABL.2003.1231428.
[11] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015)
436–444.
[12] N. Mehdy, C. Kennington, H. Mehrpouyan, Privacy disclosures detection in natural-language text through linguistically-motivated artificial neural networks, in: Security and Privacy in New Computing Environments, 2019, pp. 152–177. doi:10.1007/978-3-030-21373-2_14.
[13] Q. Li, H. Peng, J. Li, C. Xia, R. Yang, L. Sun, P. S. Yu, L. He, A
survey on text classification: From shallow to deep learning, CoRR
abs/2008.00364 (2020).
[14] J. Haneczok, J. Piskorski, Shallow and deep learning for event re-
latedness classification, Information Processing & Management 57
(2020) 102371.
[15] M. Oleynik, A. Kugic, Z. Kasáč, M. Kreuzthaler, Evaluating shallow
and deep learning strategies for the 2018 n2c2 shared task on clini-
cal text classification, Journal of the American Medical Informatics
Association 26 (2019) 1247–1254.
[16] G. Xu, C. Qi, H. Yu, S. Xu, C. Zhao, J. Yuan, Detecting sensitive
information of unstructured text using convolutional neural network,
in: Proceedings of International Conference on Cyber-Enabled Dis-
tributed Computing and Knowledge Discovery (CyberC), 2019, pp.
474–479. doi:10.1109/CyberC.2019.00087.
[17] J. Neerbeky, I. Assentz, P. Dolog, Taboo: Detecting unstructured sen-
sitive information using recursive neural networks, in: Proceedings
of IEEE 33rd International Conference on Data Engineering (ICDE),
2017, pp. 1399–1400. doi:10.1109/ICDE.2017.195.
[18] L. Torrey, J. Shavlik, Transfer learning, in: Handbook of research on
machine learning applications and trends: algorithms, methods, and
techniques, IGI global, 2010, pp. 242–264.
[19] E. Kocaguneli, T. Menzies, E. Mendes, Transfer learning in effort
estimation, Empir. Softw. Eng. 20 (2015) 813–843.
[20] R. Krishna, T. Menzies, Bellwethers: A baseline method for transfer
learning, IEEE Trans. Software Eng. 45 (2019) 1081–1105.
[21] F. Dalpiaz, Requirements data sets (user stories), 2018. doi:10.17632/7zbk8zsd8y.1, https://data.mendeley.com/datasets/7zbk8zsd8y/.
[22] R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval,
Addison-Wesley, 1999.
[23] G. Lucassen, F. Dalpiaz, J. M. van der Werf, S. Brinkkemper, The use
and effectiveness of user stories in practice, in: M. Daneva, O. Pastor
(Eds.), Requirements Engineering: Foundation for Software Quality,
Springer International Publishing, Cham, 2016, pp. 205–222.
[24] M. Cohn, User Stories Applied: For Agile Software Development,
Addison Wesley, 2004.
[25] S. Jiménez, R. Juárez-Ramírez, A quality framework for evaluating grammatical structure of user stories to improve external quality, in: Proceedings of 7th International Conference in Software Engineering Research and Innovation (CONISOFT), 2019, pp. 147–153. doi:10.1109/CONISOFT.2019.00029.
[26] G. Lucassen, F. Dalpiaz, J. M. van der Werf, S. Brinkkemper, Forging high-quality user stories: Towards a discipline for agile requirements, in: Proceedings of IEEE 23rd International Requirements Engineering Conference (RE), 2015, pp. 126–135. doi:10.1109/RE.2015.7320415.
[27] P. Heck, A. Zaidman, A quality framework for agile requirements: A
practitioner’s perspective, 2014. arXiv:1406.4692.
[28] W. B. A. Karaa, Z. B. Azzouz, A. Singh, N. Dey, A. S. Ashour, H. B.
Ghézala, Automatic builder of class diagram (ABCD): an application
of UML generation from functional requirements, Softw. Pract. Exp.
46 (2016) 1443–1458.
[29] M. Elallaoui, K. Nafil, R. Touahni, Automatic transformation of user
stories into UML use case diagrams using NLP techniques, Procedia
Computer Science 130 (2018) 42–49. The 9th International Confer-
ence on Ambient Systems, Networks and Technologies (ANT 2018) /
The 8th International Conference on Sustainable Energy Information
Technology (SEIT-2018) / Affiliated Workshops.
[30] S. Nasiri, Y. Rhazali, M. Lahmer, N. Chenfour, Towards a genera-
tion of class diagram from user stories in agile methods, Procedia
Computer Science 170 (2020) 831–837. The 11th International Con-
ference on Ambient Systems, Networks and Technologies (ANT) /
The 3rd International Conference on Emerging Data and Industry 4.0
(EDI40) / Affiliated Workshops.
[31] G. Lucassen, M. Robeer, F. Dalpiaz, J. M. E. M. van der Werf,
S. Brinkkemper, Extracting conceptual models from user stories with
visual narrator, Requir. Eng. 22 (2017) 339–358.
[32] M. Robeer, G. Lucassen, J. M. van der Werf, F. Dalpiaz, S. Brinkkemper, Automated extraction of conceptual models from user stories via NLP, in: Proceedings of IEEE 24th International Requirements Engineering Conference (RE), 2016, pp. 196–205. doi:10.1109/RE.2016.40.
Francesco Casillo et al.: Preprint submitted to Elsevier Page 14 of 15
[33] F. Gilson, C. Irwin, From user stories to use case scenarios towards a generative approach, in: Proceedings of 25th Australasian Software Engineering Conference (ASWEC), 2018, pp. 61–65. doi:10.1109/ASWEC.2018.00016.
[34] L. Müter, T. Deoskar, M. Mathijssen, S. Brinkkemper, F. Dalpiaz,
Refinement of user stories into backlog items: Linguistic structure
and action verbs, in: E. Knauss, M. Goedicke (Eds.), Requirements
Engineering: Foundation for Software Quality, Springer International
Publishing, Cham, 2019, pp. 109–116.
[35] P. Rane, Automatic Generation of Test Cases for Agile using Natural
Language Processing, Ph.D. thesis, Virginia Tech, 2017.
[36] F. Gilson, M. Galster, F. Georis, Extracting quality attributes from
user stories for early architecture decision making, in: Proceedings of
IEEE International Conference on Software Architecture Companion
(ICSA-C), 2019, pp. 129–136. doi:10.1109/ICSA-C.2019.00031.
[37] H. Villamizar, A. Anderlin Neto, M. Kalinowski, A. Garcia, D. Méndez, An approach for reviewing security-related aspects in agile requirements specifications of web applications, in: Proceedings of IEEE 27th International Requirements Engineering Conference (RE), 2019, pp. 86–97. doi:10.1109/RE.2019.00020.
[38] M. Riaz, J. King, J. Slankas, L. Williams, Hidden in plain sight: Automatically identifying security requirements from natural language artifacts, in: Proceedings of IEEE 22nd International Requirements Engineering Conference (RE), 2014, pp. 183–192. doi:10.1109/RE.2014.6912260.
[39] K. Barker, M. Askari, M. Banerjee, K. Ghazinour, B. Mackas, M. Majedi, S. Pun, A. Williams, A data privacy taxonomy, in: Proceedings of the 26th British National Conference on Databases: Dataspace: The Final Frontier, BNCOD 26, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 42–54. doi:10.1007/978-3-642-02843-4_7.
[40] S. De Capitani Di Vimercati, S. Foresti, G. Livraga, P. Samarati, Data
privacy: Definitions and techniques, International Journal of Uncer-
tainty, Fuzziness and Knowledge-Based Systems 20 (2012) 793–817.
[41] A. J. Gill, A. Vasalou, C. Papoutsi, A. N. Joinson, Privacy dic-
tionary: A linguistic taxonomy of privacy for content analysis, in:
Proceedings of the SIGCHI Conference on Human Factors in Com-
puting Systems, ACM, New York, NY, USA, 2011, p. 3227–3236.
doi:10.1145/1978942.1979421.
[42] A. Vasalou, A. Gill, F. Mazanderani, C. Papoutsi, A. Joinson, Privacy
dictionary: A new resource for the automated content analysis of pri-
vacy, Journal of the American Society for Information Science and
Technology (JASIST) 62 (2011) 2095–2105.
[43] P. Silva, C. Gonçalves, C. Godinho, N. Antunes, M. Curado, Using NLP and machine learning to detect data privacy violations, in: Proceedings of IEEE Conference on Computer Communications Workshops, 2020, pp. 972–977. doi:10.1109/INFOCOMWKSHPS50562.2020.9162683.
[44] W. B. Tesfay, J. Serna, K. Rannenberg, Privacybot: Detecting privacy sensitive information in unstructured texts, in: Proceedings of Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), 2019, pp. 53–60. doi:10.1109/SNAMS.2019.8931855.
[45] S. Sheth, G. Kaiser, W. Maalej, Us and them: A study of privacy
requirements across north america, asia, and europe, in: Proceedings
of the 36th International Conference on Software Engineering, ICSE
2014, Association for Computing Machinery, New York, NY, USA,
2014, p. 859–870. doi:10.1145/2568225.2568244.
[46] D. A. Evans, C. Zhai, Noun phrase analysis in unrestricted text for information retrieval, in: Proceedings of 34th Annual Meeting of the Association for Computational Linguistics, ACL, Santa Cruz, California, USA, 1996, pp. 17–24. URL: https://aclanthology.org/P96-1003. doi:10.3115/981863.981866.
[47] T. Luong, H. Pham, C. D. Manning, Effective approaches to attention-
based neural machine translation, ArXiv abs/1508.04025 (2015).
[48] F. Dalpiaz, I. Van Der Schalk, S. Brinkkemper, F. B. Aydemir, G. Lu-
cassen, Detecting terminological ambiguity in user stories: Tool and
experimentation, Inf. Softw. Technol. 110 (2019) 3–16.
[49] D. M. W. Powers, Evaluation: from precision, recall and f-measure to
ROC, informedness, markedness and correlation, Journal of Machine
Learning Technologies 2 (2011) 37–63.
[50] R. L. Wasserstein, N. A. Lazar, The ASA statement on p-values: Con-
text, process, and purpose, The American Statistician 70 (2016) 129–
133.
[51] A. Fernández, S. García, J. Luengo, E. Bernadó, F. Herrera, Genetics-based machine learning for rule induction: State of the art, taxonomy, and comparative study, IEEE Transactions on Evolutionary Computation 14 (2011) 913–941.
[52] S. Salzberg, On comparing classifiers: Pitfalls to avoid and a recom-
mended approach, Data Mining and Knowledge Discovery 1 (1997)
317–328.
[53] N. Japkowicz, M. Shah, Evaluating Learning Algorithms: A Classification Perspective, Cambridge University Press, 2011. doi:10.1017/CBO9780511921803.