Article

Abstract

Context: To provide privacy-aware software systems, it is crucial to consider privacy from the very beginning of development. However, developers often lack the expertise and knowledge required to embed the legal and social requirements for data protection into software systems. Objective: We present an approach to decrease privacy risks during agile software development by automatically detecting privacy-related information in user story requirements, a prominent notation in agile Requirements Engineering (RE). Methods: The proposed approach combines Natural Language Processing (NLP) and linguistic resources with deep learning algorithms to identify privacy aspects in user stories. NLP technologies are used to extract information about the semantic and syntactic structure of the text. This information is then processed by a pre-trained convolutional neural network, which enables the application of a transfer learning technique. We evaluate the proposed approach in an empirical study with a dataset of 1680 user stories. Results: The experimental results show that deep learning algorithms yield better predictions than conventional (shallow) machine learning methods. Moreover, applying transfer learning improves prediction accuracy considerably, by about 10%. Conclusions: Our study encourages software engineering researchers to consider the opportunities to automate privacy detection in the early phases of design, also by exploiting transfer learning models.
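As an illustration of the kind of pipeline the abstract describes, the following minimal PyTorch sketch shows a convolutional text classifier for user stories whose feature extractor can be pre-trained on a related disclosure corpus and then frozen while only the head is fine-tuned. All layer sizes, names, and the checkpoint path are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class DisclosureCNN(nn.Module):
        """Binary classifier flagging privacy-related content in user stories."""
        def __init__(self, vocab_size=20000, embed_dim=100, n_filters=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size=5, padding=2)
            self.head = nn.Linear(n_filters, 1)

        def forward(self, token_ids):                   # (batch, seq_len)
            x = self.embed(token_ids).transpose(1, 2)   # (batch, embed_dim, seq_len)
            x = torch.relu(self.conv(x))
            x = x.max(dim=2).values                     # global max pooling over time
            return self.head(x)                         # one logit per user story

    # Transfer learning sketch: load weights pre-trained on a related disclosure
    # corpus (hypothetical checkpoint), freeze the feature extractor, and
    # fine-tune only the classification head on the user-story data set.
    model = DisclosureCNN()
    # model.load_state_dict(torch.load("pretrained_disclosure_cnn.pt"))
    for p in list(model.embed.parameters()) + list(model.conv.parameters()):
        p.requires_grad = False
    optimizer = torch.optim.Adam(model.head.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()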


... NLP has been used in many different fields such as email control systems, computer security, and requirements engineering [5,8]. It is useful in identifying word similarity. ...
Article
Nowadays, customer-centric innovation is important in affective design, leading to the design and development of products that fit the needs of a group of target customers for sales and marketing. Although there is much research on customer-centric innovation and affective product design, designing a novel and innovative product that appeals to customers remains difficult. This is due not only to the difficulty of knowing customers' preferences, enabling the product functionality, and so on, but also to the complexity of optimizing both the technical and aesthetic design factors and parameters for a group of customers. The technical design specification is one of the critical aspects of designing innovative technological products. In fact, a similar group of target customers usually has hidden preferences that provide clues for identifying the design of a product. To this end, we propose an optimization technique that combines a hybrid fuzzy Analytical Hierarchy Process with an integral-based Taguchi method, called the fuzzy multi-criteria decision-making (MCDM) approach, to determine the essential factors and design parameters needed to understand the preferences of customers and the technical requirements of engineers when designing the enclosure of a product. We demonstrate the methodology on a virtual reality (VR) headset, which can be used for visualizing the metaverse in a virtual environment. The suggested fuzzy MCDM approach combines technical and aesthetic features to enhance the robustness of the product design.
... Therefore, transfer learning may be another potential solution. Transfer learning has shown great potential in image segmentation (Banerjee et al., 2022), image classification (Ju et al., 2021), natural language processing (Casillo et al., 2022) and other fields. Too et al. (2019) and Chen et al. (2020) reported analyses of state-of-the-art deep transfer learning models for plant disease and species classification with a fine-tuning strategy. ...
Article
Full-text available
Different quality grades of tea tend to have a high degree of similarity in appearance. Traditional image-based identification methods have limited effectiveness, while complex deep learning architectures require much data and long training. In this paper, two tea quality identification methods based on deep convolutional neural networks and transfer learning are proposed. Images of different types and qualities of tea are collected by a self-designed computer vision system to form a data set, which is small-scale and of high inter- and intraclass similarity. The first method uses three simplified convolutional neural network (CNN) models with different image input sizes to identify the quality of tea. The second method performs transfer learning to identify the tea quality by fine-tuning the mature AlexNet and ResNet50 architectures. Classification performance and model complexity are measured and compared. The related application software is also developed. The results show that the performance of the CNN models and the transfer learning models is close, and both can achieve high identification accuracy. However, the complexity of the CNN models is two to three orders of magnitude lower than that of the transfer learning models. The study shows that deep CNNs and transfer learning have great potential as rapid and effective methods for automated tea quality identification tasks with high inter- and intraclass similarity.
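For the second method described above, a typical fine-tuning setup replaces the classifier head of a pre-trained network and freezes the backbone. The sketch below is a generic torchvision example; the number of tea grades and the training details are assumed rather than taken from the paper.

    import torch
    import torch.nn as nn
    from torchvision import models

    num_classes = 4  # assumed number of tea quality grades

    # Load an ImageNet-pre-trained ResNet50 and freeze its backbone.
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    for param in model.parameters():
        param.requires_grad = False

    # Replace the final fully connected layer with a new, trainable head.
    model.fc = nn.Linear(model.fc.in_features, num_classes)

    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()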
... Typically, this involves a layer that performs a dot product or another pointwise multiplication and is coupled with an activation function. This is followed by other layers, such as pooling layers, fully connected layers, and normalization layers [66]. ...
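To make the layer ordering in that description concrete, here is a tiny PyTorch stack with one convolution plus activation followed by pooling, normalization, and a fully connected layer; the input size and channel counts are arbitrary choices for illustration.

    import torch.nn as nn

    # Illustrative CNN block: convolution + activation, then pooling,
    # normalization and a fully connected layer (assumes 3x32x32 inputs).
    cnn = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),  # element-wise multiply-accumulate
        nn.ReLU(),                                   # activation function
        nn.MaxPool2d(2),                             # pooling layer -> 16 x 16 x 16
        nn.BatchNorm2d(16),                          # normalization layer
        nn.Flatten(),
        nn.Linear(16 * 16 * 16, 10),                 # fully connected layer
    )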
Article
Full-text available
Today, the internet and social media are used by many people, both for communication and for expressing opinions about various topics in many domains of life. Various artificial-intelligence-based approaches for analyzing these opinions have emerged in natural language processing under different task names. One of these tasks is sentiment analysis, a popular method for analyzing people's opinions that provides a powerful decision-making tool for people, companies, governments, and researchers. It is of interest to investigate the effect of combining multi-layered and different neural networks on the performance of a model developed for the sentiment analysis task. In this study, a new deep-learning-based model is proposed for sentiment analysis on the IMDB movie reviews dataset. This model performs sentiment classification on reviews vectorized with two Word2Vec methods, namely Skip-Gram and Continuous Bag of Words, in three different vector sizes (100, 200, 300), with the help of 6 Bidirectional Gated Recurrent Units and 2 convolution layers (MBi-GRUMCONV). In the experiments conducted with the proposed model, the dataset was split into 80%-20% and 70%-30% training-test sets, and 10% of the training splits were used for validation purposes. Accuracy and F1 score were used to evaluate classification performance. The proposed model achieved 95.34% accuracy, outperforming studies in the literature. The experiments also showed that Skip-Gram contributes more to classification success.
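The two Word2Vec variants mentioned in the abstract differ only in a training flag in common implementations. The following gensim snippet is a generic illustration (toy corpus, not the IMDB data) of producing Skip-Gram and CBOW vectors at size 100.

    from gensim.models import Word2Vec

    # Toy tokenized corpus standing in for the preprocessed movie reviews.
    reviews = [["the", "movie", "was", "great"],
               ["terrible", "plot", "and", "acting"]]

    # sg=1 selects Skip-Gram, sg=0 selects Continuous Bag of Words; the study
    # trains both at vector sizes 100, 200 and 300.
    skipgram = Word2Vec(reviews, vector_size=100, window=5, min_count=1, sg=1)
    cbow = Word2Vec(reviews, vector_size=100, window=5, min_count=1, sg=0)

    vector = skipgram.wv["movie"]  # 100-dimensional embedding fed to the network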
... 2) Disclosure Detection: Not every user story is subject to privacy requirements. To identify user stories that might have privacy issues, we use a text classifier trained on data provided by Casillo [6]. The model gives the probability that a user story potentially discloses privacy-sensitive information. ...
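The probability-based screening described in that passage can be approximated with any probabilistic text classifier. The sketch below uses a generic scikit-learn pipeline with made-up training examples; it is not the model trained on Casillo's data.

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # Hypothetical labelled stories: 1 = potential privacy disclosure, 0 = none.
    stories = ["As a user, I want to share my home address with sellers",
               "As a user, I want to sort search results by price"]
    labels = [1, 0]

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(stories, labels)

    # Probability that a new user story discloses privacy-sensitive information.
    p_disclosure = clf.predict_proba(
        ["As a user, I want to upload my medical records"])[0, 1]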
... But it is difficult to accumulate data of such an order of magnitude in medicine, where usually only thousands of images can be collected, let alone for rare diseases. Transfer learning [22] is a machine learning method that focuses on storing solutions to existing problems and leveraging them for other, different but related problems, without the need to re-collect and calibrate, at great cost, huge new data sets (which may sometimes be entirely unavailable). For emerging fields, it can be transferred and applied quickly, offering the advantage of timeliness. ...
Article
Objective: To investigate the diagnostic value of deep learning (DL) in differentiating otitis media (OM) caused by otitis media with effusion (OME) from that caused by primary ciliary dyskinesia (PCD), so as to provide a reference for early intervention. Methods: From January 2010 to January 2021, 31 patients with PCD who had temporal bone computed tomography (TBCT) at the Children's Hospital of Fudan University were retrospectively analyzed. Another 30 age-matched cases of OME with TBCT were collected as the control group. The CT imaging signatures of the children were observed. In addition, a variety of DL neural network models were trained based on PyTorch, and the optimal models were selected for PCD screening. Results: The GoogLeNet-trained model worked best, with an accuracy of 0.99. Vgg16_bn, vgg19_bn, resnet18, and resnet34, neural networks with fewer layers, also yielded good models, with accuracy rates of 0.86, 0.90, 0.86, and 0.86, respectively. Resnet50 and other neural networks with more layers had relatively poor results. Conclusion: DL-based CT radiomics can accurately distinguish OM caused by OME from that induced by PCD and can be used for screening PCD.
... Casillo et al. [11] proposed an approach to decrease data privacy-related risks during agile software development. The authors used natural language processing and deep learning algorithms to automatically detect privacy requirements in 1680 user stories from various software projects. ...
Article
Full-text available
During the software development process and throughout the software lifecycle, organizations must guarantee users’ privacy by protecting personal data. There are several studies in the literature proposing methodologies, techniques, and tools for privacy requirements elicitation. These studies report that practitioners must use systematic approaches to specify these requirements during initial software development activities to avoid breaches of users’ data privacy. The main goal of this study is to identify which methodologies, techniques, and tools are used for privacy requirements elicitation in the literature. We have also investigated Information Technology (IT) practitioners’ perceptions regarding the methodologies, techniques, and tools identified in the literature. We have carried out a systematic literature review (SLR) to identify the methodologies, techniques, and tools used for privacy requirements elicitation. In addition, we have surveyed IT practitioners to understand their perception of using these techniques and tools in the software development process. We have found several methodologies, techniques, and tools proposed in the literature to carry out privacy requirements elicitation. Out of 78 studies cataloged within the SLR, most did not verify their methodologies and techniques in a practical case study or illustrative context (38 studies), and less than 35% of them (26 studies) experimented with their propositions within an industry context. The Privacy Safeguard method (PriS) is the best known among the 198 practitioners in the industry who participated in the survey. Moreover, use cases and user stories are their most-used techniques. This qualitative and quantitative study shows a perception of IT practitioners different from those presented in other research papers and suggests that methodologies, techniques, and tools play an important role in IT practitioners’ perceptions about privacy requirements elicitation.
Article
Full-text available
Text classification is the most fundamental and essential task in natural language processing. The last decade has seen a surge of research in this area due to the unprecedented success of deep learning. Numerous methods, datasets, and evaluation metrics have been proposed in the literature, raising the need for a comprehensive and updated survey. This paper fills the gap by reviewing state-of-the-art approaches from 1961 to 2021, ranging from traditional models to deep learning. We create a taxonomy for text classification according to the text involved and the models used for feature extraction and classification. We then discuss each of these categories in detail, dealing with both the technical developments and the benchmark datasets that support tests of predictions. A comprehensive comparison between different techniques, as well as an identification of the pros and cons of various evaluation metrics, is also provided in this survey. Finally, we conclude by summarizing key implications, future research directions, and the challenges facing the research area.
Article
Full-text available
Over the last two decades, various security authorities around the world have acknowledged the importance of exploiting the ever-growing amount of information published on the web about various types of events for early detection of certain threats, situation monitoring and risk analysis. Since the information related to a particular real-world event might be scattered across various sources and mentioned on different dates, an important task is to link together all event mentions that are interrelated. This article studies the application of various statistical and machine learning techniques to solve a new application-oriented variation of the task of event pair relatedness classification, which merges different fine-grained event relation types reported elsewhere into one concept. The task focuses on linking event templates automatically extracted from online news by an existing event extraction system, which contain only short text snippets and potentially erroneous and incomplete information. Results of exploring the performance of shallow learning methods such as decision-tree-based random forests and gradient boosted tree ensembles (XGBoost), along with kernel-based support vector machines (SVM), are presented in comparison to both simpler shallow learners and a deep learning approach based on a long short-term memory (LSTM) recurrent neural network. Our experiments focus on using linguistically lightweight features (some of which are not reported elsewhere) which are easily portable across languages. We obtained F1 scores ranging from 92% (simplest shallow learner) to 96.4% (LSTM-based recurrent neural network), evaluated on a newly created event linking corpus.
Article
Full-text available
Model-Driven Architecture (MDA) is a framework for software development processes that allows an automatic transformation from a business process model to the code model. In MDA there are two kinds of transformation: transformation from the computation-independent model (CIM) to the platform-independent model (PIM), and transformation from the PIM to the platform-specific model (PSM). In this paper, we focus on the CIM-to-PIM transformation. This transformation is performed by developing a platform that generates a class diagram, presented as an XMI file, from specifications presented as user stories written in natural language (English). We used a natural language processing (NLP) tool named "Stanford CoreNLP" for extracting the object-oriented design elements. Applying our approach to several case studies has given good results.
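The paper relies on Stanford CoreNLP; purely to illustrate the kind of syntactic information such a transformation consumes, the sketch below extracts candidate classes (noun-phrase heads) and candidate operations (verbs) from a user story with spaCy, which is an assumed substitute rather than the authors' tool chain.

    import spacy

    nlp = spacy.load("en_core_web_sm")
    story = "As a customer, I want to add a product to my shopping cart."
    doc = nlp(story)

    # Noun-phrase heads are candidate classes/attributes; verbs are candidate
    # operations for the generated class diagram.
    candidate_classes = sorted({chunk.root.lemma_ for chunk in doc.noun_chunks
                                if chunk.root.pos_ != "PRON"})
    candidate_methods = [tok.lemma_ for tok in doc if tok.pos_ == "VERB"]

    print(candidate_classes)  # e.g. ['cart', 'customer', 'product'] (model-dependent)
    print(candidate_methods)  # e.g. ['want', 'add']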
Article
Full-text available
Objective: Automated clinical phenotyping is challenging because word-based features quickly turn it into a high-dimensional problem, in which the small, privacy-restricted, training datasets might lead to overfitting. Pretrained embeddings might solve this issue by reusing input representation schemes trained on a larger dataset. We sought to evaluate shallow and deep learning text classifiers and the impact of pretrained embeddings in a small clinical dataset. Materials and methods: We participated in the 2018 National NLP Clinical Challenges (n2c2) Shared Task on cohort selection and received an annotated dataset with medical narratives of 202 patients for multilabel binary text classification. We set our baseline to a majority classifier, to which we compared a rule-based classifier and orthogonal machine learning strategies: support vector machines, logistic regression, and long short-term memory neural networks. We evaluated logistic regression and long short-term memory using both self-trained and pretrained BioWordVec word embeddings as input representation schemes. Results: Rule-based classifier showed the highest overall micro F1 score (0.9100), with which we finished first in the challenge. Shallow machine learning strategies showed lower overall micro F1 scores, but still higher than deep learning strategies and the baseline. We could not show a difference in classification efficiency between self-trained and pretrained embeddings. Discussion: Clinical context, negation, and value-based criteria hindered shallow machine learning approaches, while deep learning strategies could not capture the term diversity due to the small training dataset. Conclusion: Shallow methods for clinical phenotyping can still outperform deep learning methods in small imbalanced data, even when supported by pretrained embeddings.
Conference Paper
Full-text available
Defects in requirements specifications can have severe consequences during the software development lifecycle. Some of them result in overall project failure due to incorrect or missing quality characteristics such as security. There are several concerns that make security difficult to deal with; for instance, (1) when stakeholders discuss general requirements in (review) meetings, they are often not aware that they should also discuss security-related topics, and (2) they typically do not have enough security expertise. These concerns become even more challenging in agile development contexts, where lightweight documentation is typically involved. The goal of this paper is to design and evaluate an approach to support reviewing security-related aspects in agile requirements specifications of web applications. The designed approach considers user stories and security specifications as input and relates those user stories to security properties via Natural Language Processing (NLP) techniques. Based on the related security properties, our approach then identifies high-level security requirements from the Open Web Application Security Project (OWASP) to be verified and generates a focused reading technique to support reviewers in detecting defects. We evaluate our approach via two controlled experiment trials. We compare the effectiveness and efficiency of novice inspectors verifying security aspects in agile requirements using our reading technique against using the complete list of OWASP high-level security requirements. The (statistically significant) results indicate that using the reading technique has a positive impact (with very large effect size) on the performance of inspectors in terms of effectiveness and efficiency.
Chapter
Full-text available
An increasing number of people are sharing information through text messages, emails, and social media without proper privacy checks. In many situations, this could lead to serious privacy threats. This paper presents a methodology for providing extra safety precautions without being intrusive to users. We have developed and evaluated a model to help users take control of their shared information by automatically identifying text (i.e., a sentence or a transcribed utterance) that might contain personal or private disclosures. We apply off-the-shelf natural language processing tools to derive linguistic features such as part-of-speech tags, syntactic dependencies, and entity relations. From these features, we model and train a multichannel convolutional neural network as a classifier to identify short texts that contain personal, private disclosures. We show how our model can notify users if a piece of text discloses personal or private information, and evaluate our approach in a binary classification task, achieving 93% accuracy on our own labeled dataset and 86% on a ground-truth dataset. Unlike document classification tasks in the area of natural language processing, our framework is developed with the sentence-level context taken into consideration.
Conference Paper
Full-text available
Software quality attributes (e.g., security, performance) influence software architecture design decisions, e.g., when choosing technologies, patterns or tactics. As software developers are moving from big upfront design to an evolutionary or emerging design, the architecture of a system evolves as more functionality is added. In agile software development, functional user requirements are often expressed as user stories. Quality attributes might be implicitly referenced in user stories. To support a more systematic analysis and reasoning about quality attributes in agile development projects, this paper explores how to automatically identify quality attributes from user stories. This could help better understand relevant quality attributes (and potential architectural key drivers) before analysing product backlogs and domains in detail and provides the "bigger picture" of potential architectural drivers for early architecture decision making. The goal of this paper is to present our vision and preliminary work towards understanding whether user stories do include information about quality attributes at all, and if so, how we can identify such information in an automated manner. Index Terms: agile software development, software architecture, decision making, machine learning, natural language processing
Chapter
Full-text available
[Context and motivation] In agile system development methods, product backlog items (or tasks) play a prominent role in the refinement process of software requirements. Tasks are typically defined manually to operationalize how to implement a user story; task formulation often exhibits low quality, perhaps due to the tedious nature of decomposing user stories into tasks. [Question/Problem] We investigate the process through which user stories are refined into tasks. [Principal ideas/results] We study a large collection of backlog items (N = 1,593), expressed as user stories and sprint tasks, looking for linguistic patterns that characterize the required feature of the user story requirement. Through a linguistic analysis of sentence structures and action verbs (the main verb in the sentence that indicates the task), we discover patterns of labeling refinements, and explore new ways for refinement process improvement. [Contribution] By identifying a set of 7 elementary action verbs and a template for task labels, we make first steps towards comprehending the refinement of user stories into backlog items.
Article
Full-text available
Context. Defects such as ambiguity and incompleteness are pervasive in software requirements, often due to the limited time that practitioners devote to writing good requirements. Objective. We study whether a synergy between humans’ analytic capabilities and natural language processing is an effective approach for quickly identifying near-synonyms, a possible source of terminological ambiguity. Method. We propose a tool-supported approach that blends information visualization with two natural language processing techniques: conceptual model extraction and semantic similarity. We evaluate the precision and recall of our approach compared to a pen-and-paper manual inspection session through a controlled quasi-experiment that involves 57 participants organized into 28 groups, each group working on one real-world requirements data set. Results. The experimental results indicate that manual inspection delivers higher recall (statistically significant with p ≤ 0.01) and non-significantly higher precision. Based on qualitative observations, we analyze the quantitative results and suggest interpretations that explain the advantages and disadvantages of each approach. Conclusions. Our experiment confirms conventional wisdom in requirements engineering: identifying terminological ambiguities is time consuming, even with tool support, and it is hard to determine whether a near-synonym may challenge the correct development of a software system. The results suggest that the most effective approach may be a combination of manual inspection with an improved version of our tool.
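The semantic-similarity side of that approach can be approximated with pre-trained word vectors: flag term pairs whose similarity exceeds a threshold as candidate near-synonyms for human inspection. The fragment below is only a rough stand-in; the spaCy model and the 0.7 threshold are assumptions, not values from the paper, and the conceptual-model-extraction step is omitted.

    import spacy

    nlp = spacy.load("en_core_web_md")  # medium model ships with word vectors
    terms = ["user", "customer", "invoice", "bill"]
    threshold = 0.7  # assumed cut-off for flagging a pair for inspection

    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            sim = nlp(a).similarity(nlp(b))
            if sim >= threshold:
                print(f"possible near-synonyms: {a} / {b} ({sim:.2f})")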
Conference Paper
Full-text available
User stories are increasingly adopted as the basis of requirements engineering artefacts in Agile Software Development. Surveys have shown that user stories are perceived as being effective at describing the main goals of a system. But the continuous management of a product backlog may be particularly time-consuming and error-prone, especially when assessing the quality or scope of user stories and keeping an eye on the system's big picture. On the other hand, models have been recognised as effective tools for communication and analysis purposes. In this research, we propose a generative approach to create robustness diagrams, i.e., a form of semi-formal use case scenarios, from the automated analysis of user stories. Stories are transformed into diagrams, enabling requirements engineers and users to validate the main concepts and functional steps behind stories and discover malformed or redundant stories. Such models also open the door for automated systematic analysis.
Article
Full-text available
Agile methods in general, and the Scrum method in particular, are gaining more and more trust from the software developer community. When it comes to writing functional requirements, user stories are becoming more and more widely used by the community. Furthermore, a considerable effort has already been made by the community regarding the use of use cases when drafting requirements and in terms of model transformation; a certain stage of maturity has been reached at this level. The idea of our paper is to profit from this richness and invest it in the drafting of user stories. In this paper, we propose a process for transforming user stories into use cases, so as to benefit from all the work done on model transformation according to the MDA approach. To do this, we used natural language processing (NLP) techniques, applying the TreeTagger parser. Our work was validated by a case study in which we obtained very positive precision values, between 87% and 98%.
Article
Full-text available
Extracting conceptual models from natural language requirements can help identify dependencies, redundancies, and conflicts between requirements via a holistic and easy-to-understand view that is generated from lengthy textual specifications. Unfortunately, existing approaches never gained traction in practice, because they either require substantial human involvement or they deliver too low accuracy. In this paper, we propose an automated approach called Visual Narrator based on natural language processing that extracts conceptual models from user story requirements. We choose this notation because of its popularity among (agile) practitioners and its focus on the essential components of a requirement: Who? What? Why? Coupled with a careful selection and tuning of heuristics, we show how Visual Narrator enables generating conceptual models from user stories with high accuracy. Visual Narrator is part of the holistic Grimm method for user story collaboration that ranges from elicitation to the interactive visualization and analysis of requirements.
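Visual Narrator's Who?/What?/Why? decomposition starts from the standard user-story template. A crude regular-expression version of that first step is sketched below; the real tool uses NLP heuristics, and this pattern is only an approximation for illustration.

    import re

    PATTERN = re.compile(
        r"As an? (?P<role>.+?), I want (?:to )?(?P<action>.+?)"
        r"(?:,? so that (?P<benefit>.+?))?\.?$",
        re.IGNORECASE,
    )

    story = "As a visitor, I want to book a ticket online, so that I can skip the queue."
    match = PATTERN.match(story)
    if match:
        print(match.group("role"))     # Who?  -> visitor
        print(match.group("action"))   # What? -> book a ticket online
        print(match.group("benefit"))  # Why?  -> I can skip the queue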
Conference Paper
Full-text available
[Context and motivation] User stories are an increasingly popular textual notation to capture requirements in agile software development. [Question/Problem] To date there is no scientific evidence on the effectiveness of user stories. The goal of this paper is to explore how practitioners perceive this artifact in the context of requirements engineering. [Principal ideas/results] We explore the perceived effectiveness of user stories by reporting on a survey with 182 responses from practitioners and 21 follow-up semi-structured interviews. The data shows that practitioners agree that using user stories, a user story template and quality guidelines such as the INVEST mnemonic improve their productivity and the quality of their work deliverables. [Contribution] By combining the survey data with 21 semi-structured follow-up interviews, we present 12 findings on the usage and perception of user stories by practitioners that employ user stories in their everyday work environment.
Article
Full-text available
Requirements Engineering (RE) has received much attention in research and practice due to its importance to software project success. Its inter-disciplinary nature, the dependency on the customer, and its inherent uncertainty still render the discipline difficult to investigate. This results in a lack of empirical data. These are necessary, however, to demonstrate which practically relevant RE problems exist and to what extent they matter. Motivated by this situation, we initiated the Naming the Pain in Requirements Engineering (NaPiRE) initiative which constitutes a globally distributed, bi-yearly replicated family of surveys on the status quo and problems in practical RE. In this article, we report on the qualitative analysis of data obtained from 228 companies working in 10 countries in various domains and we reveal which contemporary problems practitioners encounter. To this end, we analyse 21 problems derived from the literature with respect to their relevance and criticality in dependency to their context, and we complement this picture with a cause-effect analysis showing the causes and effects surrounding the most critical problems. Our results give us a better understanding of which problems exist and how they manifest themselves in practical environments. Thus, we provide a first step to ground contributions to RE on empirical observations which, until now, were dominated by conventional wisdom only.
Conference Paper
Full-text available
User stories are a widely used notation for formulating requirements in agile development. Despite their popularity in industry, little to no academic work is available on determining their quality. The few existing approaches are too generic or employ highly qualitative metrics. We propose the Quality User Story Framework, consisting of 14 quality criteria that user stories should strive to conform to. Additionally, we introduce the conceptual model of a user story, which we rely on to subsequently design the AQUSA tool. This conceptual piece of software aids requirements engineers in turning raw user stories into higher quality ones by exposing defects and deviations from good practice in user stories. We evaluate our work by applying the framework and a prototype implementation to multiple case studies.
Article
Full-text available
Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
Article
Full-text available
Commonly used evaluation measures including Recall, Precision, F-Measure and Rand Accuracy are biased and should not be used without a clear understanding of the biases, and corresponding identification of chance or base case levels of the statistic. Using these measures a system that performs worse in the objective sense of Informedness can appear to perform better under any of these commonly used measures. We discuss several concepts and measures that reflect the probability that prediction is informed versus chance, which we call Informedness, and introduce Markedness as a dual measure for the probability that prediction is marked versus chance. Finally we demonstrate elegant connections between the concepts of Informedness, Markedness, Correlation and Significance as well as their intuitive relationships with Recall and Precision, and outline the extension from the dichotomous case to the general multi-class case.
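In the dichotomous case, Informedness is commonly computed as recall plus inverse recall minus one, and Markedness as precision plus inverse precision minus one. A small helper that computes both from a 2x2 confusion matrix might look as follows (the example counts are invented).

    def informedness_markedness(tp, fp, fn, tn):
        """Bookmaker Informedness and Markedness from a 2x2 confusion matrix."""
        recall = tp / (tp + fn)                # true positive rate
        inverse_recall = tn / (tn + fp)        # true negative rate
        precision = tp / (tp + fp)
        inverse_precision = tn / (tn + fn)
        informedness = recall + inverse_recall - 1.0       # prediction informed vs. chance
        markedness = precision + inverse_precision - 1.0   # prediction marked vs. chance
        return informedness, markedness

    print(informedness_markedness(tp=40, fp=10, fn=5, tn=45))  # invented counts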
Article
Full-text available
The classification problem can be addressed by numerous techniques and algorithms which belong to different paradigms of machine learning. In this paper, we are interested in evolutionary algorithms, the so-called genetics-based machine learning algorithms. In particular, we will focus on evolutionary approaches that evolve a set of rules, i.e., evolutionary rule-based systems, applied to classification tasks, in order to provide a state of the art in this field. This paper has a double aim: to present a taxonomy of the genetics-based machine learning approaches for rule induction, and to develop an empirical analysis both for standard classification and for classification with imbalanced data sets. We also include a comparative study of the genetics-based machine learning (GBML) methods with some classical non-evolutionary algorithms, in order to observe the suitability and high potential of the search performed by evolutionary algorithms and the behavior of the GBML algorithms in contrast to the classical approaches, in terms of classification accuracy.
Conference Paper
Full-text available
Privacy is frequently a key concern relating to technology and central to HCI research, yet it is notoriously difficult to study in a naturalistic way. In this paper we describe and evaluate a dictionary of privacy designed for content analysis, derived using prototype theory and informed by traditional theoretical approaches to privacy. We evaluate our dictionary categories alongside privacy-related categories from an existing content analysis tool, LIWC, using verbal discussions of privacy issues from a variety of technology and non-technology contexts. We find that our privacy dictionary is better able to distinguish between privacy and non-privacy language, and is less context-dependent than LIWC. However, the more general LIWC categories are able to describe a greater amount of variation in our data. We discuss possible improvements to the privacy dictionary and note future work.
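Dictionary-based content analysis of this kind boils down to counting how many tokens of a text fall into each category's word list. The toy categories and terms below are invented for illustration and are not the published privacy dictionary.

    import re
    from collections import Counter

    # Invented categories and terms; the real privacy dictionary is not reproduced here.
    privacy_dictionary = {
        "restriction": {"private", "confidential", "hidden"},
        "openness": {"share", "public", "disclose"},
    }

    def category_counts(text):
        tokens = re.findall(r"[a-z']+", text.lower())
        return Counter({category: sum(t in terms for t in tokens)
                        for category, terms in privacy_dictionary.items()})

    print(category_counts("I keep my profile private and never share my location."))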
Conference Paper
Full-text available
Privacy has become increasingly important to the database community, which is reflected by a noteworthy increase in research papers appearing in the literature. While researchers often assume that their definition of “privacy” is universally held by all readers, this is rarely the case; so many papers addressing key challenges in this domain have actually produced results that do not consider the same problem, even when using similar vocabularies. This paper provides an explicit definition of data privacy suitable for ongoing work in data repositories such as a DBMS or for data mining. The work contributes by briefly providing the larger context for the way privacy is defined legally and legislatively, but primarily provides a taxonomy for thinking about data privacy technologically. We then demonstrate the taxonomy’s utility by illustrating how this perspective makes it possible to understand the important contributions made by researchers to the issue of privacy. The conclusion of this paper is that privacy is indeed multifaceted, so no single current research effort adequately addresses the true breadth of the issues necessary to fully understand the scope of this important issue.
Article
Full-text available
This paper presents the privacy dictionary, a new linguistic resource for automated content analysis on privacy-related texts. To overcome the definitional challenges inherent in privacy research, the dictionary was informed by an inclusive set of relevant theoretical perspectives. Using methods from corpus linguistics, we constructed and validated eight dictionary categories on empirical material from a wide range of privacy sensitive contexts. It was shown that the dictionary categories are able to measure unique linguistic patterns within privacy discussions. At a time when privacy considerations are increasing, and online resources provide ever growing quantities of textual data, the privacy dictionary can play a significant role, not only for research in the social sciences, but also in technology design and policy making.
Conference Paper
Full-text available
This article compares traditional requirements engineering approaches and agile software development. Our paper analyzes commonalities and differences of both approaches and determines possible ways how agile software development can benefit from requirements engineering methods.
Article
Software analytics builds quality prediction models for software projects. Experience shows that (a) the more projects studied, the more varied are the conclusions; and (b) project managers lose faith in the results of software analytics if those results keep changing. To reduce this conclusion instability, we propose the use of “bellwethers”: given N projects from a community the bellwether is the project whose data yields the best predictions on all others. The bellwethers offer a way to mitigate conclusion instability because conclusions about a community are stable as long as this bellwether continues as the best oracle. Bellwethers are also simple to discover (just wrap a for-loop around standard data miners). When compared to other transfer learning methods (TCA+, transfer Naive Bayes, value cognitive boosting), using just the bellwether data to construct a simple transfer learner yields comparable predictions. Further, bellwethers appear in many SE tasks such as defect prediction, effort estimation, and bad smell detection. We hence recommend using bellwethers as a baseline method for transfer learning against which future work should be compared.
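The "for-loop around standard data miners" idea can be written down directly: train on each candidate project, score it on all the others, and keep the best oracle. The sketch below uses a random forest and F1 as stand-ins; the paper does not prescribe this particular learner or metric.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score

    def find_bellwether(projects):
        """projects: dict mapping project name -> (X, y).
        Returns the project whose data best predicts all other projects."""
        scores = {}
        for source, (X_src, y_src) in projects.items():
            model = RandomForestClassifier(random_state=0).fit(X_src, y_src)
            others = [f1_score(y_tgt, model.predict(X_tgt))
                      for target, (X_tgt, y_tgt) in projects.items()
                      if target != source]
            scores[source] = sum(others) / len(others)
        return max(scores, key=scores.get)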
Article
Transfer learning is the improvement of learning in a new task through the transfer of knowledge from a related task that has already been learned. While most machine learning algorithms are designed to address single tasks, the development of algorithms that facilitate transfer learning is a topic of ongoing interest in the machine-learning community. This chapter provides an introduction to the goals, settings, and challenges of transfer learning. It surveys current research in this area, giving an overview of the state of the art and outlining the open problems. The survey covers transfer in both inductive learning and reinforcement learning, and discusses the issues of negative transfer and task mapping.
Article
Software development life cycle is a structured process, including the definition of user requirements specification, system design, and programming. The design task comprises the transfer of natural language specifications into models. The class diagram of UML (Unified Modeling Language) has been considered one of the most useful diagrams. It is a formal description of user requirements and serves as input to the developers. The automated extraction of UML class diagrams from natural language requirements is a highly challenging task. This paper explains our vision of an automated tool for class diagram generation from user requirements expressed in natural language. Our new approach amalgamates the statistical and pattern recognition properties of Natural Language Processing (NLP) techniques. More than 1000 patterns are defined for the extraction of the class diagram concepts. Once these concepts are captured, an XML Metadata Interchange (XMI) file is generated and imported into a CASE tool to build the corresponding UML class diagram.
Article
The proper protection of data privacy is a complex task that requires a careful analysis of what actually has to be kept private. Several definitions of privacy have been proposed over the years, from traditional syntactic privacy definitions, which capture the protection degree enjoyed by data respondents with a numerical value, to more recent semantic privacy definitions, which take into consideration the mechanism chosen for releasing the data. In this paper, we illustrate the evolution of the definitions of privacy, and we survey some data protection techniques devised for enforcing such definitions. We also illustrate some well-known application scenarios in which the discussed data protection techniques have been successfully used, and present some open issues.
Article
When projects lack sufficient local data to make predictions, they try to transfer information from other projects. How can we best support this process? In the field of software engineering, transfer learning has been shown to be effective for defect prediction. This paper checks whether it is possible to build transfer learners for software effort estimation. We use data on 154 projects from 2 sources to investigate transfer learning between different time intervals and 195 projects from 51 sources to provide evidence on the value of transfer learning for traditional cross-company learning problems. We find that the same transfer learning method can be useful for transferring effort estimation results for both the cross-company learning problem and the cross-time learning problem. It is misguided to think that: (1) old data of an organization is irrelevant to the current context or (2) data of another organization cannot be used for local solutions. Transfer learning is a promising research direction that transfers relevant data across time intervals and domains.
Conference Paper
Natural language artifacts, such as requirements specifications, often explicitly state the security requirements for software systems. However, these artifacts may also imply additional security requirements that developers may overlook but should consider to strengthen the overall security of the system. The goal of this research is to aid requirements engineers in producing a more comprehensive and classified set of security requirements by (1) automatically identifying security-relevant sentences in natural language requirements artifacts, and (2) providing context-specific security requirements templates to help translate the security-relevant sentences into functional security requirements. Using machine learning techniques, we have developed a tool-assisted process that takes as input a set of natural language artifacts. Our process automatically identifies security-relevant sentences in the artifacts and classifies them according to the security objectives, either explicitly stated or implied by the sentences. We classified 10,963 sentences in six different documents from healthcare domain and extracted corresponding security objectives. Our manual analysis showed that 46% of the sentences were security-relevant. Of these, 28% explicitly mention security while 72% of the sentences are functional requirements with security implications. Using our tool, we correctly predict and classify 82% of the security objectives for all the sentences (precision). We identify 79% of all security objectives implied by the sentences within the documents (recall). Based on our analysis, we develop context-specific templates that can be instantiated into a set of functional security requirements by filling in key information from security-relevant sentences.
Article
Data privacy when using online systems like Facebook and Amazon has become an increasingly popular topic in the last few years. However, only a little is known about how users and developers perceive privacy and which concrete measures would mitigate their privacy concerns. To investigate privacy requirements, we conducted an online survey with closed and open questions and collected 408 valid responses. Our results show that users often reduce privacy to security, with data sharing and data breaches being their biggest concerns. Users are more concerned about the content of their documents and their personal data such as location than about their interaction data. Unlike users, developers clearly prefer technical measures like data anonymization and think that privacy laws and policies are less effective. We also observed interesting differences between people from different geographies. For example, people from Europe are more concerned about data breaches than people from North America. People from Asia/Pacific and Europe believe that content and metadata are more critical for privacy than people from North America. Our results contribute to developing a user-driven privacy framework that is based on empirical evidence in addition to the legal, technical, and commercial perspectives.
Article
The field of machine learning has matured to the point where many sophisticated learning approaches can be applied to practical applications. Thus it is of critical importance that researchers have the proper tools to evaluate learning approaches and understand the underlying issues. This book examines various aspects of the evaluation process with an emphasis on classification algorithms. The authors describe several techniques for classifier performance assessment, error estimation and resampling, obtaining statistical significance as well as selecting appropriate domains for evaluation. They also present a unified evaluation framework and highlight how different components of evaluation are both significantly interrelated and interdependent. The techniques presented in the book are illustrated using R and WEKA facilitating better practical insight as well as implementation. Aimed at researchers in the theory and applications of machine learning, this book offers a solid basis for conducting performance evaluations of algorithms in practical settings.
Article
An important component of many data mining projects is finding a good classification algorithm, a process that requires very careful thought about experimental design. If not done very carefully, comparative studies of classification and other types of algorithms can easily result in statistically invalid conclusions. This is especially true when one is using data mining techniques to analyze very large databases, which inevitably contain some statistically unlikely data. This paper describes several phenomena that can, if ignored, invalidate an experimental comparison. These phenomena and the conclusions that follow apply not only to classification, but to computational experiments in almost any aspect of data mining. The paper also discusses why comparative analysis is more important in evaluating some types of algorithms than for others, and provides some suggestions about how to avoid the pitfalls suffered by many experimental studies.
Conference Paper
While all systems have non-functional requirements (NFRs), these may not be explicitly stated in a formal requirements specification. Furthermore, NFRs may also be externally imposed via government regulations or industry standards. As some NFRs represent emergent system properties, those NFRs require appropriate analysis and design efforts to ensure they are met. When the specified NFRs are not met, projects incur costly re-work to correct the issues. The goal of our research is to aid analysts in more effectively extracting relevant non-functional requirements from available unconstrained natural language documents through automated natural language processing. Specifically, we examine which document types (data use agreements, install manuals, regulations, requests for proposals, requirements specifications, and user manuals) contain NFRs categorized into 14 NFR categories (e.g. capacity, reliability, and security). We measure how effectively we can identify and classify NFR statements within these documents. In each of the documents evaluated, we found NFRs present. Using a word vector representation of the NFRs, a support vector machine algorithm performed twice as effectively as a multinomial naïve Bayes classifier given the same input. Our k-nearest neighbor classifier with a unique distance metric had an F1 measure of 0.54, outperforming in our experiments the optimal naïve Bayes classifier, which had an F1 measure of 0.32. We also found that stop word lists beyond common determiners had minimal performance effect.
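A minimal version of the classification setup compared in that study pairs a word-vector representation (here TF-IDF, as a stand-in) with an SVM and a multinomial naive Bayes model. The example sentences and labels below are invented; the real experiments use 14 NFR categories and cross-validated F1 scores.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Invented examples; the study classifies sentences into 14 NFR categories.
    sentences = ["The system shall encrypt all stored patient records",
                 "The portal must support 500 concurrent users",
                 "Response time shall not exceed two seconds"]
    labels = ["security", "capacity", "performance"]

    svm = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(sentences, labels)
    nb = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(sentences, labels)

    # With a realistic corpus, the two pipelines would be compared via
    # cross-validated macro-F1, as in the study.
    print(svm.predict(["The system shall encrypt all backups"]))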
Book
Agile requirements: discovering what your users really want. With this book, you will learn to write flexible, quick, and practical requirements that work; save time and develop better software that meets users' needs; gather user stories, even when you can't talk to users; understand how user stories work and how they differ from use cases, scenarios, and traditional requirements; and leverage user stories as part of planning, scheduling, estimating, and testing. Ideal for Extreme Programming, Scrum, or any other agile methodology. Thoroughly reviewed and eagerly anticipated by the agile community, User Stories Applied offers a requirements process that saves time, eliminates rework, and leads directly to better software. The best way to build software that meets users' needs is to begin with "user stories": simple, clear, brief descriptions of functionality that will be valuable to real users. In User Stories Applied, Mike Cohn provides you with a front-to-back blueprint for writing these user stories and weaving them into your development lifecycle. You'll learn what makes a great user story, and what makes a bad one. You'll discover practical ways to gather user stories, even when you can't speak with your users. Then, once you've compiled your user stories, Cohn shows how to organize them, prioritize them, and use them for planning, management, and testing. Topics include user role modeling (understanding what users have in common, and where they differ); gathering stories through user interviewing, questionnaires, observation, and workshops; working with managers, trainers, salespeople and other "proxies"; writing user stories for acceptance testing; and using stories to prioritize, set schedules, and estimate release costs. Includes end-of-chapter practice questions and exercises. User Stories Applied will be invaluable to every software developer, tester, analyst, and manager working with any agile method: XP, Scrum, or even your own home-grown approach.
Article
In most IT projects, software developers usually pay attention to functional requirements that satisfy business needs of the system. Non-functional requirements (NFR) such as performance, usability, security, etc. are usually handled ad-hoc during the system testing phase, when it is late and costly to fix problems. Due to the importance and criticality of NFR, I study the problem of modeling NFR for Software Product Lines (SPL), which adds yet an additional dimension of complexity. This paper will survey the software engineering literature, in search of a systematic way to analyze and design NFR, from the perspectives of the concept of commonality and variability of SPL and the characteristics of NFR. Finally, I will propose a methodology based on the extension of Product Line UML-Based Software Engineering (PLUS) techniques, for a unified and automated method to model NFR throughout all phases of SPL engineering.
Conference Paper
Information retrieval is an important application area of natural-language processing where one encounters the genuine challenge of processing large quantities of unrestricted natural-language text. This paper reports on the application of a few simple, yet robust and efficient noun-phrase analysis techniques to create better indexing phrases for information retrieval. In particular, we describe a hybrid approach to the extraction of meaningful (continuous or discontinuous) subcompounds from complex noun phrases using both corpus statistics and linguistic heuristics. Results of experiments show that indexing based on such extracted subcompounds improves both recall and precision in an information retrieval system. The noun-phrase analysis techniques are also potentially useful for book indexing and automatic thesaurus extraction.
Article
An analysis of data from 16 software development organizations reveals seven agile RE practices, along with their benefits and challenges. The rapidly changing business environment in which most organizations operate is challenging traditional requirements-engineering (RE) approaches. Software development organizations often must deal with requirements that tend to evolve quickly and become obsolete even before project completion. Rapid changes in competitive threats, stakeholder preferences, development technology, and time-to-market pressures make prespecified requirements inappropriate.
Article
The topic of this diploma thesis is "Requirements Engineering in Agile Software Development". The first part of the thesis deals with the two software engineering approaches and compares them. While traditional requirements engineering is more document-oriented, agile methods try to reduce documentation as far as possible. The thesis analyzes the differences and commonalities of the two approaches and shows how agile methods could benefit from requirements engineering techniques.
Refinement of user stories into backlog items: Linguistic structure and action verbs
  • Müter