Fig. 3: F1 score for code-only and code-description standard datasets. We observe a significant drop in performance from code-description to code-only in all cases, though the drop is smaller for nn+cd than for nn+w. The best overall performer is nn+cd.
Source publication
Software Categorization is the task of organizing software into groups that broadly describe the behavior of the software, such as “editors” or “science.” Categorization plays an important role in several maintenance tasks, such as repository navigation and feature elicitation. Current approaches attempt to cast the problem as text classification,...
Citations
... Therefore, any complex code element adds complexity to comprehension as a whole. PyPI libraries can be categorized based on their behavior and characteristics, as they are implemented for different purposes [24]. According to the PyPI classifiers page, the topic is one of the classifiers used for categorizing libraries by their characteristics. ...
A README file plays an essential role as the face of a software project and the initial point of contact for developers in Open Source Software (OSS) projects. Code snippets rank among the most important content in a README file for demonstrating the usage of software and APIs. Because they are easy to comprehend, code snippets are preferred by clients who want to quickly understand software usage and features. However, README files sometimes contain proficient (advanced-level) code snippets. In this paper, we first investigate the prevalence of each competency level of Python code snippets in README files. Then, we analyze the relationships between the usage of proficient code snippets and the topics of libraries. From our empirical study on 1,620 README files of PyPI libraries, we find that the elements developers present in code snippets are predominantly basic (92%). However, developers are likely to present proficient elements in code snippets from topics about Application Framework, Quality Assurance, and User Interface. We therefore (i) encourage developers to mainly present basic code snippets in their README files to attract more newcomers, and (ii) suggest that clients try to understand proficient code snippets when adopting libraries from the aforementioned topics.
... Among its other uses, MSR makes it possible to empirically study otherwise subjective or external phenomena in combination with (or through) large-scale systematic mining of development artifacts. Fields such as green software engineering (Hindle, 2013; Pereira, Matalonga, Couto, Castor, Cabral, Carvalho, de Sousa and Fernandes, 2021), risk assessment (Choetkiertikul, Dam, Tran and Ghose, 2015; da Costa, McIntosh, Treude, Kulesza and Hassan, 2017; Choetkiertikul, Dam, Tran, Ghose and Grundy, 2018), and software classification (Howard, Gupta, Pollock and Vijay-Shanker, 2013; LeClair, Eberhart and McMillan, 2018; Sas and Capiluppi, 2022) have advanced noticeably due to MSR. In a similar fashion to these works, we hypothesize that the amount and diversity of cost-related information in cloud-based software project repositories is sufficient to produce meaningful insights. ...
Context: The popularity of cloud computing as the primary platform for developing, deploying, and delivering software is largely driven by the promise of cost savings. Therefore, it is surprising that no empirical evidence has been collected to determine whether cost awareness permeates the development process and how it manifests in practice. Objective: This study aims to provide empirical evidence of cost awareness by mining open source repositories of cloud-based applications. The focus is on Infrastructure as Code artifacts that automate software (re)deployment on the cloud. Methods: A systematic search through 152,735 repositories resulted in the selection of 2,010 relevant ones. We then analyzed 538 relevant commits and 208 relevant issues using a combination of inductive and deductive coding. Results: The findings indicate that developers are not only concerned with the cost of their application deployments but also take actions to reduce these costs beyond selecting cheaper cloud services. We also identify research areas for future consideration. Conclusion: Although we focus on a particular Infrastructure as Code technology (Terraform), the findings can be applicable to cloud-based application development in general. The provided empirical grounding can serve developers seeking to reduce costs through service selection, resource allocation, deployment optimization, and other techniques.
... In [64], a decision-tree-based classification method was used to classify source code related to sorting algorithms. LeClair et al. [65] mentioned that source code can be classified into six categories: games, admin, network, words, science, and usage. Xu et al. [66] used LSTM and CNN models to identify vulnerabilities in source code. ...
... In addition, we reviewed a large body of literature on program code classification. We found that studies classify codes based on various types of meta-information of the code, including programming language [58][59][60][61], code tags [63], errors [8,28], and category [64,65]. To the best of our knowledge, no study has considered the algorithmic (structural) features of codes in the classification task. ...
In software, an algorithm is a well-organized sequence of actions that provides the optimal way to complete a task. Algorithmic thinking is also essential to break down a problem and conceptualize solutions in a series of steps. The proper selection of an algorithm is pivotal to improving computational performance and software productivity, as well as to learning programming. That is, determining a suitable algorithm from a given code is widely relevant in software engineering and programming education. However, both humans and machines find it difficult to identify algorithms from code without any meta-information. This study proposes a program code classification model that uses a convolutional neural network (CNN) to classify codes by algorithm. First, program codes are transformed into sequences of structural features (SFs). Second, the SFs are transformed into a one-hot binary matrix using several procedures. Third, different structures and hyperparameters of the CNN model are fine-tuned to identify the best model for the code classification task. To do so, 61,614 real-world program codes implementing different types of algorithms, collected from an online judge system, are used to train, validate, and evaluate the model. Finally, the experimental results show that the proposed model can identify algorithms and classify program codes with high accuracy. The average precision, recall, and F-measure scores of the best CNN model are 95.65%, 95.85%, and 95.70%, respectively, indicating that it outperforms other baseline models.
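To make the described encoding and classification steps concrete, here is a minimal sketch of the pipeline. The SF vocabulary, sequence length, and layer sizes are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch: structural features (SFs) -> one-hot matrix -> small 1-D CNN classifier.
import torch
import torch.nn as nn

SF_VOCAB = ["if", "for", "while", "assign", "call", "return"]  # hypothetical SF set
MAX_LEN = 200
NUM_CLASSES = 6

def one_hot_encode(sf_sequence):
    """Turn a sequence of structural features into a (MAX_LEN, |V|) 0/1 matrix."""
    matrix = torch.zeros(MAX_LEN, len(SF_VOCAB))
    for i, sf in enumerate(sf_sequence[:MAX_LEN]):
        matrix[i, SF_VOCAB.index(sf)] = 1.0
    return matrix

class SFConvClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(len(SF_VOCAB), 64, kernel_size=5, padding=2)
        self.pool = nn.AdaptiveMaxPool1d(1)    # global max pooling over the sequence
        self.fc = nn.Linear(64, NUM_CLASSES)

    def forward(self, x):                      # x: (batch, MAX_LEN, |V|)
        x = x.transpose(1, 2)                  # Conv1d expects (batch, channels, length)
        x = torch.relu(self.conv(x))
        x = self.pool(x).squeeze(-1)
        return self.fc(x)                      # logits over algorithm classes

model = SFConvClassifier()
batch = one_hot_encode(["for", "assign", "if", "call", "return"]).unsqueeze(0)
logits = model(batch)                          # shape: (1, NUM_CLASSES)
```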
... MUDABlue [13] relies on Latent Semantic Indexing (LSI), an Information Retrieval (IR) technique, to automatically categorize software systems in open source software repositories. For the same task, LACT [27] uses Latent Dirichlet Allocation (LDA), and more recently neural text classification with word embeddings has been used [14] to categorize similar software projects. ...
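As a rough illustration of the two IR-style techniques named in this snippet, the sketch below applies LSI and LDA to toy documents standing in for identifier and comment text mined from projects; it is not the cited systems' actual setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

docs = [
    "parse tokenize compile syntax tree",   # toy stand-ins for project text
    "render sprite frame game loop",
    "matrix vector gradient solver",
]

# LSI (as in MUDABlue): TF-IDF followed by truncated SVD.
tfidf = TfidfVectorizer().fit_transform(docs)
lsi_vectors = TruncatedSVD(n_components=2).fit_transform(tfidf)

# LDA (as in LACT): raw term counts followed by topic inference.
counts = CountVectorizer().fit_transform(docs)
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)
```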
Large Transformer models have achieved state-of-the-art status for Natural Language Understanding tasks and are increasingly becoming the baseline model architecture for modeling source code. Transformers are usually pre-trained on large unsupervised corpora, learning token representations and transformations relevant to modeling generally available text, and are then fine-tuned on a particular downstream task of interest. While fine-tuning is a tried-and-true method for adapting a model to a new domain (for example, question answering on a given topic), generalization remains an ongoing challenge. In this paper, we explore and evaluate transformer model fine-tuning for personalization. In the context of generating unit tests for Java methods, we evaluate learning to personalize to a specific software project using several personalization techniques. We consider three key approaches: (i) custom fine-tuning, which allows all the model parameters to be tuned; (ii) lightweight fine-tuning, which freezes most of the model's parameters, allowing tuning of the token embeddings and softmax layer only or of the final layer alone; (iii) prefix tuning, which keeps model parameters frozen but optimizes a small project-specific prefix vector. Each of these techniques offers a trade-off between total compute cost and predictive performance, which we evaluate by code and task-specific metrics, training time, and total computational operations. We compare these fine-tuning strategies for code generation and discuss the potential generalization and cost benefits of each in various deployment scenarios.
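The three strategies differ mainly in which parameters stay trainable. Below is a hedged sketch on a generic PyTorch transformer; the module layout and the simplified prefix handling are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

# A stand-in model: embeddings, a transformer body, and an output head.
model = nn.ModuleDict({
    "embed": nn.Embedding(50_000, 512),
    "body": nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=6),
    "head": nn.Linear(512, 50_000),
})

def custom_fine_tuning(m):
    for p in m.parameters():          # (i) every parameter stays trainable
        p.requires_grad = True

def lightweight_fine_tuning(m):
    for p in m.parameters():          # (ii) freeze everything...
        p.requires_grad = False
    for part in ("embed", "head"):    # ...except token embeddings and the output layer
        for p in m[part].parameters():
            p.requires_grad = True

class PrefixTuned(nn.Module):
    """(iii) frozen model plus a small trainable prefix.

    A faithful prefix-tuning implementation prepends the prefix to the
    attention keys/values of every layer; prepending to the input
    embeddings, as done here, is a simplification."""
    def __init__(self, m, prefix_len=10, dim=512):
        super().__init__()
        self.model = m
        for p in self.model.parameters():
            p.requires_grad = False
        self.prefix = nn.Parameter(torch.zeros(prefix_len, dim))  # only trainable tensor

    def forward(self, token_ids):     # token_ids: (batch, seq_len)
        x = self.model["embed"](token_ids)
        prefix = self.prefix.expand(x.size(0), -1, -1)
        x = torch.cat([prefix, x], dim=1)
        return self.model["head"](self.model["body"](x))
```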
... These observations guide us to design a two-phase framework for automating troubleshooting guides that first identifies TSG components (Component Identification) and then parses them to extract the constituents necessary for execution (Component Parsing). There has been significant research on text/code identification and parsing in prior work [32,65]. However, applying these techniques directly to TSGs requires addressing some unique challenges: ...
Incident management is a key aspect of operating large-scale cloud services. To aid faster and more efficient resolution of incidents, engineering teams document frequent troubleshooting steps in the form of Troubleshooting Guides (TSGs) to be used by on-call engineers (OCEs). However, TSGs are siloed, unstructured, and often incomplete, requiring developers to manually understand and execute the necessary steps. This results in a plethora of issues such as on-call fatigue, reduced productivity, and human error. In this work, we conduct a large-scale empirical study of over 4,000 TSGs mapped to thousands of incidents and find that TSGs are widely used and help significantly reduce mitigation effort. We then analyze feedback on TSGs provided by 400+ OCEs and propose a taxonomy of issues that highlights significant gaps in TSG quality. To alleviate these gaps, we investigate the automation of TSGs and propose AutoTSG, a novel framework for converting TSGs into executable workflows by combining machine learning and program synthesis. Our evaluation of AutoTSG on 50 TSGs shows its effectiveness in both identifying TSG statements (accuracy 0.89) and parsing them for execution (precision 0.94 and recall 0.91). Lastly, we survey ten Microsoft engineers and show the importance of TSG automation and the usefulness of AutoTSG.
... LeClair et al. [14] propose a dataset of C/C++ projects from the Debian package repository. The dataset consists of 9,804 software projects divided into 75 categories: many of these categories have only a few examples, and 19 duplicate existing categories under a different surface form, specifically 'contrib/X', where X is a category already present in the list. ...
GitHub is the world's largest host of source code, with more than 150M repositories. However, most of these repositories are not labeled, or are labeled inadequately, making it harder for users to find relevant projects. There have been various proposals for software application domain classification over the past years. However, these approaches lack a well-defined taxonomy that is hierarchical, grounded in a knowledge base, and free of irrelevant terms. This work proposes GitRanking, a framework for creating a classification ranked into discrete levels based on how general or specific the terms' meaning is. We collected 121K topics from GitHub and considered 60% of the most frequent ones for the ranking. GitRanking (1) uses active sampling to ensure a minimal number of required annotations, and (2) links each topic to Wikidata, reducing ambiguity and improving the reusability of the taxonomy. Our results show that developers, when annotating their projects, avoid using terms with a high degree of specificity. This makes the finding and discovery of their projects more challenging for other users. Furthermore, we show that GitRanking can effectively rank terms according to their general or specific meaning. This ranking would be an essential asset for developers to build upon, allowing them to complement their annotations with more precise topics. Finally, we show that GitRanking is a dynamically extensible method: it can currently accept further terms to be ranked with a minimal number of annotations (~15). This paper is the first collective attempt to build a ground-up taxonomy of software domains.
... We used AST paths instead of source code as input, since the paths contain more structural information about the code than the source code text. AST paths are also the standard in code summarization work [4,20]. To understand more about AST paths, please refer to the Appendix Section A.3. ...
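For readers unfamiliar with the representation, the sketch below extracts simple leaf-to-leaf AST paths from Python code using the standard ast module; it is a toy illustration in the spirit of code2vec/code2seq-style inputs, not the extractor used in the cited work.

```python
import ast
import itertools

def leaf_paths(source):
    """Yield (left leaf, AST path, right leaf) triples for simple leaves."""
    tree = ast.parse(source)
    leaves = []  # (leaf value, list of AST nodes from root to leaf)

    def walk(node, path):
        path = path + [node]
        if isinstance(node, (ast.Name, ast.Constant)):
            value = node.id if isinstance(node, ast.Name) else repr(node.value)
            leaves.append((value, path))
        for child in ast.iter_child_nodes(node):
            walk(child, path)

    walk(tree, [])
    for (v1, p1), (v2, p2) in itertools.combinations(leaves, 2):
        # Find the lowest common ancestor by comparing node identity.
        common = 0
        while common < min(len(p1), len(p2)) and p1[common] is p2[common]:
            common += 1
        up = [type(n).__name__ for n in reversed(p1[common:])]
        down = [type(n).__name__ for n in p2[common:]]
        yield v1, "^".join(up + [type(p1[common - 1]).__name__] + down), v2

for left, path, right in leaf_paths("def f(a, b):\n    return a + b"):
    print(left, path, right)   # e.g. a Name^BinOp^Name b
```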
... We group the surveyed works by their input: (A) source code [18,19,9,20,21,22,23,14,24]; and (B) other project data (e.g., README files) [25,26,10,27,28,29,30,11]. We make this distinction because we are interested in the classification task using semantic and structural information, which can be extracted from source code. Table 1 contains a list of the works divided by their approach. ...
... LeClair et al. [20] used a neural network approach. The authors use the project name, function name, and function content as input to a C-LSTM [32], a combined convolutional and recurrent neural network model. ...
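As a rough sketch of what such a combined model looks like, the following code wires a convolutional layer into an LSTM over token embeddings; the vocabulary size, dimensions, and class count are illustrative assumptions, not the configuration from [20] or [32].

```python
import torch
import torch.nn as nn

class CLSTMClassifier(nn.Module):
    """C-LSTM-style model: convolutional n-gram features feed an LSTM.
    Input would be the concatenated project name, function name, and body tokens."""
    def __init__(self, vocab_size=10_000, embed_dim=128, num_classes=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, 100, kernel_size=3)  # n-gram feature maps
        self.lstm = nn.LSTM(100, 100, batch_first=True)       # sequence over feature maps
        self.fc = nn.Linear(100, num_classes)

    def forward(self, tokens):                  # tokens: (batch, seq_len) int ids
        x = self.embed(tokens).transpose(1, 2)  # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x))            # (batch, 100, seq_len - 2)
        x = x.transpose(1, 2)                   # back to (batch, steps, features)
        _, (h, _) = self.lstm(x)
        return self.fc(h[-1])                   # classify from the final hidden state

model = CLSTMClassifier()
ids = torch.randint(0, 10_000, (1, 50))        # toy "name + code" token ids
logits = model(ids)                            # shape: (1, 6)
```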
... • Consistency: lastly, we check whether the classification has any other issues that affect its consistency, such as duplicate or overlapping categories, categories with a bad surface form (e.g., in [20] there are both 'contrib/math' and 'math'), or any other abnormality. ...
Empirical results in software engineering have long shown that findings are unlikely to apply to all software systems or to any domain: results need to be evaluated in specified contexts and limited to the type of systems they were extracted from. This is a known issue, and it requires the establishment of a classification of software types. This paper makes two contributions: the first is to evaluate the quality of the current software classification landscape; the second is to perform a case study showing how to create a classification of software types using a curated set of software systems. Our contributions show that existing, and very likely even new, classification attempts fail due to one or more issues, which we name the 'antipatterns' of software classification tasks. We collected 7 of these antipatterns emerging from both our case study and the existing classifications. These antipatterns represent recurring issues in a classification, so we discuss practical ways to help researchers avoid these pitfalls. It becomes clear that classification attempts must also face the daunting task of formulating a taxonomy of software types, with the objective of establishing a hierarchy of categories in a classification.
... In the last decade, both Machine Learning (ML) and Natural Language Processing have been applied to the study of source code (Ugurel, Krovetz & Giles, 2002), and the problem of classifying programming languages has also been broadly discussed (Van Dam & Zaytsev, 2016), using Neural Networks (Gilda, 2017), Bayesian learning (Khasnabish et al., 2014), Multinomial Naïve Bayes (Alreshedy et al., 2018), Convolutional Neural Networks (CNN) (Ohashi & Watanobe, 2019), Neural Text Classification (LeClair, Eberhart & McMillan, 2018), or even, alternatively, Speech Recognition techniques (Madani et al., 2010). However, these related works mostly concentrate on differentiating between programming languages rather than on distinguishing source code from natural-language text, although some of their concepts may be useful for the problem presented here. ...
In this paper, the authors present the outcome of a web-scraping system for the automated classification of source code. The system was prepared for a software developers' discussion forum, to find fragments of source code that were published without being marked as code snippets. The analyzer uses a Machine Learning binary classification model to differentiate between programming language source code and highly technical text about software. The analyzer model was prepared using an AutoML subsystem, without human intervention or fine-tuning, and its accuracy on the described problem exceeds 95%. The analyzer based on the automatically generated model has been deployed, and after the first year of continuous operation its False Positive Rate is less than 3%. A similar process may be introduced into document management within the software development process, where automatic tagging of and search for code or pseudo-code may be useful for archiving purposes.
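The paper's analyzer was generated by an AutoML subsystem, so the sketch below is only a hand-rolled stand-in for the general idea: a bag-of-words binary classifier separating code from technical prose, with the token pattern widened so punctuation-heavy code tokens survive vectorization. The training samples are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

samples = [
    ("for (int i = 0; i < n; i++) sum += a[i];", "code"),
    ("def parse(line): return line.strip().split(',')", "code"),
    ("The garbage collector pauses the application threads briefly.", "text"),
    ("Use dependency injection to decouple modules in large systems.", "text"),
]
texts, labels = zip(*samples)

# token_pattern=r"\S+" keeps operators and brackets as features
clf = make_pipeline(TfidfVectorizer(token_pattern=r"\S+"), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["while (queue.empty() == false) { queue.pop(); }"]))
```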
... Source code modeling converts the source code into semantic vectors by extracting its features, and these semantic vectors are the input to ASCS generation. In software engineering, building a high-quality source code model is the first step of many tasks, such as code classification [33][34][35], code search [25,36,37], code clone detection [38][39][40], and code commenting [41][42][43][44]. In recent years, almost all source code modeling has used machine learning; the survey in [45] can be used as a reference. ...
Source code summarization refers to the natural language description of the source code's function. It can help developers easily understand the semantics of the source code. We can think of the source code and the corresponding summarization as being symmetric. However, existing source code summaries are often mismatched with the source code, missing, or out of date. Manual source code summarization is inefficient and requires a lot of human effort. To overcome such situations, many studies have been conducted on Automatic Source Code Summarization (ASCS). Given a set of source code, ASCS techniques can automatically generate a summary described in natural language. In this paper, we review the development of ASCS technology. Almost all ASCS technology involves the following stages: source code modeling, code summarization generation, and quality evaluation. We categorize the existing ASCS techniques based on the above stages and analyze their advantages and shortcomings. We also map the development of the existing algorithms.