Fig 1 - uploaded by Alexander LeClair
An overview of this paper.

Source publication
Conference Paper
Full-text available
Software Categorization is the task of organizing software into groups that broadly describe the behavior of the software, such as “editors” or “science.” Categorization plays an important role in several maintenance tasks, such as repository navigation and feature elicitation. Current approaches attempt to cast the problem as text classification,...

Context in source publication

Context 1
... argue that they are for the problem of categorization, especially given their ability to model embeddings. This paper has four major components as shown in the overview in Figure 1. First, in Section IV, we prepare a corpus comprised of C/C++ projects from the Debian packages repository, totaling over 1.5 million files and 6.6 million functions. ...

Citations

... Several approaches have been proposed that utilize neural networks for code classification. LeClair et al. [32] and Ohashi et al. [39] both use convolutional neural networks (CNN) as their model. LeClair et al. use a C-LSTM, a combination of CNN and recurrent neural networks, with the project name, function name, and function content as input. ...
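The excerpt above describes a model that takes the project name, function name, and function content together as input. A minimal sketch of how such inputs might be assembled into a single indexed token sequence follows; this is an illustrative reconstruction, not LeClair et al.'s actual preprocessing code, and the tokenizer, vocabulary, and example identifiers are all hypothetical.

```python
import re

def tokenize(text):
    # naive identifier tokenizer, sufficient for this sketch
    return re.findall(r"[A-Za-z_]\w*", text.lower())

def build_input(project_name, function_name, function_body, vocab):
    # concatenate the three fields with separator tokens, then map to indices
    tokens = (tokenize(project_name) + ["<sep>"] +
              tokenize(function_name) + ["<sep>"] +
              tokenize(function_body))
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

# hypothetical vocabulary; a real one would be built from the corpus
vocab = {"<unk>": 0, "<sep>": 1, "grep": 2, "match": 3, "line": 4, "pattern": 5}
seq = build_input("grep", "match", "return pattern in line", vocab)
print(seq)  # indexed sequence a CNN/LSTM front-end would consume
```

The resulting integer sequence is what an embedding layer of a C-LSTM-style model would typically receive.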
Article
Full-text available
One of the most time-consuming tasks for developers is the comprehension of new code bases. An effective approach to aid this process is to label source code files with meaningful annotations, which can help developers understand the content and functionality of a code base more quickly. However, most existing solutions for code annotation focus on project-level classification: manually labelling individual files is time-consuming, error-prone and hard to scale. The work presented in this paper aims to automate the annotation of files by leveraging project-level labels, and to use the file-level annotations to annotate items at larger levels of granularity, for example, packages and a whole project. We propose a novel approach to annotate source code files using a weak labelling approach and a subsequent hierarchical aggregation. We investigate whether this approach is effective in achieving multi-granular annotations of software projects, which can aid developers in understanding the content and functionalities of a code base more quickly. Our evaluation uses a combination of human assessment and automated metrics to evaluate the annotations' quality. Our approach correctly annotated 50% of files and more than 50% of packages. Moreover, the information captured at the file-level allowed us to identify, on average, three new relevant labels for any given project. We can conclude that the proposed approach is a convenient and promising way to generate noisy (not precise) annotations for files. Furthermore, hierarchical aggregation effectively preserves the information captured at file-level, and it can be propagated to packages and the overall project itself.
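The hierarchical aggregation idea in the abstract above can be sketched in a few lines: file-level weak labels are propagated upward by keeping the most frequent labels at each level of granularity. This is an illustrative simplification under assumed inputs, not the paper's implementation, and the package and label names are invented.

```python
from collections import Counter

def aggregate(labelled_items, top_k=2):
    # labelled_items: iterable of label lists; keep the top_k most frequent
    counts = Counter(label for labels in labelled_items for label in labels)
    return [label for label, _ in counts.most_common(top_k)]

# hypothetical file-level weak labels, grouped by package
package_a_files = [["database", "orm"], ["database", "cache"], ["database"]]
package_b_files = [["http", "auth"], ["http"]]

package_a = aggregate(package_a_files)       # file  -> package
package_b = aggregate(package_b_files)
project = aggregate([package_a, package_b])  # package -> project
print(package_a, package_b, project)
```

The same aggregation function is reused at each level, which is what makes the scheme hierarchical: noisy file labels survive to the package level only when they recur across files.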
... Therefore, any complex code element adds complexity to comprehension as a whole. The PyPI libraries can be categorized based on behavior and characteristics, which are implemented for different purposes [24]. Based on the PyPI classifiers page, the topic is one of the classifiers used for categorizing types of libraries based on their characteristics. ...
Article
A README file plays an essential role as the face of a software project and the initial point of contact for developers in Open Source Software (OSS) projects. The code snippet ranks among the most important content in the README file for demonstrating the usage of software and APIs. Being easy to comprehend, code snippets are preferred by clients who want to quickly understand software usage and features. However, proficient code snippets are sometimes found in README files. In this paper, we first investigate the prevalence of each competency level of Python code snippets in README files. Then, we analyze the relationships between the usage of proficient code snippets and the topics of libraries. From our empirical study on 1,620 README files of PyPI libraries, we find that developers mainly present basic elements (92%) in code snippets. However, developers are likely to present proficient elements in code snippets from topics about Application Framework, Quality Assurance, and User Interface. We therefore (i) encourage developers to mainly present basic code snippets in their README files to attract more newcomers, and (ii) suggest that clients try to understand proficient code snippets if they are adopting libraries from the previously mentioned topics.
... LeClair et al. [32] worked on a dataset of C/C++ projects from the Debian package repository. The dataset consists of 9,804 software projects divided into 75 categories. ...
Article
Context: GitHub is the world's most prominent host of source code, with more than 327M repositories. However, most of these repositories are not labelled, or are labelled inadequately, making it harder for users to find relevant projects. Various approaches to software application domain classification have been proposed over the past years. However, several of those approaches suffer from multiple issues, called antipatterns of software classification, that reduce their usability. Objective: In this paper, we propose a new taxonomy in the GitHub ecosystem, called GitRanking, starting from a well-structured data set composed of curated repositories annotated with topics. The main objective is to create a baseline methodology for software classification that is expandable, hierarchical, grounded in a knowledge base, and free of antipatterns. Method: We collected 121K topics from GitHub and used GitRanking to create a taxonomy of 301 ranked application domains. GitRanking (1) uses active sampling to ensure a minimal number of annotations to create the ranking; and (2) links each topic to Wikidata, reducing ambiguities and improving the reusability of the taxonomy. Furthermore, we adopt the conceived taxonomy in a classification task by considering a state-of-the-art classifier. Results: Our results show that GitRanking can effectively rank terms in a hierarchy according to how general or specific their meaning is. Furthermore, we show that GitRanking is a dynamically extensible method: it can currently accept further terms to be ranked with a minimum number of annotations. Concerning the classification task, we show that the model achieves an F1-score of 34%, with a precision of 54%. Conclusion: This paper is the first collective attempt at building a ground-up taxonomy of software domains. Our vision is that our taxonomy, and its extensibility, can be used to better and more precisely label software projects.
... Among its other uses, MSR makes it possible to empirically study otherwise subjective or external phenomena in combination with (or through) large-scale systematic mining of development artifacts. Fields such as green software engineering (Hindle, 2013; Pereira, Matalonga, Couto, Castor, Cabral, Carvalho, de Sousa and Fernandes, 2021), risk assessment (Choetkiertikul, Dam, Tran and Ghose, 2015; da Costa, McIntosh, Treude, Kulesza and Hassan, 2017; Choetkiertikul, Dam, Tran, Ghose and Grundy, 2018), and software classification (Howard, Gupta, Pollock and Vijay-Shanker, 2013; LeClair, Eberhart and McMillan, 2018; Sas and Capiluppi, 2022) have advanced noticeably due to MSR. In a similar fashion to these works, we hypothesize that the amount and diversity of costs-related information in cloud-based software project repositories is sufficient to produce meaningful insights. ...
Preprint
Full-text available
Context: The popularity of cloud computing as the primary platform for developing, deploying, and delivering software is largely driven by the promise of cost savings. Therefore, it is surprising that no empirical evidence has been collected to determine whether cost awareness permeates the development process and how it manifests in practice. Objective: This study aims to provide empirical evidence of cost awareness by mining open source repositories of cloud-based applications. The focus is on Infrastructure as Code artifacts that automate software (re)deployment on the cloud. Methods: A systematic search through 152,735 repositories resulted in the selection of 2,010 relevant ones. We then analyzed 538 relevant commits and 208 relevant issues using a combination of inductive and deductive coding. Results: The findings indicate that developers are not only concerned with the cost of their application deployments but also take actions to reduce these costs beyond selecting cheaper cloud services. We also identify research areas for future consideration. Conclusion: Although we focus on a particular Infrastructure as Code technology (Terraform), the findings can be applicable to cloud-based application development in general. The provided empirical grounding can serve developers seeking to reduce costs through service selection, resource allocation, deployment optimization, and other techniques.
... In [64], a decision tree-based classification method was used to classify source code related to sorting algorithms. LeClair et al. [65] mentioned that source code can be classified into six categories: games, admin, network, words, science, and usage. Xu et al. [66] used LSTM and CNN to identify vulnerabilities in source code. ...
... In addition, we reviewed a large body of literature on program code classification. We found that studies classify codes based on various types of meta-information of the code, including programming language [58][59][60][61], code tags [63], errors [8,28], and category [64,65]. To the best of our knowledge, no study has considered the algorithmic (structural) features of codes in the classification task. ...
Article
Full-text available
In software, an algorithm is a well-organized sequence of actions that provides the optimal way to complete a task. Algorithmic thinking is also essential to break down a problem and conceptualize solutions in a number of steps. The proper selection of an algorithm is pivotal to improving computational performance and software productivity, as well as to learning programming. That is, determining a suitable algorithm from a given code is widely relevant in software engineering and programming education. However, both humans and machines find it difficult to identify algorithms from code without any meta-information. This study aims to propose a program code classification model that uses a convolutional neural network (CNN) to classify codes based on the algorithm. First, program codes are transformed into a sequence of structural features (SFs). Second, SFs are transformed into a one-hot binary matrix using several procedures. Third, different structures and hyperparameters of the CNN model are fine-tuned to identify the best model for the code classification task. To do so, 61,614 real-world program codes of different types of algorithms collected from an online judge system are used to train, validate, and evaluate the model. Finally, the experimental results show that the proposed model can identify algorithms and classify program codes with a high percentage of accuracy. The average precision, recall, and F-measure scores of the best CNN model are 95.65%, 95.85%, and 95.70%, respectively, indicating that it outperforms other baseline models.
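The SF-to-matrix step described in the abstract above can be sketched as a plain one-hot encoding: each structural feature in the sequence becomes a binary row with a single 1. This is a minimal illustration under an assumed (hypothetical) SF vocabulary, not the paper's actual procedure.

```python
# hypothetical structural-feature vocabulary; the paper derives its own SF set
SF_VOCAB = ["for", "if", "while", "assign", "call"]

def one_hot_matrix(sf_sequence, vocab=SF_VOCAB):
    # one binary row per structural feature, exactly one 1 per row
    index = {sf: i for i, sf in enumerate(vocab)}
    matrix = []
    for sf in sf_sequence:
        row = [0] * len(vocab)
        row[index[sf]] = 1
        matrix.append(row)
    return matrix

m = one_hot_matrix(["for", "assign", "call"])
print(m)  # [[1, 0, 0, 0, 0], [0, 0, 0, 1, 0], [0, 0, 0, 0, 1]]
```

A CNN would then convolve over the rows of this matrix, treating the SF sequence the way a text CNN treats a token sequence.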
... MUDABlue [13], relies on Latent Semantic Indexing (LSI), an Information Retrieval (IR) technique, to automatically categorize software systems in open source software repositories. For the same task, LACT [27] uses Latent Dirichlet Allocation (LDA), and recently neural text classification with word embeddings has been used [14] to categorize similar software projects. ...
Preprint
Full-text available
Large Transformer models achieved the state-of-the-art status for Natural Language Understanding tasks and are increasingly becoming the baseline model architecture for modeling source code. Transformers are usually pre-trained on large unsupervised corpora, learning token representations and transformations relevant to modeling generally available text, and are then fine-tuned on a particular downstream task of interest. While fine-tuning is a tried-and-true method for adapting a model to a new domain -- for example, question-answering on a given topic -- generalization remains an ongoing challenge. In this paper, we explore and evaluate transformer model fine-tuning for personalization. In the context of generating unit tests for Java methods, we evaluate learning to personalize to a specific software project using several personalization techniques. We consider three key approaches: (i) custom fine-tuning, which allows all the model parameters to be tuned; (ii) lightweight fine-tuning, which freezes most of the model's parameters, allowing tuning of the token embeddings and softmax layer only or the final layer alone; (iii) prefix tuning, which keeps model parameters frozen, but optimizes a small project-specific prefix vector. Each of these techniques offers a trade-off in total compute cost and predictive performance, which we evaluate by code and task-specific metrics, training time, and total computational operations. We compare these fine-tuning strategies for code generation and discuss the potential generalization and cost benefits of each in various deployment scenarios.
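The three personalization strategies in the abstract above differ mainly in which parameter groups remain trainable. The following is a conceptual sketch of that selection logic only; the group names are hypothetical, and a real implementation would freeze parameters at the framework level (e.g. via `requires_grad` in PyTorch) rather than with a lookup like this.

```python
# hypothetical parameter groups of a transformer model
MODEL_GROUPS = ["token_embeddings", "encoder_layers", "final_layer", "softmax"]

def trainable_groups(strategy):
    if strategy == "custom":
        # custom fine-tuning: all parameters tuned
        return set(MODEL_GROUPS)
    if strategy == "lightweight":
        # lightweight fine-tuning: one variant tunes embeddings + softmax only
        # (the abstract also mentions a final-layer-only variant)
        return {"token_embeddings", "softmax"}
    if strategy == "prefix":
        # prefix tuning: model frozen; only a small new prefix vector trained
        return {"prefix_vector"}
    raise ValueError(f"unknown strategy: {strategy}")

for s in ("custom", "lightweight", "prefix"):
    print(s, sorted(trainable_groups(s)))
```

The trade-off the paper evaluates follows directly from the size of these sets: fewer trainable groups means lower compute and storage cost per project, at a potential cost in predictive performance.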
... These observations guide us to design a two-phase framework for automating troubleshooting guides that first identifies TSG components (Component Identification) and then parses them to extract constituents necessary for execution (Component Parsing). There has been significant research on text/code identification and parsing in prior work [32,65]. However, applying them directly to TSGs requires addressing some unique challenges: ...
Preprint
Incident management is a key aspect of operating large-scale cloud services. To aid with faster and efficient resolution of incidents, engineering teams document frequent troubleshooting steps in the form of Troubleshooting Guides (TSGs), to be used by on-call engineers (OCEs). However, TSGs are siloed, unstructured, and often incomplete, requiring developers to manually understand and execute necessary steps. This results in a plethora of issues such as on-call fatigue, reduced productivity, and human errors. In this work, we conduct a large-scale empirical study of over 4K+ TSGs mapped to 1000s of incidents and find that TSGs are widely used and help significantly reduce mitigation efforts. We then analyze feedback on TSGs provided by 400+ OCEs and propose a taxonomy of issues that highlights significant gaps in TSG quality. To alleviate these gaps, we investigate the automation of TSGs and propose AutoTSG -- a novel framework for automation of TSGs to executable workflows by combining machine learning and program synthesis. Our evaluation of AutoTSG on 50 TSGs shows the effectiveness in both identifying TSG statements (accuracy 0.89) and parsing them for execution (precision 0.94 and recall 0.91). Lastly, we survey ten Microsoft engineers and show the importance of TSG automation and the usefulness of AutoTSG.
... LeClair et al. [14] propose a dataset of C/C++ projects from the Debian package repository. The dataset consists of 9,804 software projects divided into 75 categories: many of these categories have only a few examples, and 19 are the same categories with different surface forms, more specifically 'contrib/X', where X is a category present in the list. ...
Preprint
Full-text available
GitHub is the world's largest host of source code, with more than 150M repositories. However, most of these repositories are not labeled or inadequately so, making it harder for users to find relevant projects. There have been various proposals for software application domain classification over the past years. However, these approaches lack a well-defined taxonomy that is hierarchical, grounded in a knowledge base, and free of irrelevant terms. This work proposes GitRanking, a framework for creating a classification ranked into discrete levels based on how general or specific their meaning is. We collected 121K topics from GitHub and considered 60% of the most frequent ones for the ranking. GitRanking 1) uses active sampling to ensure a minimal number of required annotations; and 2) links each topic to Wikidata, reducing ambiguities and improving the reusability of the taxonomy. Our results show that developers, when annotating their projects, avoid using terms with a high degree of specificity. This makes the finding and discovery of their projects more challenging for other users. Furthermore, we show that GitRanking can effectively rank terms according to their general or specific meaning. This ranking would be an essential asset for developers to build upon, allowing them to complement their annotations with more precise topics. Finally, we show that GitRanking is a dynamically extensible method: it can currently accept further terms to be ranked with a minimum number of annotations (~15). This paper is the first collective attempt to build a ground-up taxonomy of software domains.
... We used AST paths instead of source code as input, since the paths contain more structural information about the code than the source code text. AST paths are also the standard in code summarization work [4,20]. To understand more about AST paths, please refer to the Appendix Section A.3. ...
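The excerpt above treats AST paths as a structure-rich alternative to raw source text. A minimal sketch of path extraction follows, in the spirit of the code-summarization work cited there rather than any specific paper's implementation: each path connects two leaf tokens through the node types between them, via their lowest common ancestor. For brevity, only `Name` and `Constant` leaves are kept.

```python
import ast
import itertools

def leaf_paths(tree):
    # collect (root-to-leaf node-type list, leaf token) pairs via DFS
    leaves = []
    def walk(node, prefix):
        prefix = prefix + [type(node).__name__]
        if isinstance(node, ast.Name):
            leaves.append((prefix, node.id))
        elif isinstance(node, ast.Constant):
            leaves.append((prefix, repr(node.value)))
        else:
            for child in ast.iter_child_nodes(node):
                walk(child, prefix)
    walk(tree, [])
    return leaves

def ast_paths(code):
    leaves = leaf_paths(ast.parse(code))
    paths = []
    for (p1, t1), (p2, t2) in itertools.combinations(leaves, 2):
        i = 0  # length of the shared prefix; p1[i-1] is the common ancestor
        while i < min(len(p1), len(p2)) and p1[i] == p2[i]:
            i += 1
        # walk up from the first leaf, through the ancestor, down to the second
        path = list(reversed(p1[i:])) + p1[:i][-1:] + p2[i:]
        paths.append((t1, "/".join(path), t2))
    return paths

for p in ast_paths("x = y + 1"):
    print(p)
```

For `x = y + 1` this yields triples such as `('y', 'Name/BinOp/Constant', '1')`, encoding that `y` and `1` meet at a `BinOp` node, which is structural information a plain token sequence would not carry.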
... We group prior work by input type: (A) source code [18,19,9,20,21,22,23,14,24]; and (B) other project data (e.g., README files) [25,26,10,27,28,29,30,11], as we are interested in the classification task using semantic and structural information (which can be extracted from source code). Table 1 contains a list of the works divided by their approach. ...
... LeClair et al. [20] used a neural network approach. The authors use the project name, function name, and the function content as input to a C-LSTM [32], a combined model of convolutional and recurrent neural networks. ...
... • Consistency: lastly, we check whether the classification has any other issue that affects its consistency, such as duplicate or overlapping categories, categories with a bad surface form (e.g. in [20] there is 'contrib/math' and 'math'), or any other abnormality. ...
Preprint
Full-text available
Empirical results in software engineering have long started to show that findings are unlikely to be applicable to all software systems, or any domain: results need to be evaluated in specified contexts, and limited to the type of systems that they were extracted from. This is a known issue, and requires the establishment of a classification of software types. This paper makes two contributions: the first is to evaluate the quality of the current software classifications landscape. The second is to perform a case study showing how to create a classification of software types using a curated set of software systems. Our contributions show that existing, and very likely even new, classification attempts are deemed to fail for one or more issues, which we named the 'antipatterns' of software classification tasks. We collected 7 of these antipatterns that emerge from both our case study and the existing classifications. These antipatterns represent recurring issues in a classification, so we discuss practical ways to help researchers avoid these pitfalls. It becomes clear that classification attempts must also face the daunting task of formulating a taxonomy of software types, with the objective of establishing a hierarchy of categories in a classification.