Yaroslav Golubev
JetBrains · JetBrains Research
Master of Technology
Research administrator at JetBrains Research
My main areas of interest include large-scale code analysis, statistics, machine learning in software engineering, and, as a hobby, linguistics and natural language processing.
September 2020 - August 2025
September 2018 - August 2020
September 2014 - August 2018
Publications (70)
Despite the availability of refactoring as a feature in popular IDEs, recent studies revealed that developers are reluctant to use it and still prefer to refactor their code manually. At JetBrains, our goal is to fully support refactoring features in IntelliJ-based IDEs and improve their adoption in practice. Therefore, we start by raising...
Code changes constitute one of the most important features of software evolution. Studying them can provide insights into the nature of software development and also lead to practical solutions: recommendations and automations of popular changes for developers. In our work, we developed a tool called PythonChangeMiner that makes it possible to discover code...
Code cloning plays a very important role in open-source software engineering. The presence of clones within a project may indicate a need for refactoring, and clones between projects are even more interesting, since code migration takes place and violations are possible. But how is code being copied? How prevalent is the process and on what level d...
In large-scale software systems, there are often no fully-fledged bug reports with human-written descriptions when an error occurs. In this case, developers rely on stack traces, i.e., series of function calls that led to the error. Since there can be tens and hundreds of thousands of them describing the same issue from different users, automatic d...
This paper introduces the human-curated Pandas-PlotBench dataset, designed to evaluate language models' effectiveness as assistants in visual data exploration. Our benchmark focuses on generating code for visualizing tabular data, such as a Pandas DataFrame, based on natural language instructions, complementing current evaluation tools and expanding...
In many MOOCs, whenever a student completes a programming task, they can see previous solutions of other students to find potentially different ways of solving the problem and to learn new coding constructs. However, a lot of MOOCs simply show the most recent solutions, disregarding their diversity or quality, and thus hindering the students' oppor...
Commit message generation (CMG) is a crucial task in software engineering that is challenging to evaluate correctly. When a CMG system is integrated into the IDEs and other products at JetBrains, we perform online evaluation based on user acceptance of the generated messages. However, performing online experiments with every change to a CMG system...
Students often struggle with solving programming problems when learning to code, especially when they have to do it online, with one of the most common disadvantages of working online being the lack of personalized help. This help can be provided as next-step hint generation, i.e., showing a student what specific small step they need to do next to...
LLMs change the landscape of software engineering, and the question arises: “How can we combine LLMs with traditional teaching approaches in computer science?” In this work, we propose to teach students in a low-code environment of code generation, developing not only their coding but also decomposition and prompting skills.
Nowadays, the fields of code and natural language processing are evolving rapidly. In particular, models become better at processing long context windows - supported context sizes have increased by orders of magnitude over the last few years. However, there is a shortage of benchmarks for code processing that go beyond a single file of context, whi...
The last several years saw the emergence of AI assistants for code, multi-purpose AI-based helpers in software engineering. Their quick development makes it necessary to better understand how specifically developers are using them, why they are not using them in certain parts of their development workflow, and what needs to be improved. In this w...
In recent years, several industrial solutions for the problem of multi-token code completion have appeared, each making a great advance in the area but mostly focusing on cloud-based runtime and avoiding working on the end user's device. In this work, we describe our approach for building a multi-token code completion feature for the JetBrains' Int...
In many MOOCs, whenever a student completes a programming task, they can see previous solutions of other students to find potentially different ways of solving the problem and learn new coding constructs. However, a lot of MOOCs simply show the most recent solutions, disregarding their diversity or quality. To solve this novel problem, we adapted t...
This paper adopts a cognitive psychology perspective to investigate the recurring mistakes in code resulting from the mental set (Einstellung) effect. The Einstellung effect is the tendency to approach problem-solving with a preconceived mindset, often overlooking better solutions that may be available. This effect can significantly impact creative...
Commit messages are crucial to software development, allowing developers to track changes and collaborate effectively. Despite their utility, most commit messages lack important information since writing high-quality commit messages is tedious and time-consuming. The active research on commit message generation (CMG) has not yet led to wide adoptio...
In this work, we developed an algorithm for detecting code quality issues in the templates of online programming tasks, validated it, and conducted an empirical study on the dataset of student solutions. The algorithm consists of analyzing recurring unfixed issues in solutions of different students, matching them with the code of the template, and...
In this paper, we present an approach for transferring an optimal lower size threshold for clone detection from one language to another by analyzing their clone distributions. We showcase this method by transferring the threshold from regular Python scripts to Jupyter notebooks for use in two JetBrains IDEs, Datalore and DataSpell.
Context: Refactoring is a critical task in software maintenance, and is usually performed to enforce better design and coding practices, while coping with design defects. The Extract Method refactoring is widely used for merging duplicate code fragments into a single new method. Several studies attempted to recommend Extract Method refactoring oppo...
Programming education should aim to provide students with a broad range of skills that they will later use while developing software. An important aspect in this is their ability to write code that is not only correct but also of high quality. Unfortunately, this is difficult to control in the setting of a massive open online course. In this paper,...
Competitive programming remains a very popular activity that combines both software engineering and education. In order to prepare and to practice, contestants use extensive archives of problems from past contests available on various competitive programming platforms. One way to make this process more effective is to provide an automatic tag syste...
In software engineering, different approaches and machine learning models leverage different types of data: source code, textual information, historical data. An important part of any project is its dependencies. The list of dependencies is relatively small but carries a lot of semantics with it, which can be used to compare projects or make judgem...
Integrated Development Environments (IDEs) are designed to make users more productive, as well as to make their work more comfortable. To achieve this, a lot of diverse tools are embedded into IDEs, and the developers of IDEs can employ anonymous usage logs to collect the data about how they are being used to improve them. A particularly important c...
The automatic collection of stack traces in bug tracking systems is an integral part of many software projects and their maintenance. However, such reports often contain a lot of duplicates, and the problem of de-duplicating them into groups arises. In this paper, we propose a new approach to solve the deduplication task and report on its use on th...
In recent years, Jupyter notebooks have grown in popularity in several domains of software engineering, such as data science, machine learning, and computer science education. Their popularity has to do with their rich features for presenting and visualizing data; however, recent studies show that notebooks also share a lot of drawbacks: high numbe...
In this paper, we present Lupa - a framework for large-scale analysis of the programming language usage. Lupa is a command line tool that uses the power of the IntelliJ Platform under the hood, which gives it access to powerful static analysis tools used in modern IDEs. The tool supports custom analyzers that process the rich concrete syntax tree o...
The task of finding the best developer to fix a bug is called bug triage. Most of the existing approaches consider the bug triage task as a classification problem; however, classification is not appropriate when the sets of classes change over time (as developers often do in a project). Furthermore, to the best of our knowledge, all the existing mo...
We have developed a plugin for IntelliJ IDEA called AntiCopyPaster that tracks the pasting of code inside the IDE and suggests appropriate Extract Method refactorings to combat the propagation of duplicates. To implement the plugin, we gathered a dataset of code fragments that should and should not be extracted, compiled a list of metrics of code t...
Jupyter notebooks represent a unique format for programming - a combination of code and Markdown with rich formatting, separated into individual cells. We propose to perceive a Jupyter Notebook cell as a simplified and raw version of a programming function. Similar to functions, Jupyter cells should strive to contain singular, self-contained action...
The popularity of cloud technologies has led to the development of a new type of applications that specifically target cloud environments. Such applications require a lot of cloud infrastructure to run, which brought about the Infrastructure as Code approach, where the infrastructure is also coded using a separate language in parallel to the main a...
Similarly to production code, code smells also occur in test code, where they are called test smells. Test smells have a detrimental effect not only on test code but also on the production code that is being tested. To date, the majority of the research on test smells has been focusing on programming languages such as Java and Scala. However, there...
In software engineering, a great number of new approaches are being actively researched, and a lot of tools are being developed based on them. These tools require a framework for their creation and an opportunity to be used by potential developers. Modern IDEs provide both. In this paper, we describe the main capabilities of the IntelliJ Platform t...
Many code changes that developers make in their projects are repeated and constitute recurrent change patterns. It is of interest to collect such patterns from the version history of open-source repositories and suggest the most useful of them as quick fixes. In this paper, we present Revizor - a tool for building custom plugins for PyCharm, a po...
Software development is a complex process that includes many different tasks besides just writing code. One of the aspects of software engineering is selecting and managing licenses for the given project. In this paper, we present Sorrel - a plugin for managing licenses and detecting potential incompatibilities for IntelliJ IDEA, a popular Java IDE...
A lot of problems in the field of software engineering - bug fixing, commit message generation, etc. - require analyzing not only the code itself but specifically code changes. Applying machine learning models to these tasks requires us to create numerical representations of the changes, i.e. embeddings. Recent studies demonstrate that the best way...
Recent trends in Web development demonstrate an increased interest in serverless applications, i.e. applications that utilize computational resources provided by cloud services on demand instead of requiring traditional server management. This approach enables better resource management while being scalable, reliable, and cost-effective. However, i...
Clone detection plays an important role in software engineering. Finding clones within a single project introduces possible refactoring opportunities, and between different projects it could be used for detecting code reuse or possible licensing violations. In this paper, we propose a modification to bag-of-tokens based clone detection that allows d...
Double-pulse femtosecond laser ablation of thin aluminum films and bulk aluminum counterintuitively demonstrated a strong (60-70%) rise in the thickness-dependent thresholds for inter-pulse delays of 20-200 ps, preventing material removal at above-threshold fluences. Time-resolved optical pump-probe reflection and double-pump transmission studies...
In this paper, we present Sosed, a tool for discovering similar software projects. We use fastText to compute the embeddings of sub-tokens into a dense space for 120,000 GitHub repositories in 200 languages. Then, we cluster embeddings to identify groups of semantically similar sub-tokens that reflect topics in source code. We use a dataset of 9 mil...
Software refactoring plays an important role in increasing code quality. One of the most popular refactoring types is the Move Method refactoring. It is usually applied when a method depends more on members of other classes than on its own original class. Several approaches have been proposed to recommend Move Method refactoring automatically. Most...
With an ever-increasing amount of open source software, the popularity of services like GitHub that facilitate code reuse, and common misconceptions about the licensing of open source software, the problem of license violations in the code is getting more and more prominent. In this study, we compile an extensive corpus of popular Java projects fro...
The process of formation and the characteristics of silver nanostructures created by pulsed laser annealing in the air are studied. Nanoparticles were obtained by way of irradiating thin silver films (62 and 175 nm) on the dielectric (glass) base with an excimer laser emission ($\lambda$ = 193 nm). Created nanostructures were studied using the metho...
The transition to a two-tier system of education and to new-generation standards has reduced the number of hours allotted to the general physics course. The solid-state physics course is no exception. The amount of material covered can be preserved by moving part of it to self-study during preparation for laboratory work. In...