Can Developers Prompt? A Controlled Experiment
for Code Documentation Generation
Hans-Alexander Kruse, Tim Puhlfürß, Walid Maalej
Universität Hamburg
Hamburg, Germany
hans-alexander.kruse@studium.uni-hamburg.de, tim.puhlfuerss@uni-hamburg.de, walid.maalej@uni-hamburg.de
Abstract—Large language models (LLMs) bear great potential
for automating tedious development tasks such as creating and
maintaining code documentation. However, it is unclear to what
extent developers can effectively prompt LLMs to create concise
and useful documentation. We report on a controlled experiment
with 20 professionals and 30 computer science students tasked
with code documentation generation for two Python functions.
The experimental group freely entered ad-hoc prompts in a Chat-
GPT-like extension of Visual Studio Code, while the control group
executed a predefined few-shot prompt. Our results reveal that
professionals and students were unaware of or unable to apply
prompt engineering techniques. Especially students perceived the
documentation produced from ad-hoc prompts as significantly
less readable, less concise, and less helpful than documentation
from prepared prompts. Some professionals produced higher
quality documentation by just including the keyword Docstring
in their ad-hoc prompts. While students desired more support in
formulating prompts, professionals appreciated the flexibility of
ad-hoc prompting. Participants in both groups rarely assessed the
output as perfect. Instead, they understood the tools as support to
iteratively refine the documentation. Further research is needed
to understand which prompting skills and preferences developers
have and which support they need for certain tasks.
Index Terms—Software Documentation, Large Language
Model, Program Comprehension, Developer Study, AI4SE
I. INTRODUCTION
Developers often overlook or ignore software documenta-
tion and generally assign it a low priority [1]. Yet, carefully
documenting code is an essential task in software engineering.
Up-to-date and high-quality documentation facilitates program
comprehension [2], accelerates developer onboarding [3], [4],
and mitigates technical debt [5]. Software documentation is
also central to software maintenance, as documentation often
requires updates and should evolve with software [6].
Researchers and tool vendors have thus investigated differ-
ent ways to automate documentation [7]. In particular, LLMs
designed to model and generate human language [8] hold
great potential in automating documentation tasks [9]. Among
popular LLM use cases, code summarization has recently
gained much attention [10]. However, the output of LLMs
usually depends on the user input, called prompt. Slightly
changing the prompt can lead to different results regarding
conciseness, language style, and content of generated text [11].
Several prompt engineering techniques have emerged to
optimize interactions with LLMs [12]–[15]. Few-shot prompt-
ing is one of the most prominent techniques and comprises
adding example outputs to a predefined prompt to specify
output requirements [16], [17]. This technique allows users
to optimize LLM responses while minimizing the number of
input messages sent to the model [10], [14].
Previous research has primarily focused on benchmarking
the performance of various models and prompt engineering
techniques for generating source code documentation [7]. Re-
cently, Ahmed and Devanbu concluded that common prompt
engineering techniques perform better than ad-hoc prompting
according to popular benchmark metrics [18]. However, stud-
ies focusing on the perspectives of developers when using
LLMs, e.g., to generate documentation, are still sparse. In
fact, a recent study showed that common metrics to evaluate
LLMs do not align with human evaluation to assess the quality
of generated documentation [19]. Hence, it remains uncer-
tain whether prompt engineering techniques meet developers’
requirements for generating and using code documentation.
Moreover, it is unclear whether developers prefer a flexible,
iterative interaction using chatbot-like ad-hoc prompting or a
rather transaction-like execution of predefined prompts.
This study takes a first step towards filling this gap, focusing
on two research questions:
RQ1: How well can developers prompt LLMs to generate
code documentation compared to a predefined prompt?
RQ2: How is the developer experience for ad-hoc prompting
compared to executing predefined prompts?
To answer these questions, we conducted a randomized
controlled experiment with 20 professional developers and 30
computer science students. The experiment involved gener-
ating code documentation using an LLM-powered integrated
development environment (IDE). The first group utilized a
Visual Studio Code (VS Code) extension based on GPT-4,
enabling participants to enter ad-hoc prompts for selected
code. The second group used a similar VS Code extension
that executed a predefined few-shot prompt to generate code
documentation via GPT-4 with a single click.
Overall, we observed that developers with less experience
require more guidance in phrasing an ad-hoc prompt, whereas
more experienced developers can generate fairly good doc-
umentation by including specific keywords like “Docstring”
in the prompts. Participants generally preferred transactional or
guided LLM interactions to create code documentation while
enjoying the flexibility of ad-hoc and iterative prompting. Our
results show that predefined prompts deliver code documentation of significantly higher quality than ad-hoc prompts, especially due to a consistent documentation format.

Fig. 1: Tasks of the between-subject experiment (experimental group, 25 participants: ad-hoc prompt; control group, 25 participants: predefined prompt). Part 1: answer demographic questions. Part 2, conducted twice with different Python functions: comprehend a code function, answer comprehension questions, generate code documentation, answer the comprehension questions again, and evaluate the generated documentation. Part 3: rate user experience.
We present the design of our study in Section II and report
on the results in Section III (RQ1) and Section IV (RQ2).
We then discuss the implication of our findings in Section V
and potential threats to validity in Section VI. Finally, Section
VII summarizes related work, and Section VIII concludes the
paper. We share our experiment results, including the IDE
extension, in our replication package [20].
II. METHODOLOGY
A. Experimental Design
We designed a controlled experiment in a laboratory setting
to compare developer usage of and experience with two
VS Code extensions employing different prompts for code
documentation generation. We followed a between-subject
design, with each participant testing only one of the two tools to avoid learning effects [21]. The ad-hoc prompt group created
prompts in an already available GPT-4-powered IDE exten-
sion, while the predefined prompt group interacted with our
extension. We randomly assigned participants to the groups,
with adjustments made to balance the occupations. Participants
were unaware of their group.
We offered offline and online versions of the experiment.
In the offline version, participants used our computer with VS
Code and the digital questionnaire displayed on two screens.
In the online version, we utilized Zoom for video conferencing,
sharing a questionnaire hyperlink, and screen-sharing our
VS Code window with external input control. Participants
employed Thinking Aloud to express thoughts during the task
[22]. To ensure a calm environment, we placed each participant in an otherwise empty room. The questionnaire included questions
about the code and the generated documentation. Additionally,
we collected the ad-hoc prompts for a later comparison with
the predefined prompt.
The experiment comprised three parts, as shown in Figure
1. In the first part, participants answered questions about
occupation, as well as Python and VS Code experience to
assess their ability to understand complex functions and use
the IDE. Furthermore, we could check whether these variables
have any effect on the observed behavior.
The second part focused on code comprehension and
documentation generation. It consisted of two rounds, each
involving a pre-selected Python function (Listing 1). Per
round, participants (1) first attempted to comprehend the code
without any documentation. This enabled us to analyze the
effect of the LLM-generated documentation on understanding
the code functions. Subsequently, (2) participants rated their
comprehension on a scale from 1 (very low) to 5 (very high),
and answered True/False questions about the code to further
check their understanding (Table III). Participants could also
state that they had insufficient information to answer a ques-
tion. Furthermore, (3) participants manually created a function
comment to externalize their understanding and better assess
the quality of the comment subsequently generated by the tool.
To guide the participants and shorten the completion time, we
provided the participants with a comment template adhering
to guidelines for clean Python code [23] and instructed them
to add only the function description without the explanations
of parameters. Participants then used the respective LLM-
powered IDE extension to generate code documentation. Af-
terward, (4) as documentation should serve to facilitate code
comprehension, they studied the generated comment and revis-
ited the True/False questions. This step provided insights into whether the documentation supported or even hindered comprehension.
Finally, (5) they rated the generated documentation on six
quality dimensions based on the human evaluation metrics by
Hu et al. [19].
The documentation quality dimensions each provided a
question and a five-point answer scale, addressing gram-
matical correctness, readability, missing relevant information,
the presence of unnecessary information, usefulness for developers,
and helpfulness for code comprehension. We made minor
adjustments to the original answers of two dimensions as we
found them overloaded for our goal. In particular, the answers
for grammatical correctness originally also included a rating of
fluency, and the original readability answers also concerned the
comment’s understandability. The lowest point of each scale
represents the negative pole, like “very low readability”, while
the highest point represents the positive pole, like “very low
amount of unnecessary information”.
In the third part of the experiment, participants assessed the
usability of the respective IDE extension. They completed the
standardized User Experience Questionnaire (UEQ) with 26
items, each associated with a seven-point answer scale and one
of six usability categories [24]. Finally, participants provided
comments on tool strengths and areas for improvement in two
free-text fields.
Listing 1: Python functions that participants had to compre-
hend and for which they generated documentation.
Function 1:

def string_from_vector_bool(data):
    return ",".join(str(int(i)) for i in data)

----------------------------------------------------

Function 2:

def _parse_date(date):
    if date is None:
        date = Timestamp()
    if isinstance(date, Timestamp):
        date = date.toLocal()
    d = dateutil.parser.parse(date)
    if d.tzinfo is None:
        d = d.replace(tzinfo=dateutil.tz.tzlocal())
    return d.astimezone(dateutil.tz.tzutc()).replace(tzinfo=None).isoformat()
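For illustration, a Docstring for Function 1 in the structured format targeted by the experiment could look as follows. This is a hedged sketch based on the format of the few-shot examples in Listing 2; the exact template text is not reproduced here.

def string_from_vector_bool(data):
    """
    Convert a vector of boolean values into a comma-separated string

    :param data: the vector of boolean values
    :type data: iterable
    :returns: a comma-separated string of 0s and 1s
    """
    return ",".join(str(int(i)) for i in data)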
B. Ethical Considerations
We followed the standard procedure of the ethics committee
of our department for assessing and mitigating risks regarding
ethics and data processing. Upon welcoming participants, we
explained the privacy policy, study purpose, upcoming tasks,
and data processing procedures. We designed the study to last
for approx. 30 minutes per participant to prevent exhaustion
and instructed participants not to seek perfect solutions to
reduce stress. We offered clarifications on tasks and ques-
tionnaire content without providing solutions. We minimized
observational bias by facing away while participants answered
the questionnaire and also encouraged negative feedback [25].
To express our gratitude for the participation, we raffled
vouchers among participants. We pseudonymized all published
data as far as possible while maintaining our study objective.
C. Technical Setup
1) Selection of Experiment Tasks: We selected the Python
source code for the experiment tasks from the open-source
control system Karabo [26]. Karabo is maintained by the
scientific facility European XFEL (EuXFEL), whose devel-
opers partially participated in the experiment. By choosing
this project and involving project insiders, we could assess
how their programming and domain experience supports their
code comprehension and documentation rating, especially in
comparison to outsiders.
Two authors conducted a thorough screening of available
Karabo functions and identified six candidates. All six were
utility functions, which we deemed comprehensible for exper-
iment participants not associated with EuXFEL. Our selection
criteria also included code documentation quality aspects
such as conciseness, completeness, and usefulness especially
relevant for the few-shot examples [27]. We independently
analyzed the candidates and categorized them into the com-
plexity levels easy, medium, and hard. Afterward, we resolved
categorization conflicts by discussion.
We selected one easy- and one medium-rated function for
the tasks of the experiment (Listing 1). The first function
converts a boolean vector into a comma-separated string.
Listing 2: Predefined few-shot prompt. Few-shots 2 and 3 are
comparable to 1 and included in the replication package [20].
For the following prompt, take into account these three
input/output pairs of functions and corresponding
appropriate comments:

Function 1:

def elapsed_tid(cls, reference, new):
    time_difference = new.toTimestamp() - reference.toTimestamp()
    return np.int64(time_difference * 1.0e6 // cls._period)

Comment 1:

"""
Calculate the elapsed trainId between reference
and newest timestamp
:param reference: the reference timestamp
:param new: the new timestamp
:type reference: Timestamp
:type new: Timestamp
:returns: elapsed trainId's between reference
    and new timestamp
"""

Function 2:
...

Generate a comment for the following function:
{FUNCTION_CONTENT}. Fill in this template:
{TEMPLATE}. Adhere to the appropriate comment
syntax for multi-line comments.
We considered the code easy to understand as only standard
Python operations were used within this short function. The
second function converts a given date to the ISO format. The
complexity of this function lies in the usage of multiple classes
and functions with abbreviated identifiers, and the lack of
inline comments. Hence, participants must make assumptions
based on identifier names and usage.
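To illustrate why we rated the first function as easy, the following minimal usage sketch shows its behavior with plain Python values (standard Python semantics only; the second function, in contrast, depends on Karabo's Timestamp class and dateutil and cannot be exercised in isolation):

>>> string_from_vector_bool([True, False, True])
'1,0,1'
>>> string_from_vector_bool([1, 0, 1, 1])
'1,0,1,1'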
Initially, we also selected a third function rated hard but
removed it after a pilot study with three software engineering
researchers, who reported difficulties with this function and
far exceeded the time limit. While we acknowledge that
LLMs might require more sophisticated prompts and context
information to generate appropriate documentation for more
complex code, our user study focused on how differently
skilled developers can and prefer to interact with an LLM for
the creative process of documentation generation. Therefore,
within the experiment’s time limit, all participants should
be able to phrase ad-hoc prompts that generate appropriate
comments for these functions.
2) LLM Selection and Prompt Construction: Research
and industry have introduced multiple LLMs with individual
strengths and limitations. For our study, we focused solely on
the GPT-4 model by OpenAI (version gpt-4-0613). We did
not aim to benchmark different models but rather to study developers' prompting skills, particularly for code documentation
generation. OpenAI’s models have been prominent in recent
research on documentation generation [28]. Furthermore, during our study period, GPT-4 demonstrated general superiority over other models [8], [19].

Fig. 2: VS Code extensions applied in the experiment. (a) Input field of the ad-hoc prompt tool. (b) "Generate a comment" action of the predefined prompt tool.
We defined the predefined few-shot prompt (Listing 2) based
on prompt engineering guidelines [14], [15], [29] and tested it
in a pilot study. Given GPT-4’s limited context window at the
time of the study, we applied a three-shot approach to specify
the required output format. We chose three distinct function-
comment pairs from the remaining previously selected Karabo
code candidates. Hence, these functions were from the same
domain and of similar complexity as the two used in the exper-
iment. They adhered to a structured documentation format with
precise and concise content. We instructed GPT-4 to generate a
comment for the selected source code that we passed as input
(FUNCTION CONTENT). To further enforce the clean code
format [23], we explicitly defined an output template within
the prompt (TEMPLATE) and requested adherence to Python’s
syntax for multi-line comments (Docstrings). Listing 2 only
contains example 1 of the few-shot prompt. Examples 2 and 3
are available in our replication package [20]. We acknowledge
that the prompt can be further optimized and generalized to
multiple programming languages and application domains. We
considered its performance and complexity sufficient for our
research goal.
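For concreteness, the following minimal Python sketch shows how such a few-shot prompt could be assembled and sent to GPT-4 via the OpenAI chat completions API. The function names, the placeholder template, and the prompt wrapper are illustrative; our actual extension is implemented in TypeScript and calls the API via the axios package (Section II-C4).

from openai import OpenAI  # assumes the OPENAI_API_KEY environment variable is set

# Illustrative placeholder for the Docstring template referenced in Listing 2.
TEMPLATE = '"""\n<one-line function description>\n\n:param ...:\n:returns:\n"""'

def build_prompt(few_shot_pairs, function_content):
    # Prepend the function/comment example pairs, then request a comment
    # for the selected function in the given template format.
    shots = "\n".join(
        f"Function {i}:\n{func}\nComment {i}:\n{comment}"
        for i, (func, comment) in enumerate(few_shot_pairs, start=1)
    )
    return (
        "For the following prompt, take into account these input/output pairs "
        f"of functions and corresponding appropriate comments:\n{shots}\n"
        f"Generate a comment for the following function:\n{function_content}\n"
        f"Fill in this template:\n{TEMPLATE}\n"
        "Adhere to the appropriate comment syntax for multi-line comments."
    )

def generate_comment(few_shot_pairs, function_content):
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4-0613",
        messages=[{"role": "user", "content": build_prompt(few_shot_pairs, function_content)}],
    )
    return response.choices[0].message.content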
3) Tool of the Ad-hoc Prompt Group: We focused on IDE
extensions as developers usually use IDEs for generating code
documentation. We chose VS Code for its extensive cus-
tomization capabilities through various extensions maintained
by an open-source community. During our study, the exten-
sion marketplace already offered multiple ChatGPT-like tools
that we examined. Additionally, VS Code supports multiple
programming languages, including Python.
We selected the ad-hoc prompt tool based on two criteria.
First, it needed to provide a user interface similar to the
ChatGPT interface, allowing users to engage in a conversa-
tion with the software. This approach involves at least one
self-written message to the model and copying and pasting
the generated comment from the chat interface to the code.
This criterion was crucial as we aimed to compare the user
experience for this classic Q&A-based interaction with the one
for the predefined prompt tool, which generates and inserts
code documentation with a button click. Second, the tool had
to support and behave like the original GPT-4 application
programming interface (API) to avoid confounding factors related to the use of different LLMs.
After a thorough testing of available extensions, we opted
for ChatGPT by zhang-renyang [30] in version 1.6.62. This
extension met all specified requirements. It offers a small
prompt field (Figure 2a) and a larger ChatGPT-like chat
window within the IDE. Our tests confirmed that it behaves like the original GPT-4 API. It was one of the most popular GPT extensions in the
marketplace during our study period.
4) Tool of the Predefined Prompt Group: Our design objec-
tive for the predefined prompt group was to minimize the steps
required for developers to generate a high-quality Docstring
comment for a specific code section. We determined that mark-
ing the code within the IDE’s code editor and selecting the
“Generate a comment” action from the context menu should
suffice, requiring three clicks to generate documentation (Fig-
ure 2b). Alternatively, the developer can choose this action
through VS Code’s command field. Hence, inexperienced and
experienced VS Code users can use the primary feature of the
tool with their preferred workflow. A secondary feature allows
developers to customize the predefined prompt via the “Edit
default prompt” action, enabling them to tailor the tool output
to their preferred format.
We followed the VS Code guidelines to create the Type-
Script-based extension [31] and utilized the axios package
to implement calls to the OpenAI API. Based on the input
prompt, the API returns a code comment generated by GPT-4.
The extension automatically inserts this documentation above
the previously selected code. To set up the extension, devel-
opers must enter their OpenAI API token into a configuration
field. This token is stored locally on the developer’s machine.
For the experiment, we pre-configured the tool with our token.
We used GPT-4 version gpt-4-0613 for both tools. Gener-
ating code documentation with this model took, on average,
four seconds. The VS Code version was 1.85. The predefined
prompt tool is included in our replication package [20] and
also available in the VS Code marketplace [32].
D. Experiment Execution and Participants
The experiment took place in October and November 2023,
involving 50 participants: 25 per ad-hoc and predefined prompt
group. 20 participants were professionals, including 12 soft-
ware engineers, four data scientists, and four with other soft-
ware development roles. 17 worked in the domains of scientific
computing and three in finance. These participants took the
expert perspective due to their professional experience with
Python programming and, in some cases, the application domain.
The remaining 30 participants were students in computer sci-
ence or related subjects, taking the newcomer perspective. We
recruited participants through personal contacts in academia
and industry to ensure that the participants had at least some
experience with Python and VS Code. We conducted 33
experiments online and 17 offline at the EuXFEL campus
in Schenefeld, Germany, with the first author leading the
experiment and the second author assisting.
On a scale from 1 (very low) to 5 (very high), the mean
value and standard deviation of the Python experience among
professionals were 3.7 (1.06) in the ad-hoc and 4.5 (0.71) in
the predefined prompt group. The values among students were
notably lower, with 2.53 (0.83) in the ad-hoc and 2.47 (1.19)
in the predefined prompt group. This difference confirms the
expert/newcomer setting.
All participants were familiar with VS Code, which was
important to mitigate biases due to the tool setup. Mean values
and standard deviations in the ad-hoc/predefined prompt group
among professionals were 2.9 (0.99) and 3.5 (1.18); among
students they were 3.07 (1.03) and 3.07 (1.28).
At the beginning of each session, we trained each participant
in using the respective tool to mitigate learning bias. We
introduced them to the tool features, generated documentation
for an example function, and answered tool-related questions.
Afterward, we started the experiment. Because they had to manually write prompts, ad-hoc prompt group participants completed the experiment in 23:53 minutes on average, while predefined
prompt group participants required only 19:58 minutes.
E. Data Analysis
Our experiment comprised quantitative and qualitative data
analyses. We report totals, mean values, and standard devia-
tions where applicable. Furthermore, we employed inferential
statistics to test for significant differences between the groups’
ratings of the six documentation quality dimensions. Depend-
ing on the statistical assumptions fulfilled by the respective
variables, we utilized the parametric Welch’s t test or the non-
parametric Mann-Whitney U (MWU) test [33], [34].
The assumptions for both tests were that the collected data
points were quantitative and independent. The first assumption
was fulfilled as we provided a five-point answer scale for
each question related to a quality dimension. The second
assumption was fulfilled as we conducted the experiment with
each participant individually, and we asked participants to not
share information with others to avoid external influences. The
third assumption of the Welch’s t test is a normal distribution
of data points, which we tested with the Shapiro-Wilk test
TABLE I: Frequency of successful ad-hoc prompts for both
tasks, subdivided by occupations.
Prompt pattern | All | Students | Professionals
Write a comment | 34 | 24 | 10
Write a Docstring | 12 | 3 | 9
Explain the function in a comment | 2 | 2 | 0
Write a Python-conform description | 1 | 1 | 0
Summarize the function in a comment | 1 | 0 | 1
[35]. If a normal distribution was not met for a specific
quality dimension, we applied the MWU test instead. Besides
reporting the p-values for statistical significance, we also
report the effect sizes to indicate practical significance. We
particularly applied Cliff's δ [36], which describes the overlap
of two groups of ordinal numbers. Values between 0.0 and
0.8 indicate a low to medium effect size, whereas values of 0.8 or higher represent a large effect size, meaning a large difference
between both groups.
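As a minimal sketch of this procedure (the rating arrays below are placeholders for the five-point ratings per quality dimension, and the exact implementation we used may differ), the test selection and effect-size computation could be written with scipy as follows:

from scipy.stats import shapiro, mannwhitneyu, ttest_ind

def cliffs_delta(x, y):
    # Cliff's delta: (#{x_i > y_j} - #{x_i < y_j}) / (|x| * |y|)
    greater = sum(1 for a in x for b in y if a > b)
    less = sum(1 for a in x for b in y if a < b)
    return (greater - less) / (len(x) * len(y))

def compare_groups(adhoc, predefined, alpha=0.05):
    # Welch's t test if both samples pass the Shapiro-Wilk normality check,
    # otherwise the non-parametric Mann-Whitney U test.
    normal = (shapiro(adhoc).pvalue > alpha
              and shapiro(predefined).pvalue > alpha)
    if normal:
        _, p = ttest_ind(adhoc, predefined, equal_var=False)
    else:
        _, p = mannwhitneyu(adhoc, predefined, alternative="two-sided")
    return p, cliffs_delta(adhoc, predefined)

# Example with placeholder ratings:
# p, delta = compare_groups([4, 5, 3, 4, 5], [5, 5, 4, 5, 4])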
To quantify the experience of developers using the ex-
periment tools, we utilized the 26 seven-point scales of the
UEQ [24] and its official data analysis tool [37]. This tool
provides descriptive (mean values and standard deviations) and
inferential statistics (t test) to compare user experience ratings
for the two studied IDE extensions.
Each of the 26 scales is related to one of six quality
categories. The first category, Attractiveness, represents the
pure valence of and overall satisfaction with a tool. Efficiency,
Perspicuity, and Dependability are the pragmatic, goal-directed
categories, and Stimulation and Novelty are hedonistic, not
goal-directed categories. We chose Efficiency, Perspicuity,
Dependability, and Novelty to measure tool-related experience,
as these categories focus on tool interactions.
To evaluate participants’ assessment of the quality of gen-
erated documentation, we used the Stimulation category as it
focuses on the output of a tool. We enhanced this analysis with
qualitative insights made by manually analyzing participants’
free-text answers to the questions “What did you like most
about the tool?” and “What would you change about the
tool?”, placed at the end of the questionnaire. Two authors
independently conducted Open Coding for all answers and
discussed their codes to achieve consensus [38]. Furthermore,
we report relevant remarks expressed by participants during
the experiment.
III. PROMPTING FOR CODE DOCUMENTATION (RQ1)
A. Patterns in Ad-hoc Prompts
The prompts entered by the ad-hoc prompt group varied in
detail and keywords. About half of the participants had to enter a revised prompt after their first attempt to generate a code comment failed. Among the failed attempts were prompts such
as “Explain the function” and “Describe what this function
does”, which summarized the code in longer prose text.
We categorized the 50 prompts that led to a successful
generation of code comments for both code functions into five
prompt patterns, listed in Table I. These patterns commonly in-
cluded documentation-specific keywords, i.e., comment (34x),
TABLE II: Ratings of the generated documentation along six quality dimensions (mean values and standard deviations of
5-point-scales) in ad-hoc and predefined prompt groups; for all participants, students, and professionals; for code functions 1
and 2. Higher ratings always imply better quality. Every third line presents the p-value of the MWU test and Cohen’s Delta.
Group | Subgroup | Function | Grammatical Correctness | Readability | Missing Information | Unnecessary Information | Usefulness | Helpfulness
Ad-hoc | All | 1 | 4.88 (0.33) | 4.12 (0.88) | 4.64 (0.76) | 3.28 (1.21) | 4.12 (0.83) | 4.52 (0.59)
Predefined | All | 1 | 4.88 (0.33) | 4.72 (0.54) | 4.36 (1.11) | 4.32 (1.25) | 4.44 (1.04) | 4.32 (1.07)
p-val. / δ | All | 1 | 1.0 / 0.0 | 0.005 / 0.82 | 0.35 / 0.29 | 0.0009 / 0.85 | 0.08 / 0.3 | 0.91 / 0.23
Ad-hoc | All | 2 | 4.48 (0.92) | 3.52 (1.39) | 3.92 (1.29) | 2.84 (1.60) | 3.52 (1.39) | 4.0 (1.08)
Predefined | All | 2 | 4.76 (0.66) | 4.68 (0.48) | 4.16 (0.85) | 4.68 (0.85) | 4.6 (0.76) | 4.6 (0.82)
p-val. / δ | All | 2 | 0.12 / 0.35 | 0.0004 / 1.12 | 0.78 / 0.22 | 0.00002 / 1.44 | 0.002 / 0.96 | 0.02 / 0.63
Ad-hoc | Students | 1 | 4.86 (0.35) | 4.0 (1.0) | 4.6 (0.63) | 3.33 (1.35) | 3.93 (0.88) | 4.46 (0.64)
Predefined | Students | 1 | 4.93 (0.26) | 4.66 (0.62) | 4.6 (0.83) | 4.86 (0.35) | 4.6 (0.63) | 4.6 (0.63)
p-val. / δ | Students | 1 | 0.58 / 0.22 | 0.03 / 0.8 | 0.76 / 0.0 | 0.0002 / 1.56 | 0.03 / 0.87 | 0.52 / 0.21
Ad-hoc | Students | 2 | 4.4 (1.06) | 3.2 (1.47) | 3.86 (1.30) | 2.66 (1.68) | 3.13 (1.41) | 3.66 (1.23)
Predefined | Students | 2 | 4.73 (0.80) | 4.6 (0.51) | 4.06 (0.96) | 4.86 (0.35) | 4.73 (0.46) | 4.6 (0.74)
p-val. / δ | Students | 2 | 0.13 / 0.36 | 0.002 / 1.27 | 0.84 / 0.17 | 0.0002 / 1.82 | 0.0008 / 1.53 | 0.02 / 0.92
Ad-hoc | Professionals | 1 | 4.9 (0.32) | 4.3 (0.67) | 4.7 (0.95) | 3.2 (1.03) | 4.4 (0.70) | 4.6 (0.52)
Predefined | Professionals | 1 | 4.8 (0.42) | 4.8 (0.42) | 4.0 (1.41) | 3.5 (1.65) | 4.1 (1.45) | 3.9 (1.45)
p-val. / δ | Professionals | 1 | 0.58 / 0.27 | 0.07 / 0.89 | 0.08 / 0.58 | 0.51 / 0.22 | 0.97 / 0.26 | 0.38 / 0.64
Ad-hoc | Professionals | 2 | 4.6 (0.70) | 4.0 (1.15) | 4.0 (1.33) | 3.1 (1.52) | 4.1 (1.20) | 4.5 (0.53)
Predefined | Professionals | 2 | 4.8 (0.42) | 4.8 (0.42) | 4.3 (0.67) | 4.4 (1.26) | 4.4 (1.07) | 4.6 (0.97)
p-val. / δ | Professionals | 2 | 0.58 / 0.35 | 0.06 / 0.92 | 0.9 / 0.28 | 0.07 / 0.93 | 0.47 / 0.26 | 0.28 / 0.13
Docstring (12), and Python-conform description (1). Also,
slightly longer prompts that included the comprehension-
focused terms explain (2) and summarize (1) led to appropriate
function comments. Professionals used the Docstring term
more often (9x) than students (3x), which is in line with the
higher Python experience of most participating professionals.
Successful prompts were generally short, with a mean length
and standard deviation of 7.1 (2.09) words among profession-
als and 6.7 (2.62) words among students. The shortest prompt
was “Generate docstring”. None of the participants enhanced
their prompts with examples or templates.
B. Perceived Quality of Generated Documentation
Participants rated the documentation generated by the ad-
hoc and predefined prompt tools based on six five-point-scale
questions. All participants were aware of the expected format
of a Python Docstring, as they were familiar with Python.
Moreover, before using the tool, they manually created a func-
tion comment based on a Docstring template. Table II displays
the ratings per group, subgroup (student vs. professional), code
function, and quality dimension, along with the p-values [33],
[34] and effect sizes [36].
We checked the statistical assumptions to determine the appropriate test (see Section II-E for details). We rejected the
null hypothesis of the Shapiro-Wilk test for all groups and
dimensions, indicating a non-normal data distribution, likely
due to small sample sizes. We chose the MWU test for all com-
parisons due to its suitability for non-normally distributed data
[34]. The null hypothesis of this test indicates no statistically
significant difference between the mean values of both groups
for a specific dimension. We rejected this null hypothesis if
the p-value was lower than 0.05. We noted that in all cases
when the null hypothesis was rejected, the effect size δ was
large (0.8 or higher), indicating practical significance [36].
Comparing all participants in the ad-hoc and predefined
prompt groups, we observed several statistically significant
differences. For function 1, we found significant differences
in the dimensions of Readability (4.12/4.72) and Unnecessary
Information (3.28/4.32). The predefined prompt tool consis-
tently provided concise Docstrings that participants perceived
as more readable and as containing less non-informative content
compared to the output of the ad-hoc prompt tool. For function
2, significant differences were found in four dimensions,
with the predefined prompt tool consistently receiving higher
ratings than the ad-hoc prompt tool. The differences in Read-
ability (3.52/4.68) and Unnecessary Information (2.84/4.68)
reinforced the conclusions drawn from function 1.
We attribute the ratings of Usefulness (3.52/4.6) to the
often excessive output length of the ad-hoc prompt tool.
Participants noted that longer, prose documentation could
impede development workflows, as these comments require
more time to read while providing little additional information.
Helpfulness was rated high in both groups (4.0/4.6), with some
participants suggesting that the explanations provided by both
tools were either too complicated or lacked the required level
of detail. Across both groups, participants rated Grammatical
Correctness and Missing Information very high.
When we compared documentation ratings between students
and professionals, we observed similar statistically significant
differences in the ad-hoc and predefined prompt groups for
students, while professionals did not exhibit significant differ-
ences. Among students, we observed significant differences
for function 1 in the dimensions Readability (4.0/4.66), Unnec-
essary Information (3.33/4.86), and Usefulness (3.93/4.6). The
different expectations regarding the documentation content
contributed to the disparity in Usefulness, with some students
preferring the longer summarizations by the ad-hoc prompt
tool, while most others expected a structured comment as
taught in university.
TABLE III: Count of correct answers for comprehension questions (Q1, Q2) for code functions F1 and F2 in the ad-hoc and
predefined prompt groups: before/after using the tool. Colors indicate if correct answers increased after using the tool.
ID | Question | All: Ad-hoc | All: Predef. | Students: Ad-hoc | Students: Predef. | Professionals: Ad-hoc | Professionals: Predef.
F1 Q1 | Does the function return a list of boolean values as its output? | 19/22 | 23/24 | 11/14 | 13/15 | 8/8 | 10/9
F1 Q2 | Does the function take an iterable as its input? | 23/21 | 24/20 | 14/13 | 14/12 | 9/8 | 10/8
F2 Q1 | Does the function modify the input date's time zone to UTC? | 10/22 | 14/18 | 5/13 | 8/11 | 5/9 | 6/7
F2 Q2 | Does the function parse JSON data into a dictionary structure? | 15/21 | 13/21 | 7/12 | 5/14 | 8/9 | 8/7
For function 2, students in the predefined prompt group
consistently provided significantly better ratings for Readabil-
ity (3.2/4.6), Unnecessary Information (2.66/4.86), Usefulness
(3.13/4.73), and Helpfulness (3.66/4.6) compared to students
in the ad-hoc prompt group. Unnecessary Information had the
highest differences, likely due to the unstructured and lengthy
comments generated via the ad-hoc prompts.
Among the professionals, we found no statistically signif-
icant differences between the ad-hoc and predefined prompt
groups. However, three dimensions showed large effect sizes
(≥0.8), indicating practical significance. These differences
concerned Readability for functions 1 (4.3/4.8) and 2 (4.0/4.8),
and Unnecessary Information for function 2 (3.1/4.4). This
indicates that professionals also found the comments of the ad-
hoc prompt tool to contain non-informative content, impairing
readability. Conversely, the ad-hoc prompt group provided a
more positive rating for the dimension Missing Information
(4.7/4.0) for function 1. Professionals expected more detailed
descriptions, which the concise comments by the predefined
prompt tool did not provide. Furthermore, as professionals
used the Docstring keyword more often than students in their
prompts (Table I), the ad-hoc prompts in these cases resulted
in similar documentation to the predefined prompt tool.
C. Code Comprehension
We analyzed how the LLM-generated documentation influ-
enced the comprehension of the code. Initially, participants
rated their perceived comprehension of the respective code
function without available documentation, on a scale from 1
(“not at all”) to 5 (“very well”). For function 1, the mean and
standard deviation for these ratings were 3.6 (0.96) by the ad-
hoc and 3.84 (1.11) by the predefined prompt group, indicating
a similar good understanding. The assessment for function 2
was lower, with 2.4 (0.96) by the ad-hoc and 2.92 (0.81) by the
predefined prompt group. Hence, the predefined prompt group
expressed slightly higher confidence in the comprehension
than the ad-hoc prompt group, despite similar stated Python
experience. Across both groups and functions, professionals
stated a higher comprehension than students, which aligns with
their higher Python experience.
We assessed participants’ actual comprehension through
True/False questions about the code (Table III). We counted
the number of participants who correctly answered these ques-
tions before and after generating the documentation. Partici-
pants could also choose the option “Not enough information
available to answer the question”, which we treated as an
incorrect answer.
Before using the tools, most participants in both groups an-
swered the two questions for function 1 correctly (Q1: 19+23;
Q2: 23+24), as expected for this short and easy-categorized
function. After using the tool, the correct answers for question
1 increased (Q1: 22+24), especially among students, while
the ones for question 2 slightly decreased (Q2: 21+20) for
both students and professionals. Thinking-aloud observations
revealed that many participants were confused by the less
technical documentation generated by both tools regarding this
question, as it did not include the term iterable.
The number of correct answers to function 2 before using
the tool was lower (Q1: 10+14; Q2: 15+13), reflecting its
higher complexity. In both groups, correct answers increased
after using the tools (Q1: 22+18; Q2: 21+21). Students had
the highest increases: in the ad-hoc prompt group for Q1 from
5 to 13, and in the predefined prompt group for Q2 from 5
to 14. The increase in comprehension among professionals
was marginal, except for the correct Q1 answers in the ad-hoc
prompt group, which increased from 5 to 9. This data shows
that the availability of documentation helped inexperienced
developers to better comprehend the code, while the effect on
professionals was mixed.
Answering RQ1, we conclude that developers were
able to generate code documentation using ad-hoc
prompts without prompt engineering techniques, often
by providing relevant documentation keywords. How-
ever, particularly less experienced students initially did
not use such keywords and generated lengthy, less
readable, and less useful code explanations instead.
This is reflected in the participants' ratings of the quality of the generated documentation. We observed statistically
significant differences exclusively among students,
who rated the readability, conciseness, usefulness, and
helpfulness of comments generated via the predefined
few-shot prompt higher than those created via ad-
hoc prompts. Although we observed large effect sizes
for some quality dimensions among professionals, this
subgroup was generally less critical as they often
generated documentation of the required quality and
format by using specific keywords.
We also conclude that documentation generated by
both tools helped (particularly students) comprehend
more complex code. The generated documentation
could also lead to confusion, even for professionals.
TABLE IV: Comparison of developer experience: Ratings of the six UEQ categories (mean values and standard deviations)
from the ad-hoc and predefined prompt groups, including the p-values (t test) and Cohen’s Delta.
Group | Subgroup | Attractiveness | Efficiency | Perspicuity | Dependability | Stimulation | Novelty
Ad-hoc | All | 1.01 (1.14) | 0.70 (1.19) | 1.25 (0.68) | 1.05 (1.05) | 0.82 (1.13) | 1.10 (1.15)
Predefined | All | 1.99 (0.63) | 1.98 (0.67) | 1.87 (0.47) | 1.84 (0.79) | 1.62 (0.83) | 1.18 (0.92)
p-val. / δ | All | 0.0006 / 1.06 | 0.0000 / 1.33 | 0.0005 / 0.71 | 0.004 / 0.85 | 0.007 / 0.81 | 0.79 / 0.08
Ad-hoc | Students | 0.64 (1.14) | 0.42 (1.0) | 1.33 (0.74) | 0.85 (1.15) | 0.6 (1.23) | 0.92 (1.31)
Predefined | Students | 1.94 (0.65) | 2.12 (0.68) | 1.98 (0.27) | 1.95 (0.87) | 1.6 (0.88) | 1.18 (0.85)
p-val. / δ | Students | 0.0009 / 1.4 | 0.0000 / 1.99 | 0.005 / 1.17 | 0.007 / 1.08 | 0.02 / 0.94 | 0.51 / 0.24
Ad-hoc | Professionals | 1.55 (0.94) | 1.13 (1.38) | 1.13 (0.58) | 1.35 (0.83) | 1.15 (0.93) | 1.38 (0.84)
Predefined | Professionals | 2.05 (0.64) | 1.78 (0.64) | 1.7 (0.65) | 1.68 (0.68) | 1.65 (0.78) | 1.18 (1.06)
p-val. / δ | Professionals | 0.18 / 0.62 | 0.2 / 0.6 | 0.05 / 0.93 | 0.35 / 0.43 | 0.21 / 0.58 | 0.65 / 0.21
IV. DEVELOPER EXPERIENCES WITH PROMPTING (RQ2)
A. Overall Experience
Participants rated both tools positively in all six categories
of the standardized UEQ [24], with values ranging from -3 to
+3 (Table IV). However, students rated the predefined prompt tool more positively in all user experience categories than the ad-
hoc prompt tool, with statistically significant differences and
large effect sizes in five categories. Professionals also rated
the predefined prompt tool more positively in five categories,
without significant differences but a large effect size in one
category. For the pure valence category Attractiveness (1.01
ad-hoc / 1.99 predefined prompt group), the predefined
prompt tool received higher ratings than the ad-hoc prompt
tool, with a large difference among students (0.64/1.94). This
indicates an overall better developer experience with the
predefined prompt tool.
B. Tool-Related Experience
The most significant difference between the groups was
in the category Efficiency (0.7/1.98), with the predefined
prompt tool receiving higher ratings, especially among stu-
dents (0.42/2.12). Participants praised the ad-hoc prompt tool
for its practicality of not having to open ChatGPT in the
browser (two students, one professional), flexibility in select-
ing relevant code (one professional), and custom prompts (one
professional). They noted issues with copy-pasting the tool
output to the code panel (six students, four professionals),
multiple tries sometimes required for sufficient results (one
student), the need to phrase a prompt (one professional), miss-
ing keyboard shortcuts (one professional), and the slow output
generation, which is primarily caused by the performance of
GPT-4 (one student). Participants who interacted with the
predefined prompt tool also noted the slow response (one
student, two professionals), but expressed no further positive
or negative comments.
The ratings for Perspicuity (1.25/1.87) were higher for the
predefined prompt tool, with large effect sizes for students and
professionals. While some participants of the ad-hoc prompt
group participants found their tool easy to use (four students,
one professional), others reported that they would find it
easier to enter the prompt in a separate IDE panel instead
of the pop-up at the top of the screen (one professional)
and that the right-click context menu, which listed the tool
features besides non-related IDE features, was obscure to them
(one professional). Multiple predefined prompt group partic-
ipants also found their tool easy to use (eight students, four
professionals), highlighting the ease of generating comments
without the need for further prompt input besides the code
and a short command (one professional). Nevertheless, they
assessed that the selection of relevant code could be facilitated
(one professional). Overall, this assessment aligns with our
observation that participants quickly understood both IDE
extensions but tended to prefer the simplicity of the predefined
prompt tool without having to create an effective prompt.
Ratings for Dependability (1.05/1.84) showed strongly di-
verging views among students (0.85/1.95). Participants praised
the ad-hoc prompt tool for supporting the comment creation
workflow (six students, two professionals) and code compre-
hension (three students, four professionals). They criticized
the lack of prompt templates (three students), that the tool did
not always generate code comments (two students, one profes-
sional), and the missing opportunity for iterative improvements
caused by the limitations of the OpenAI API at the time
of our study (one student, one professional). The predefined
prompt tool was also noted for facilitating the workflow
(three students, one professional) and code comprehension
(four students), as well as for its reliability (two profession-
als). However, participants suggested adding a progress bar
(one professional), previewing generated comments within the
function (one professional), enabling the configuration of the
required comment content (one student), adding a disclaimer
that the output might be incorrect (one student), implementing
a linter that alerts when generated comments become outdated
(one professional), and using a locally deployed API for
data security (one professional). Besides these improvement
comments, the predefined prompt group was overall satis-
fied regarding the Dependability aspects, whereas the ad-
hoc prompt group, and especially the students, required more
enhanced tool support.
Both tools achieved similar positive results concerning
Novelty (1.1/1.18), without significant differences. Interest-
ingly, the ad-hoc prompt tool received a higher rating from
professionals (1.38/1.18) for its chat-like interaction within
an IDE, which they considered more novel than clicking a
button for documentation generation. Participants expressed
no further comments regarding this UEQ category.
C. Documentation-Related Experience
For Stimulation (0.82/1.62), predefined prompt tool ratings
were higher, especially among students (0.6/1.6). Concerning
the output of the ad-hoc prompt tool, participants noted the
detailed explanations (two students, two professionals), the
mitigation of personal biases (one student), inspirational value
(one professional), and overall high quality (three profession-
als). Other participants criticized the output for containing un-
necessary information (five students, three professionals), lack
of detail regarding complex code (three students), and incon-
sistent format due to using prompts of varying qualities (two
students, one professional). The predefined prompt tool output
received positive ratings for its conciseness (four students, one
professional), overall high quality (three professionals), and
similar structure across all generated comments (one student,
one professional). Improvements were deemed necessary for
the parameter descriptions (one student, one professional),
line breaks (one professional), consistent punctuation (one
student), and lack of detail for complex code (one student,
one professional). These insights show that students were
especially unsatisfied with the ad-hoc prompt tool output,
while professionals were overall satisfied with both tools.
Answering RQ2, we conclude that the participants
overall preferred a tool that automates code docu-
mentation generation with a few clicks while offering
options to configure this process. On average, pro-
fessionals and students preferred executing predefined
prompts over creating ad-hoc prompts. The simplicity
and efficiency of a single button click to receive
consistently high-quality documentation were key fac-
tors for this assessment. While the predefined prompt
group praised their tool, some missed the flexibility
to adjust the documentation depth. Users of the ad-
hoc prompt tool appreciated this flexibility, which often
unintentionally resulted in longer and more explanatory
comments that may assist code comprehension.
V. DISCUSSION
We summarize the results of our experiment and discuss
potential implications for research and practice.
A. Developers are not Necessarily Prompting Experts
Our study indicates that developers are generally neither
skilled (RQ1) nor inclined (RQ2) to effectively prompt an
LLM in a way that it generates concise, readable code doc-
umentation. This partly aligns with previous studies showing
that optimized, predefined prompts outperform ad-hoc prompts
[18]. But it also partly contradicts the underlying assumption
that effective optimized prompts, such as few-shot prompts,
are trivial and can intuitively be created by developers in
their daily work. We observed in our experiment that partici-
pants interacted with the provided tools intuitively via natural
language queries. However, they were often disappointed by
the results as the tools provided outputs that were aligned
with their unspecific prompts. In fact, we even observed that the generated documentation can partly mislead developers during code comprehension.
The lack of prompting skills among developers is unsur-
prising, as this is not a standard skill taught in universities
or commonly practiced in software development. Therefore,
researchers should explore ways to help developers learn effective prompting
and to intuitively support conversations with LLMs, particu-
larly in domains like software engineering where the LLM
outcome has an impact on other users. One approach is to
compile a catalog of evaluated prompt templates for different
well-defined tasks (e.g., documentation of Python functions).
Recently, Torricelli et al. also found such templates beneficial
for LLM users [39]. Tool vendors could integrate the templates
into their tools. Each template could include instructions for
the LLM to ask clarifying and contextual questions, e.g.,
related to the expected documentation style [40]. Additionally,
educators could use these prompt templates, together with
underlying ideas and hints, to teach prompting skills in schools
and universities. Such templates should only provide guidance
and should not hinder human creativity in asking out-of-the-
box and follow-up questions to an LLM-powered tool.
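As a purely illustrative sketch (not part of our study artifacts), such a catalog could be exposed to tools as a mapping from task identifiers to parameterized prompt templates, for example:

# Hypothetical catalog; task identifiers and template wording are illustrative only.
PROMPT_CATALOG = {
    "python-function-docstring": (
        "Write a Docstring for the following Python function. "
        "Use reStructuredText fields (:param:, :type:, :returns:). "
        "If the expected documentation style is unclear, ask one clarifying "
        "question before answering.\n\nFunction:\n{function_content}"
    ),
}

def render_prompt(task_id, **kwargs):
    # Fill the selected template with task-specific content.
    return PROMPT_CATALOG[task_id].format(**kwargs)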
B. Generating Documentation With LLMs is an Iterative Task
Our results show that the participants rated the quality
of documentation generated via the predefined prompt as
high. However, many were not completely satisfied with the
documentation quality after this initial iteration. The first
iteration helped to comprehend the code, formulate the initial
comment version, and gather ideas to alter and extend the
documentation during further iterations. This indicates that
generating code documentation is rather an iterative task, not
to be fully automated without the feedback of developers.
To better understand and assist this task, additional research
should be conducted to identify and satisfy the preferences
of code documentation providers and consumers [40]–[42].
Personal aspects include the depth of explanations, tone of
the text, and lyrical style. Project aspects include the documentation style used [40] and the knowledge needs of code users.
This research can help vendors of artificial intelligence (AI)-
powered code documentation tools align the generated com-
ments with the requirements of their users. Thus, vendors
can focus on applying and optimizing predefined prompts,
which participants in our experiment preferred, while also
offering additional iterations to customize the output based
on preferences and needs. Furthermore, generating multiple
documentation versions at once might also stimulate developer
creativity in exploring possibilities to document code [43].
C. Evaluating Code Documentation Quality Requires Human-
Centered Quality Metrics
As Hu et al. [19] pointed out, current metrics used to
assess the quality of AI-generated content do not align with
human evaluation regarding code documentation generation.
For instance, the usefulness of generated content [42] should
be crucial in practice but is barely evaluated in common
LLM benchmarks. Our results suggest that the assessment of
code documentation quality depends on multiple factors [44],
such as the preferences of documentation consumers, the ratio
between documentation conciseness and code complexity, and
the availability of certain information and style that developers
expect to be included in the documentation. Current metrics
for automatically assessing the quality of generated texts
do not incorporate these aspects. Hence, while automatically
generating code documentation is fairly straightforward with
LLMs, an expert assessment is still required to ensure good
quality, e.g., during a code review. It remains unclear how far
this assessment can be fully automated, e.g., by other agents
or other LLMs. Future research should focus on creating and
testing quality metrics for generated documentation through
documentation analyses, expert surveys, and developer and
benchmark studies.
D. Better Code Comprehension Requires Support by Person-
alized LLM-Generated Explanations On The Fly
The results of our code comprehension tasks show that
LLMs can support developers in understanding code. Espe-
cially, the longer explanations provided by the ad-hoc prompt
tool in response to vague prompts helped participants under-
stand the given code function. However, such explanations
can also mislead developers if the generated output does not
consider the information needs [45] or provide the particular
knowledge nuggets needed for the comprehension task [46].
Therefore, similar to the work of Nam et al. [47], we
propose that vendors of IDE-integrated code comprehension
tools should offer multiple code-related aspects that tool users
can explore during the comprehension process. Examples
include the concepts mentioned in the code or the typical code
usage [47], but could also extend to explaining the code in
a personalized way. Researchers should survey developers to
identify additional aspects and test their relevance in practice.
E. Practitioners May Have Specific LLM Requirements
Participants in our study expressed multiple requirements
for an LLM-powered documentation generation tool. For ex-
ample, participants expressed concerns regarding security risks
associated with using closed-source, externally hosted models
and highlighted the extended time required for generating
comments via the LLM we used.
To address these issues, tool vendors should support a
variety of LLMs within a single tool, including smaller, open-
source models, to better meet the needs of organizations oper-
ating with sensitive data and under specific constraints. One of
the research challenges is that different models may respond
variably to the same prompt, impacting the consistency and
reliability of generated outputs.
VI. THREATS TO VALIDITY
A. Construct Validity
(T01) Our study exclusively employed human evaluation
techniques to assess code documentation quality, introducing a
mono-method bias [21]. Although automated metrics are used
in other studies, we opted for human evaluation due to its lim-
ited correlation with automated metrics [19], and considered
it beneficial for a more comprehensive understanding.
B. Internal Validity
(T02) We unintentionally excluded the Understandable
scale from our UEQ, which may compromise the Perspicuity
category's accuracy. To remedy this, we calculated Perspicu-
ity scores with and without the scale’s median value, finding
only marginal differences. Despite this omission, participant
feedback indicated an adequate understanding of the tool.
(T03) Customizability of the predefined prompt could have
positively influenced participant satisfaction and tool assess-
ment. However, allowing customization could have affected
the comparability of assessments among the predefined prompt
group. (T04) Lack of significant differences in professional
ratings may be attributed to the small sample size of ten pro-
fessionals per group. Nevertheless, descriptive statistics reveal
fewer differences among the professionals’ ratings compared
to students, likely due to their more targeted use of keywords
in ad-hoc prompts. (T05) A potential language barrier existed
given the non-native English-speaking background of most
participants. However, this was deemed a minor threat as all
were able to produce coherent prompts in English.
C. External Validity
(T06) Our controlled lab setup, where developers docu-
mented unfamiliar code, may not accurately represent typical
documentation practices. However, this setting was intended to
eliminate variability and enable statistical analysis. (T07) The
professionals in our study predominantly came from EuXFEL,
with specific expertise in Python, which might influence the
results due to familiarity with the experimental code. This
was seen as potentially beneficial, providing insight into how
domain knowledge of one experiment group might improve
their prompt creation. (T08) The findings’ generalizability is
limited, yet they corroborate and broaden the insights from
similar studies that involved different methodologies.
D. Conclusion Validity
(T09) The selection of utility functions as code candidates
might limit the applicability of our results to more complex
coding tasks. We chose these functions based on a pilot study
confirming the difficulty of comprehending challenging code
under time constraints. (T10) Variations in LLM outputs due
to inherent randomness [10] may have influenced assessments
in the predefined prompt group, though differences in outputs
were marginal. (T11) Dividing the experimental and control
groups into professionals (experts) and students (newcomers)
reduced overall sample sizes, weakening statistical power but
allowing for a more detailed analysis of differing perceptions
between these groups. The separation between experts and
newcomers was confirmed by the participants’ self-perceived
Python experience. Hence, we believe our findings can be
generalized to a broader population. Nevertheless, developers
with different expertise in conversational agents might behave
differently.
VII. RELATED WORK
A. Prompting Skills of Practitioners
Prompt engineering has been pivotal since the introduction
of few-shot prompting [16]. Ahmed et al. demonstrated its ef-
fectiveness in code summarization [18], [48]. They highlighted
that applying a few-shot prompt can outperform traditional
models like CodeBERT [49], while zero- and one-shot prompts
were less effective, aligning with the findings of Geng et al.
[50]. Meanwhile, subsequent techniques, like chain-of-thought
prompting [51] and active prompts [52], have emerged.
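As a brief illustration of the technique, the sketch below assembles a few-shot prompt for docstring generation from example input-output pairs. The instruction wording and example pairs are hypothetical and do not reproduce the predefined prompt used in our experiment.

```python
# Hypothetical sketch: assembling a few-shot prompt for docstring generation.
EXAMPLES = [
    ("def add(a, b):\n    return a + b",
     '"""Return the sum of a and b."""'),
    ("def is_even(n):\n    return n % 2 == 0",
     '"""Return True if n is even, otherwise False."""'),
]


def build_few_shot_prompt(target_code: str) -> str:
    """Combine an instruction, example pairs, and the target function."""
    parts = ["Write a Python docstring for the given function.\n"]
    for code, docstring in EXAMPLES:
        parts.append(f"Function:\n{code}\nDocstring:\n{docstring}\n")
    parts.append(f"Function:\n{target_code}\nDocstring:")
    return "\n".join(parts)


if __name__ == "__main__":
    target = "def clamp(x, lo, hi):\n    return max(lo, min(x, hi))"
    print(build_few_shot_prompt(target))
```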
Prompt engineering remains a skill requiring practice [53].
LLM providers and researchers have published guidelines
and prompt pattern catalogs for effective LLM interaction
across various domains, including healthcare [14], [15], [54],
[55]. Our study is among the first to examine if and how
well developers utilize prompt-related knowledge in software
engineering tasks.
B. LLM-Powered Developer Assistants
Recent studies on AI programming assistants show that
tools like GitHub Copilot [56] boost productivity, reduce
keystrokes, and assist in syntax recall [43], [57]. However,
users desire tool enhancements for direct feedback, personal-
ized output, and improved code context understanding [43].
Currently, they often avoid these tools due to unmet require-
ments and lack of control. For example, Copilot often fails
to grasp instructions for code adjustments unless precisely
specified [11]. However, tool shortcuts to optimize an initial
prompt can also nudge users to explore fewer prompts and
lead to less diverse output [39]. Our study confirms these findings: while our participants overall preferred the
interaction with and the output of the predefined prompt
tool, the more exploratory ad-hoc prompts often led to more
extensive output that supported code comprehension.
C. Code Documentation Generation in the IDE
In the last decade, researchers have developed various
methods for generating code documentation [7], including
template-based [58], [59], information retrieval [60], and, more
recently, deep learning techniques [49], [61]–[66]. Especially
deep learning approaches trained on large datasets have shown
high performance in code documentation generation and en-
abled better scalability and more complete results than the
other techniques [7]. We found many available IDE extensions
that use these AI models to facilitate developer workflows.
However, they were conceptualized as general-purpose tools
and lacked code documentation features. In the academic
realm, the VS Code extension Divinator was the most relevant
tool to our project [67]. It provides short code summaries in
multiple programming languages but has limited performance
and interactivity. With our study, we aimed to counter this lack
of IDE features related to code documentation generation.
D. Evaluating Documentation Quality
Empirical research on software documentation quality is
an active field that focuses on various artifacts, like API
reference documentation [40] or README files [68], and
the perspectives of documentation writers [1]. Studies on the
evaluation of AI-generated documentation usually focus on au-
tomated metrics like BLEU, ROUGE, and METEOR [29], [64],
[69]–[71]. Hu et al. compared such automated metrics with
human evaluations on six documentation quality dimensions
[19] and found that automated metrics often misalign with human judgment. Consequently, our study benefited from
their human-centered metric by capturing nuanced aspects in
the generated code documentation that automated approaches
may have missed.
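To illustrate why surface-overlap metrics can diverge from human judgment, the following simplified sketch computes a unigram-precision overlap (not the full BLEU metric) between a generated docstring and a reference: two semantically equivalent descriptions with different wording receive a low score. The example strings are hypothetical.

```python
# Simplified illustration: fraction of generated tokens that also occur in the
# reference. This is a BLEU-like surface-overlap measure, not the full metric.
def unigram_precision(generated: str, reference: str) -> float:
    gen_tokens = generated.lower().split()
    ref_tokens = set(reference.lower().split())
    if not gen_tokens:
        return 0.0
    hits = sum(1 for token in gen_tokens if token in ref_tokens)
    return hits / len(gen_tokens)


reference = "Return the list sorted in ascending order."
generated = "Sorts the given items from smallest to largest and returns them."

# Semantically equivalent docstrings, yet the surface overlap is low (~0.09).
print(round(unigram_precision(generated, reference), 2))
```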
VIII. CONCLUSION
Recently, many LLMs have emerged, offering the potential
to automate developer tasks, including creating code doc-
umentation. However, it remained unclear how effectively
developers could prompt LLMs to generate useful documen-
tation. We studied developers’ interactions with, and perceptions of, LLM-powered IDE extensions in a controlled experiment with professionals and computer science
students. To generate documentation for two Python functions,
the experimental group freely prompted an LLM, and the
control group applied a predefined few-shot prompt. Our
results revealed that students, who had relatively low Python
experience, preferred the guidance of the few-shot prompt over
ad-hoc prompting. They rated the documentation generated by
predefined prompts significantly higher in quality, particularly
regarding readability, conciseness, usefulness, and helpfulness.
Professionals were more adept than students at including
Python-specific keywords in their ad-hoc prompts, resulting in
the generation of higher-quality documentation. Consequently,
they enjoyed the flexibility of ad-hoc prompting more than
students, even if they did not apply prompt engineering
techniques. Overall, both types of LLM interactions improved
the code comprehension of the study participants, but the
participants often viewed the generated documentation as an
intermediate result that needed iterative improvement.
We hope our findings encourage researchers to replicate this
study [20] in diverse settings, aiming to improve developers’
prompting skills and AI-powered tool support.
ACKNOWLEDGEMENT
We thank all participants of our experiment. We acknowl-
edge the support of DASHH (Data Science in Hamburg - Helmholtz Graduate School for the Structure of Matter) under Grant No. HIDSS-0002.
REFERENCES
[1] E. Aghajani, C. Nagy, O. L. Vega-Márquez, M. Linares-Vásquez,
L. Moreno, G. Bavota, and M. Lanza, “Software documentation issues
unveiled,” in 2019 IEEE/ACM 41st International Conference on Software
Engineering. New York, NY, USA: IEEE, 2019, pp. 1199–1210.
[2] T. Roehm, R. Tiarks, R. Koschke, and W. Maalej, “How do professional
developers comprehend software?” in 2012 34th International Confer-
ence on Software Engineering. New York, NY, USA: IEEE, 2012, pp.
255–265.
[3] I. Steinmacher, C. Treude, and M. A. Gerosa, “Let me in: Guidelines
for the successful onboarding of newcomers to open source projects,”
IEEE Software, vol. 36, no. 4, pp. 41–49, 2019.
[4] C. Stanik, L. Montgomery, D. Martens, D. Fucci, and W. Maalej, “A
simple NLP-based approach to support onboarding and retention in
open source communities,” in 2018 IEEE International Conference on
Software Maintenance and Evolution, 2018, pp. 172–182.
[5] F. Zampetti, G. Fucci, A. Serebrenik, and M. Di Penta, “Self-admitted
technical debt practices: A comparison between industry and open-
source,” Empirical Software Engineering, vol. 26, no. 6, pp. 1–32, 2021.
[6] E. Aghajani, C. Nagy, M. Linares-Vásquez, L. Moreno, G. Bavota,
M. Lanza, and D. C. Shepherd, “Software documentation: The prac-
titioners’ perspective,” in Proceedings of the ACM/IEEE 42nd Interna-
tional Conference on Software Engineering. New York, NY, USA:
Association for Computing Machinery, 2020, p. 590–601.
[7] S. Rai, R. C. Belwal, and A. Gupta, “A review on source code docu-
mentation,” ACM Transactions on Intelligent Systems and Technology,
vol. 13, no. 5, Jun. 2022.
[8] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Ka-
mar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg, H. Nori, H. Palangi, M. T.
Ribeiro, and Y. Zhang, “Sparks of artificial general intelligence: Early
experiments with GPT-4,” 2023.
[9] C. Ebert and P. Louridas, “Generative AI for software practitioners,”
IEEE Software, vol. 40, no. 4, pp. 30–38, 2023.
[10] H. Tian, W. Lu, T. O. Li, X. Tang, S.-C. Cheung, J. Klein, and T. F.
Bissyandé, “Is ChatGPT the ultimate programming assistant – how far
is it?” 2023.
[11] M. Wermelinger, “Using GitHub Copilot to solve simple programming
problems,” in Proceedings of the 54th ACM Technical Symposium on
Computer Science Education V. 1. New York, NY, USA: Association
for Computing Machinery, 2023, p. 172–178.
[12] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre-
train, prompt, and predict: A systematic survey of prompting methods
in natural language processing,” ACM Computing Surveys, vol. 55, no. 9,
Jan. 2023.
[13] S. Schulhoff, M. Ilie, N. Balepur, K. Kahadze, A. Liu, C. Si, Y. Li,
A. Gupta, H. Han, S. Schulhoff, P. S. Dulepet, S. Vidyadhara, D. Ki,
S. Agrawal, C. Pham, G. Kroiz, F. Li, H. Tao, A. Srivastava, H. D.
Costa, S. Gupta, M. L. Rogers, I. Goncearenco, G. Sarli, I. Galynker,
D. Peskoff, M. Carpuat, J. White, S. Anadkat, A. Hoyle, and P. Resnik,
“The prompt report: A systematic survey of prompting techniques,”
2024.
[14] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar,
J. Spencer-Smith, and D. C. Schmidt, “A prompt pattern catalog to
enhance prompt engineering with ChatGPT,” 2023.
[15] OpenAI. (2024) Prompt engineering. OpenAI, L.L.C. [Online].
Available: https://platform.openai.com/docs/guides/prompt-engineering
[16] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal,
A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-
Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu,
C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess,
J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and
D. Amodei, “Language models are few-shot learners,” in Advances in
Neural Information Processing Systems, H. Larochelle, M. Ranzato,
R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Red Hook, NY,
USA: Curran Associates, Inc., 2020, pp. 1877–1901.
[17] R. Logan IV, I. Balazevic, E. Wallace, F. Petroni, S. Singh, and S. Riedel,
“Cutting down on prompts and parameters: Simple few-shot learning
with language models,” in Findings of the Association for Computational
Linguistics: ACL 2022, S. Muresan, P. Nakov, and A. Villavicencio, Eds.
Dublin, Ireland: Association for Computational Linguistics, May 2022,
pp. 2824–2835.
[18] T. Ahmed and P. Devanbu, “Few-shot training LLMs for project-
specific code-summarization,” in Proceedings of the 37th IEEE/ACM
International Conference on Automated Software Engineering. New
York, NY, USA: Association for Computing Machinery, 2023.
[19] X. Hu, Q. Chen, H. Wang, X. Xia, D. Lo, and T. Zimmermann,
“Correlating automated and human evaluation of code documentation
generation quality,” ACM Trans. Softw. Eng. Methodol., vol. 31, no. 4,
Jul. 2022.
[20] H.-A. Kruse, T. Puhlfürß, and W. Maalej, “Can developers prompt?
a controlled experiment for code documentation generation [replication
package],” 2024. [Online]. Available: https://zenodo.org/doi/10.5281/
zenodo.13127237
[21] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and
A. Wesslén, Experimentation in software engineering. Berlin, Hei-
delberg: Springer, 2012.
[22] O. Alhadreti and P. Mayhew, “Rethinking thinking aloud: A comparison
of three think-aloud protocols,” in Proceedings of the 2018 CHI Confer-
ence on Human Factors in Computing Systems. New York, NY, USA:
Association for Computing Machinery, 2018, p. 1–12.
[23] S. Kapil, Clean Python. Berkeley, CA, USA: Apress, 2019.
[24] B. Laugwitz, T. Held, and M. Schrepp, “Construction and evaluation of a
user experience questionnaire,” in HCI and Usability for Education and
Work, A. Holzinger, Ed. Berlin, Heidelberg: Springer Berlin Heidelberg,
2008, pp. 63–76.
[25] R. Macefield, “Usability studies and the Hawthorne effect,” J. Usability
Studies, vol. 2, no. 3, p. 145–154, May 2007.
[26] EuropeanXFEL. (2024) Karabo. GitHub. [Online]. Available: https:
//github.com/European-XFEL/Karabo/
[27] P. Rani, A. Blasi, N. Stulova, S. Panichella, A. Gorla, and O. Nierstrasz,
“A decade of code comment quality assessment: A systematic literature
review,” Journal of Systems and Software, vol. 195, p. 111515, 2023.
[28] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo,
J. Grundy, and H. Wang, “Large language models for software engi-
neering: A systematic literature review,” 2023.
[29] W. Sun, C. Fang, Y. You, Y. Miao, Y. Liu, Y. Li, G. Deng, S. Huang,
Y. Chen, Q. Zhang, H. Qian, Y. Liu, and Z. Chen, “Automatic code
summarization via ChatGPT: How far are we?” 2023.
[30] Z. Renyang. (2023) ChatGPT. Microsoft. [Online]. Available: https://
marketplace.visualstudio.com/items?itemName=zhang-renyang.chat-gpt
[31] Microsoft. (2024) Your first extension. Microsoft. [Online]. Available:
https://code.visualstudio.com/api/get-started/your-first-extension
[32] Re:DevTools. (2024) Code Docs AI. Microsoft. [On-
line]. Available: https://marketplace.visualstudio.com/items?itemName=
re-devtools.code-docs-ai
[33] B. Derrick, D. Toher, and P. White, “Why Welch’s test is Type I error
robust,” The Quantitative Methods for Psychology, vol. 12, no. 1, pp.
30–38, 2016.
[34] N. Nachar, “The Mann-Whitney U: A test for assessing whether two
independent samples come from the same distribution,” Tutorials in
Quantitative Methods for Psychology, vol. 4, no. 1, pp. 13–20, 2008.
[35] N. Mohd Razali and B. Yap, “Power comparisons of Shapiro-Wilk,
Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests,” Journal
of Statistical Modeling and Analytics, vol. 2, no. 1, pp. 21–33, 2011.
[36] J. Cohen, Statistical Power Analysis for the Behavioral Sciences. Rout-
ledge, 1988.
[37] UEQ-team. (2024) User experience questionnaire. UEQ-team. [Online].
Available: https://www.ueq-online.org
[38] M. A. Cascio, E. Lee, N. Vaudrin, and D. A. Freedman, “A team-
based approach to open coding: Considerations for creating intercoder
consensus,” Field Methods, vol. 31, no. 2, pp. 116–130, 2019.
[39] M. Torricelli, M. Martino, A. Baronchelli, and L. M. Aiello, “The role
of interface design on prompt-mediated creativity in generative AI,” in
Proceedings of the 16th ACM Web Science Conference. New York,
NY, USA: Association for Computing Machinery, 2024, p. 235–240.
[40] W. Maalej and M. P. Robillard, “Patterns of knowledge in API reference
documentation,” IEEE Transactions on Software Engineering, vol. 39,
no. 9, pp. 1264–1282, 2013.
[41] W. Maalej and H.-J. Happel, “A lightweight approach for knowledge
sharing in distributed software teams,” in Practical Aspects of Knowl-
edge Management, T. Yamaguchi, Ed. Berlin, Heidelberg: Springer
Berlin Heidelberg, 2008, pp. 14–25.
[42] W. Maalej, “From RSSE to BotSE: Potentials and challenges revisited
after 15 years,” in 2023 IEEE/ACM 5th International Workshop on Bots
in Software Engineering, 2023, pp. 19–22.
[43] J. T. Liang, C. Yang, and B. A. Myers, “A large-scale survey on the
usability of AI programming assistants: Successes and challenges,” in
Proceedings of the 46th IEEE/ACM International Conference on Soft-
ware Engineering. New York, NY, USA: Association for Computing
Machinery, 2024.
[44] H. Tang and S. Nadi, “Evaluating software documentation quality,” in
2023 IEEE/ACM 20th International Conference on Mining Software
Repositories, 2023, pp. 67–78.
[45] W. Maalej, R. Tiarks, T. Roehm, and R. Koschke, “On the comprehen-
sion of program comprehension,” ACM Trans. Softw. Eng. Methodol.,
vol. 23, no. 4, Sep. 2014.
[46] D. Fucci, A. Mollaalizadehbahnemiri, and W. Maalej, “On using ma-
chine learning to identify knowledge in API reference documentation,”
in Proceedings of the 2019 27th ACM Joint Meeting on European
Software Engineering Conference and Symposium on the Foundations
of Software Engineering. New York, NY, USA: Association for
Computing Machinery, 2019, p. 109–119.
[47] D. Nam, A. Macvean, V. Hellendoorn, B. Vasilescu, and B. Myers,
“Using an LLM to help with code understanding,” in Proceedings of
the IEEE/ACM 46th International Conference on Software Engineering.
New York, NY, USA: Association for Computing Machinery, 2024.
[48] T. Ahmed, K. S. Pai, P. Devanbu, and E. Barr, “Automatic semantic
augmentation of language model prompts (for code summarization),” in
Proceedings of the IEEE/ACM 46th International Conference on Soft-
ware Engineering. New York, NY, USA: Association for Computing
Machinery, 2024.
[49] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin,
T. Liu, D. Jiang, and M. Zhou, “CodeBERT: A pre-trained model for
programming and natural languages,” in Findings of the Association for
Computational Linguistics: EMNLP 2020, T. Cohn, Y. He, and Y. Liu,
Eds. Online: Association for Computational Linguistics, Nov. 2020,
pp. 1536–1547.
[50] M. Geng, S. Wang, D. Dong, H. Wang, G. Li, Z. Jin, X. Mao, and
X. Liao, “Large language models are few-shot summarizers: Multi-
intent comment generation via in-context learning,” in Proceedings of
the IEEE/ACM 46th International Conference on Software Engineering.
New York, NY, USA: Association for Computing Machinery, 2024.
[51] J. Wei, X. Wang, D. Schuurmans, M. Bosma, b. ichter, F. Xia, E. Chi,
Q. V. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in
large language models,” in Advances in Neural Information Processing
Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and
A. Oh, Eds., vol. 35. Curran Associates, Inc., 2022, pp. 24824–24837.
[52] S. Diao, P. Wang, Y. Lin, and T. Zhang, “Active prompting with chain-
of-thought for large language models,” 2024.
[53] J. Oppenlaender, R. Linder, and J. Silvennoinen, “Prompting AI art: An
investigation into the creative skill of prompt engineering,” 2023.
[54] B. Meskó, “Prompt engineering as an important emerging skill for
medical professionals: Tutorial,” Journal of Medical Internet Research,
vol. 25, p. e50638, Oct. 2023.
[55] T. F. Heston and C. Khun, “Prompt engineering in medical education,”
International Medical Education, vol. 2, no. 3, pp. 198–205, 2023.
[56] GitHub. (2024) GitHub Copilot. GitHub. [Online]. Available: https:
//github.com/features/copilot
[57] B. Zhang, P. Liang, X. Zhou, A. Ahmad, and M. Waseem, “Demystifying
practices, challenges and expected features of using GitHub Copilot,”
International Journal of Software Engineering and Knowledge Engi-
neering, vol. 33, no. 11n12, pp. 1653–1672, 2023.
[58] G. Sridhara, E. Hill, D. Muppaneni, L. Pollock, and K. Vijay-Shanker,
“Towards automatically generating summary comments for Java meth-
ods,” in Proceedings of the 25th IEEE/ACM International Conference
on Automated Software Engineering. New York, NY, USA: Association
for Computing Machinery, 2010, p. 43–52.
[59] P. W. McBurney and C. McMillan, “Automatic documentation genera-
tion via source code summarization of method context,” in Proceedings
of the 22nd International Conference on Program Comprehension.
New York, NY, USA: Association for Computing Machinery, 2014, p.
279–290.
[60] S. Haiduc, J. Aponte, and A. Marcus, “Supporting program compre-
hension with source code summarization,” in Proceedings of the 32nd
ACM/IEEE International Conference on Software Engineering - Volume
2. New York, NY, USA: Association for Computing Machinery, 2010,
p. 223–226.
[61] X. Hu, G. Li, X. Xia, D. Lo, and Z. Jin, “Deep code comment
generation,” in Proceedings of the 26th Conference on Program Compre-
hension. New York, NY, USA: Association for Computing Machinery,
2018, p. 200–210.
[62] U. Alon, O. Levy, and E. Yahav, “code2seq: Generating sequences
from structured representations of code,” in International Conference
on Learning Representations. Online: OpenReview, 2019.
[63] A. LeClair, S. Haque, L. Wu, and C. McMillan, “Improved code
summarization via a graph neural network,” in Proceedings of the 28th
International Conference on Program Comprehension. New York, NY,
USA: Association for Computing Machinery, 2020, p. 184–195.
[64] S. Gao, C. Gao, Y. He, J. Zeng, L. Nie, X. Xia, and M. Lyu, “Code
structure–guided transformer for source code summarization,” ACM
Trans. Softw. Eng. Methodol., vol. 32, no. 1, Feb. 2023.
[65] Y. Wan, Z. Zhao, M. Yang, G. Xu, H. Ying, J. Wu, and P. S. Yu,
“Improving automatic source code summarization via deep reinforce-
ment learning,” in Proceedings of the 33rd ACM/IEEE International
Conference on Automated Software Engineering. New York, NY, USA:
Association for Computing Machinery, 2018, p. 397–407.
[66] W. Wang, Y. Zhang, Z. Zeng, and G. Xu, “TranS^3: A transformer-based
framework for unifying code summarization and code search,” 2020.
[67] R. Durelli, V. Durelli, R. Bettio, D. Dias, and A. Goldman, “Divinator:
A visual studio code extension to source code summarization,” in Anais
do X Workshop de Visualização, Evolução e Manutenção de Software.
Porto Alegre, RS, Brasil: SBC, 2022, pp. 1–5.
[68] T. Puhlfürß, L. Montgomery, and W. Maalej, “An exploratory study
of documentation strategies for product features in popular GitHub
projects,” in 2022 IEEE International Conference on Software Main-
tenance and Evolution, 2022, pp. 379–383.
[69] S. Aljumah and L. Berriche, “Bi-LSTM-based neural source code
summarization,” Applied Sciences, vol. 12, no. 24, 2022.
[70] Y. Gao and C. Lyu, “M2TS: Multi-scale multi-modal approach based
on transformer for source code summarization,” in Proceedings of the
30th IEEE/ACM International Conference on Program Comprehension.
New York, NY, USA: Association for Computing Machinery, 2022, p.
24–35.
[71] H. Guo, X. Chen, Y. Huang, Y. Wang, X. Ding, Z. Zheng, X. Zhou,
and H.-N. Dai, “Snippet comment generation based on code context
expansion,” ACM Trans. Softw. Eng. Methodol., vol. 33, no. 1, Nov.
2023.