RoBuster: A Corpus Annotated with Risk of Bias Text
Spans in Randomized Controlled Trials
Anjani Dhrangadhariya, Roger Hilfiker, Karl Martin Sattelmayer, Nona Naderi,
Katia Giacomino, Rahel Caliesch, Julian Higgins, Stéphane Marchand-Maillet,
Henning Müller
Submitted to: Journal of Medical Internet Research
on: December 05, 2023
Disclaimer: © The authors. All rights reserved. This is a privileged document currently under peer-review/community review. Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.
Table of Contents
Original Manuscript
Supplementary Files
Figures: Figure 1, Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, Figure 7
Multimedia Appendixes: Multimedia Appendix 1, Multimedia Appendix 2, Multimedia Appendix 3, Multimedia Appendix 4
RoBuster: A Corpus Annotated with Risk of Bias Text Spans in
Randomized Controlled Trials
Anjani Dhrangadhariya1, 2, 3 MSc; Roger Hilfiker4 Dr rer nat; Karl Martin Sattelmayer3 Dr rer nat; Nona Naderi5 Dr rer
nat; Katia Giacomino3* MSc; Rahel Caliesch3* MSc; Julian Higgins6 Dr rer nat; Stéphane Marchand-Maillet1 Dr rer
nat; Henning Müller2, 7, 8 Dr rer nat
1Centre d'Informatique Universitaire, University of Geneva, Geneva, CH
2Informatics Institute, HES-SO Valais-Wallis, Sierre, CH
3Institute of Health Sciences, HES-SO Valais-Wallis, Leukerbad, CH
4IUFRS, University of Lausanne, Lausanne, CH
5CNRS, Laboratoire Interdisciplinaire des Sciences du Numérique, Université Paris-Saclay, Orsay, FR
6Population Health Sciences, Bristol Medical School, University of Bristol, Bristol, GB
7Medical Faculty, University of Geneva, Geneva, CH
8The Sense Innovation and Research Center, Sion, CH
*these authors contributed equally
Corresponding Author:
Anjani Dhrangadhariya MSc
Informatics Institute
HES-SO Valais-Wallis
Rue du technopole 3
Sierre
CH
(JMIR Preprints 05/12/2023:55127)
DOI: https://doi.org/10.2196/preprints.55127
Original Manuscript
Original Paper
MSc. Anjani Kirit Dhrangadhariya1,2,3*, Dr. Roger Hilfiker4, Dr. Karl Martin Sattelmayer3, Dr. Nona
Naderi5, MSc. Katia Giacomino3†, MSc. Rahel Caliesch3†, Dr. Julian PT Higgins6, Prof. Dr. Stéphane
Marchand-Maillet1, Prof. Dr. Henning Müller1,2,7,8
1*Centre d'Informatique Universitaire, University of Geneva, Geneva, Switzerland.
2Informatics Institute, HES-SO Valais-Wallis, Sierre, Switzerland.
3Institute of Health Sciences, HES-SO Valais-Wallis, Leukerbad, Switzerland.
4IUFRS, University of Lausanne, Lausanne, Switzerland.
5CNRS, Laboratoire Interdisciplinaire des Sciences du Numérique, Université Paris-Saclay, Orsay, France.
6Population Health Sciences, Bristol Medical School, University of Bristol, Bristol, United Kingdom.
7The Sense Innovation and Research Center, Sion and Lausanne, Switzerland.
8Medical Faculty, University of Geneva, Geneva, Switzerland.
*Corresponding author(s). E-mail(s): anjani.dhrangadhariya@hevs.ch;
Contributing authors: roger.hilfiker@proton.me;
martin.sattelmayer@hevs.ch; nona.naderi@lisn.fr;
katia.giacomino@hevs.ch; rahel.caliesch@hevs.ch;
julian.higgins@bristol.ac.uk; stephane.marchand-maillet@unige.ch;
henning.mueller@hevs.ch;
†These authors contributed equally to this work.
AKD conceived the presented idea and devised the project. AKD, RH, KMS, KG, RC and NN together
developed the visual annotation guidelines. RH annotated the corpus. KMS annotated the documents
required for measuring inter-annotator agreement. AKD processed the corpus, performed the analysis, drafted
the manuscript, and designed the figures. RH, KMS and NN helped resolve annotation conflicts and improve
the visual annotation guidelines. RH helped perform statistical calculations. JPTH provided critical feedback
on the visual annotation placards. SMM provided guidance on the initial stages of the project and the final
manuscript. HM and RH supervised the project and arranged the funding for this work. All authors provided
critical feedback on the manuscript.
RoBuster: A Corpus Annotated with Risk of Bias Text Spans in
Randomized Controlled Trials
Abstract
Background: Risk of bias (RoB) assessment of randomized clinical trials (RCTs) is vital to
answering systematic review questions accurately. Manual RoB assessment for hundreds of RCTs is
a cognitively demanding and lengthy process. Automation has the potential to assist reviewers in
rapidly identifying text descriptions in RCTs that indicate potential risks of bias. However, no RoB text span-annotated corpus is available for fine-tuning or evaluating large language models (LLMs), and there are no established guidelines for annotating RoB spans in RCTs.
Objective: The revised Cochrane RoB Assessment 2 (RoB 2) tool provides comprehensive
guidelines for RoB assessment; however, due to the inherent subjectivity of this tool, it cannot be
directly used as RoB annotation guidelines. Our objective was to develop precise RoB text span
annotation instructions that could address this subjectivity and thus aid the corpus annotation.
Methods: We leveraged RoB 2 guidelines to develop visual instructional placards that serve as text
annotation guidelines for RoB spans and risk judgments. Expert annotators employed these visual
placards to annotate a dataset named RoBuster, consisting of 41 full-text RCTs from the domains of
physiotherapy and rehabilitation. We report inter-annotator agreement (IAA) between two expert
annotators for text span annotations before and after applying visual instructions on a subset (9 out of
41) of RoBuster. We also provide IAA on bias risk judgments using Cohen's Kappa. Moreover, we utilized a portion of RoBuster (10 out of 41) to evaluate an LLM (here, GPT-3.5) on the challenging task of RoB span extraction, using a straightforward evaluation framework to demonstrate the utility of this corpus.
Results: We present a corpus of 41 RCTs with fine-grained text span annotations comprising more than 28,427 tokens belonging to 22 RoB classes. The IAA at the text span level, calculated using the F1 measure, varies from 0% to 90%, while Cohen's kappa for risk judgments ranges between -0.235 and 1.0. Employing visual instructions for annotation increases the IAA by more than 17 percentage points. The LLM (GPT-3.5) shows promising but varied observed agreement with the expert annotations across the different bias questions.
Conclusions: Despite having comprehensive bias assessment guidelines and visual instructional
placards, RoB annotation remains a complex task. Utilizing visual placards for bias assessment and
annotation enhances IAA compared to cases where visual placards are absent; however, text
annotation remains challenging for the subjective questions and the questions for which annotation
data is unavailable in RCTs. Similarly, while GPT-3.5 demonstrates effectiveness, its accuracy
diminishes with more subjective RoB questions and low information availability.
Keywords: risk of bias; corpus annotation; natural language processing; large language models;
information extraction
Introduction
Systematic reviews (SRs) synthesized from randomized controlled trials (RCTs) are considered the highest quality of evidence in the evidence pyramid. SRs help medical professionals make informed decisions about an individual's diagnosis or treatment and help governments enact informed health policies [Mog22, McT06]. An RCT is a scientific experiment aiming to evaluate the effectiveness of an intervention on patient outcomes. In these trials, patients are randomly divided and allocated to different intervention groups, and the impact of the intervention under investigation is compared to that of other interventions in a controlled setting [Sib98]. In theory, the randomized study design keeps RCTs low on bias, but in practice unavoidable biases can still creep into a trial's design, execution, or reporting. Biased clinical trials make medical practitioners systematically overestimate or underestimate the intervention effect on patient outcomes, leading to harmful health practices and policies [Kja99, Nac19]. Thus, reviewers conducting SRs must thoroughly screen RCTs for biases.
The biases in RCTs cannot be quantified, but an RCT can be assessed for biases to minimize the overall risk and judge its quality. In this study, we refer to bias assessment as risk-of-bias (RoB) assessment. There are several tools to assess RoB, including the Cochrane Collaboration's RoB tool, the Physiotherapy Evidence Database (PEDro) RoB scale, the revised Cochrane RoB 2 tool (RoB 2), AMSTAR/AMSTAR 2 (a tool to assess bias risk in SRs), the EPOC RoB tool, and several other independent checklists [Hig191, Hig11, Elk13, She17, Far19, Ste19]. These tools are structured as a series of questions aiming to elicit factual information from the RCTs, which can then be used for RoB assessment. Manual RoB assessment requires the reviewers to go through full-text RCTs and manually inspect every question from the chosen bias assessment tool. Completing the bias assessment for an individual RCT can take anywhere from a few minutes to a couple of hours, depending on the bias assessment tool and assessor expertise [Har11, Cro23]. Moreover, RoB assessment is only one part of writing systematic reviews, a highly resource-heavy process that takes about six months to several years to complete [Tse15, Kha121, Hig191]. Writing an SR takes about 3-10 months per person per review and requires a high degree of methodological expertise on the reviewer's part. The pace at which RCTs are published makes RoB assessment a lengthy process and underscores the need for automation.
Machine learning (ML) can help accelerate the assessment process by pointing reviewers directly to the parts of the RCT text relevant to identifying bias, allowing them to judge trial quality more quickly [Mar151]. Marshall et al. attempted to automate RoB assessment using a distant supervision approach supported by proprietary data from the Cochrane Database of Systematic Reviews (CDSR). They formulated trial quality assessment as binary classification into low-risk and unclear-risk/high-risk quality attributes for each risk domain. The study relied on manually entered data from the CDSR, which is behind a paywall, and automated assessment according to Cochrane's RoB 1.0 guidelines rather than the latest RoB 2 [Hig11]. Even though Cochrane's RoB tool (version 1) is the most frequently used tool to assess RCT quality, the recently revised Cochrane RoB 2 differs from it substantially [MaL20]. Compared to the original RoB version released in 2008, RoB 2 gives the RoB evaluation a more reliable and concrete structure through comprehensive guidelines that aim to increase consistency [Hig11, Ste19]. A study analyzing Cochrane systematic reviews and protocols found that the use of RoB 2 increased from 0% in 2019 to 24.1% in 2022 [Mar23]. This indicates the importance of using an updated and standardized tool to assess bias in RCTs.
Millard et al. also attempted to automate RoB assessment using supervised machine learning trained on proprietary data [Mil161]. In fact, the research utilizing this pay-walled data was used to develop RobotReviewer, which several studies have evaluated for its human-level performance [Mar16, Sob19, Vin, Jar22, Hir21]. The challenge, however, remains the lack of a publicly available RoB-annotated corpus, which hinders community efforts toward automation. Wang et al. recently released three RoB-annotated datasets, but for preclinical studies with RoB assessments pertaining to animals [Wan221]. A manually annotated corpus of RoB spans for human clinical trials is still needed.
Manual RoB assessment is a complex, expert-led task laden with subjective judgements [Min221, Har09]. Systematically translating this manual process into the development of an RoB-annotated corpus requires a carefully designed annotation scheme and detailed annotation guidelines. In a previous pilot study, we tested whether the RoB 2 guidelines could be used directly as guidelines for manually annotating a corpus of RCTs with RoB spans, using a multi-level annotation scheme adapted from those guidelines. We concluded that the assessment guidelines cannot serve as text annotation guidelines, but we did not provide alternative annotation guidelines, and the dataset we released was comparatively small, with only 10 annotated RCTs [Dhr231]. In this work, our objective is to develop clear-cut annotation guidelines for annotating RCTs with RoB spans corresponding to the RoB 2 tool for randomized controlled trials [Ste19].
Recently, large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided [Cha23]. However, no one has so far evaluated their performance on the cognitively complex task of identifying RoB text descriptions in RCTs and providing RCT quality judgments based on that text. Our contributions with this paper are five-fold.
1) We develop comprehensive annotation guidelines for annotating RCTs with risk of bias descriptions.
2) We model these annotation guidelines in the form of visual placards for ease of annotation and understanding. These placards can also be used as visual RoB assessment guidelines by trainee RoB assessors.
3) We annotate RoBuster, a corpus of 41 full-text RCTs with 22 risk-of-bias span types, which can be used to fine-tune machine learning models or LLMs and can also serve as a validation benchmark.
4) We evaluate the ability of LLMs to automatically identify the answers to the signaling questions (SQs) using prompt generation.
5) We make the visual annotation guidelines, the dataset, and the LLM prompts openly available to the scientific community.
Methods
This section describes the annotation scheme, the annotation software, and the development of the visual placards. Since no text annotation guidelines were available for the RoB span annotation task, we formulated them from scratch using the revised Cochrane RoB 2 tool [Hig191]. We first developed a draft version of our visual annotation guidelines, doubly annotated a fraction of the documents using it, and then used the conflicts identified during this exercise to refine the guidelines.
Annotation Scheme
Creating a new annotated corpus involves defining an annotation scheme or adopting an existing one. To our knowledge, the only available annotation scheme for RoB span annotation was presented in our previous work [Dhr231]. Rather than developing a new scheme, we adapt and enhance our previous approach by learning from the limitations of that study. The annotation scheme was adapted directly from the RoB 2 assessment procedure, so understanding how RoB 2 is organized is essential for understanding the annotation scheme. RoB 2 divides biases into five risk domains, each loosely corresponding to a different part of the clinical trial design. Each risk domain decomposes into several SQs, each aiming to prompt the assessor to look for relevant text evidence in the RCT and elicit a response for the bias risk judgment for that SQ (see Table 1).
Table 1. The table lists the bias domains as structured in the revised Cochrane RoB assessment tool and the number of signaling questions (SQs) in each domain.

Class   Domain                                               #SQs
RoB 1   Biases arising from the randomization process          3
RoB 2   Biases due to deviations from intended interventions   7
RoB 3   Bias due to missing outcome data                       4
RoB 4   Bias in the measurement of the outcome                 5
RoB 5   Bias in the selection of the reported result           3
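To make the resulting label space concrete, the short Python sketch below enumerates the 22 SQ identifiers implied by Table 1; the variable names are illustrative only and not part of the corpus tooling.

```python
# Signaling questions per RoB 2 domain, as listed in Table 1
SQS_PER_DOMAIN = {1: 3, 2: 7, 3: 4, 4: 5, 5: 3}

# SQ identifiers "1.1", "1.2", ..., "5.3"
SQ_IDS = [f"{domain}.{i}" for domain, n in SQS_PER_DOMAIN.items() for i in range(1, n + 1)]

assert len(SQ_IDS) == 22  # the 22 RoB span classes annotated in RoBuster
```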
The response options for bias risk (or simply risk) judgment are limited to “Yes”, “Probably yes”, “No”, “Probably no”, or “No information”. Reviewers assess these signaling questions (SQs) by examining the factual evidence in the randomized controlled trial (RCT). For example, to answer the SQ “Was the allocation sequence random?”, the reviewer reads through the RCT to identify how participants were randomized into intervention groups. If a well-executed method of randomization is identified, the reviewer answers with “yes” (indicating that the allocation sequence is random), judging the risk of bias for this signaling question as low risk. Conversely, if a poorly executed method of randomization is found, the risk of bias is deemed high risk with the response option “no” [Ste19].
In RoB span annotation, we mimic this assessment process by treating evidence text spans in the RCT as the main units of annotation. Each span corresponds to the answer to an SQ and is annotated with the most informative label. In our adopted annotation scheme, the label incorporates the SQ number and the domain it assesses (for the above example, “1.1” for the first domain and the first SQ of that domain). Additionally, the response option or risk judgement is incorporated into the label, such as “1.1 Yes Good” for a well-executed randomization procedure and “1.1 No Bad” otherwise (see Figure 1). We previously suggested collapsing the response options “Yes” and “Probably Yes” into a single “Yes”, and “No” and “Probably No” into a single “No”, to increase the inter-annotator agreement (IAA) without altering the final risk domain judgment [Dhr231]. As shown in Figure 2, responding to any SQ for risk domain 2 with either “Probably Yes” or “Yes” does not alter the final risk judgment for this domain (low, high, or some concerns). Therefore, except for some special-case SQs, we collapse these response options in this work. The result is a hierarchical span annotation scheme comprising 22 entities corresponding to the 22 SQs, each with typically two response options (“Yes” or “No”) and two risk judgments (“Good” and “Bad”). Any label containing “Good” denotes a low risk of bias, and any label containing “Bad” denotes a high risk of bias. We also removed the “No Information” response option because it is meant for situations where no text evidence can be found in the RCT to answer and label an SQ. However, for selected SQs (currently only SQ 2.1), “Probably Yes”, “Probably No”, and “No Information” remain acceptable. For instance, an RCT may report using “...random number generator and sealed envelopes for patient randomization...” but provide no information on whether the envelopes were opaque. In such situations, a “No Information” judgment is acceptable.
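To illustrate how these hierarchical labels decompose, the following minimal Python sketch parses a span label into its domain, SQ, response option, and risk judgment. The class and function names are illustrative assumptions, and the exact label strings used for the special-case SQs may differ in the released corpus.

```python
from dataclasses import dataclass

@dataclass
class RoBLabel:
    domain: int     # RoB 2 risk domain (1-5)
    sq: str         # signaling question identifier, e.g. "1.1"
    response: str   # response option, e.g. "Yes", "No", "No Information"
    judgment: str   # risk judgment: "Good" (low risk), "Bad" (high risk), or "" if absent

def parse_label(label: str) -> RoBLabel:
    """Split a span label such as '1.1 Yes Good' into its components."""
    parts = label.split()
    sq = parts[0]
    if parts[-1] in ("Good", "Bad"):
        judgment = parts[-1]
        response = " ".join(parts[1:-1])
    else:  # e.g. a special-case SQ labelled only with "No Information"
        judgment = ""
        response = " ".join(parts[1:])
    return RoBLabel(domain=int(sq.split(".")[0]), sq=sq, response=response, judgment=judgment)

# A span marking a well-executed randomization procedure
print(parse_label("1.1 Yes Good"))
# RoBLabel(domain=1, sq='1.1', response='Yes', judgment='Good')
```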
Expert Team
As mentioned earlier, RoB annotation is a complex task that requires specialized expertise. It is cognitively demanding because annotators must carefully read the entire full text of each RCT and identify 22 different bias categories for annotation. This level of complexity would not be manageable for annotators without expertise in the field. Our annotation team consisted of two researchers specializing in RoB assessment in the physiotherapy and rehabilitation domains: an epidemiology researcher (RH) and an associate professor of physiotherapy (KMS). With substantial backgrounds in physiotherapy and advanced statistical methods and experience writing SRs, both experts possessed a deep understanding of the complexities involved in bias assessment. Two additional physiotherapy experts, both senior PhD students (KG and RC), took part in developing the visual annotation guidelines. Two researchers with expertise in natural language processing (NLP) were also involved: an associate professor of computational linguistics (NN) and a PhD student in computer science (AKD). Their inclusion was important because the guidelines and placards they helped create are used to annotate a text corpus that serves as a benchmark for RoB text span extraction. Finally, JPTH provided critical feedback to shape the visual annotation placards.
Data Collection
Different outcome categories exist in SRs: subjective, objective, and mortality outcomes (a sub-category of objective outcomes). Savović et al. found that trials assessing subjective outcomes are more prone to bias; therefore, had we used only one outcome type, we would have obtained limited label types for the different risk classes [Pag16, Sav18]. In the context of RCTs, subjective outcomes are measurements that rely on individuals' perceptions, opinions, or feelings about their own health or well-being. These outcomes are typically self-reported by the trial participants and can be influenced by factors such as placebo effects, patient expectations, interpretation, and psychological factors. For example, in a study on rheumatoid arthritis, subjective outcome measures included patient-reported pain ratings [Vol20]. Objective outcomes are measurements that are independent of individual opinions or perceptions and are based on observable and measurable data. They are typically collected by trained assessors or through laboratory tests, imaging studies, or other objective methods. For instance, in a study on peripheral artery disease, objective outcome measures included angiography and molecular imaging to evaluate the effectiveness of cell therapy [Gri16]. Mortality outcomes refer to the occurrence of death during the trial. To ensure that these various outcome types are represented in the corpus, we included 17, 17, and 7 RCTs addressing objective, subjective, and mortality primary outcomes, respectively. RH assembled this dataset of 41 RCTs from the domain of physiotherapy and rehabilitation. PDFs of the full-text RCTs were retrieved, and each article was collated with its trial registry entry wherever available. Each PDF was renamed with the primary outcome to be examined using RoB 2 before being uploaded to the annotation software (tagtog and, later, PAWLS). Corpus details are given in the Supplementary material.
Visual Placards Development
The RoB 2 tool consists of an extensive, step-by-step set of instructions for answering the signaling questions, and even though the RoB 2 guidelines are widely used for bias assessment, there has been some research questioning their reliability. This reliability concern has been investigated extensively by Minozzi et al. [Min20, Min221]. They formulated specific instructions on how to approach and answer the signaling questions of RoB 2. These instructions, referred to as the Instruction Document (ID), address the subjectivity present in the RoB 2 guidelines and provide clear guidance for the assessment process. Subjectivity in assessment can result in different evaluators reaching disparate conclusions when analyzing the same trial. Before implementing the ID, the agreement among four expert RoB assessors was zero, but it improved after the ID was adopted. Several other papers have explored the subjectivity and reliability of the Cochrane RoB 1.0 and RoB 2 tools [Min221, Min20, daC17, Loe22]. With this in mind, we developed precise and clear text annotation instructions based on the RoB 2 tool, with the aim of maintaining consistency and reliability among annotators. Working closely with our team of experts, we formatted these instructions into visual instructional placards. Each placard takes the form of a flowchart and provides instructions for annotating RCT text to answer one SQ. The flowchart also provides instructions on labelling the annotated text with a risk judgment. The RoB 2 SQs are broadly factual but leave room for subjective judgements, and our visual placards aim to facilitate judgements about the risk of bias.
Annotation
For every SQ, the annotators were guided to use the complete RoB 2 guidance document along with the visual placards we developed. They followed these instructions meticulously, working through each placard's signaling question one by one. The instructions directed the annotators to read the specific sections of the full-text RCT that needed annotation. Their task involved identifying and highlighting the text relevant to answering the signaling question. Note that Domain 2 of RoB 2 focuses on assessing the risk of bias due to deviations from the intended intervention. This domain evaluates both the effect of assignment to the intervention and the effect of adherence to the intervention, and RoB 2 offers a distinct set of SQs for each aspect. In our study, we specifically assess Domain 2 for the effect of assignment to the intervention and therefore only address the signaling questions corresponding to this aspect.
Tagtog, a commercial text annotation web application that allows annotating PDF (Portable Document Format) documents, was used for the annotation [Cej14]. Of the 41 documents, 9 were doubly annotated by two experienced annotators (RH and KMS) to calculate inter-annotator agreement (IAA), and the rest were singly annotated by RH. After double annotation, we performed conflict resolution to address conflicting annotations, which helped us further calibrate the visual placards. The conflict resolution was followed by annotating 51 additional RCTs.
After the 9 doubly annotated RCTs, we switched to the PAWLS annotation tool (see Figure 3), which allows users to annotate PDFs for free [Neu21]. We chose to annotate PDFs rather than plain text because RCT PDFs have a visual format that is lost upon conversion to text. For example, the structure pertaining to sections, subsections, tables, and figures makes the annotation task quicker for the annotators and increases annotation quality. After annotation, feedback was collected from both annotators; details can be found in the supplementary material.
Inter-Annotator Agreement
We report IAA at two levels. First, we check whether the annotators agree on the text spans that answer the SQs, using the pairwise F1 measure. The F1 measure disregards out-of-span tokens (unannotated tokens) during the agreement calculation and is an ideal measure of annotation reliability for token-level annotation tasks. It is computed as shown below for each pair of annotators, treating one annotator's labels as the “true” labels and the other annotator's labels as the “predicted” labels [Del12, Bra201].
\[ \text{F1-measure} = \frac{2 \times \text{True Positive}}{2 \times \text{True Positive} + \text{False Positive} + \text{False Negative}} \]
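To illustrate how this pairwise measure can be computed over token-level annotations, here is a minimal Python sketch; the token-index representation and the function name are illustrative assumptions rather than the released evaluation code.

```python
def pairwise_f1(ann_a: dict[int, str], ann_b: dict[int, str]) -> float:
    """Token-level pairwise F1 between two annotators.

    Each argument maps a token index to its span label (unannotated tokens
    are simply absent). Annotator A is treated as 'true' and B as 'predicted';
    the measure is symmetric because swapping the roles only swaps FP and FN.
    """
    tp = sum(1 for idx, label in ann_b.items() if ann_a.get(idx) == label)
    fp = len(ann_b) - tp                          # B labelled, A disagrees or left blank
    fn = sum(1 for idx, label in ann_a.items()
             if ann_b.get(idx) != label)          # A labelled, B disagrees or left blank
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0

# Toy example: the annotators agree on two of the labelled tokens
a = {10: "1.1 Yes Good", 11: "1.1 Yes Good", 12: "1.2 No Bad"}
b = {10: "1.1 Yes Good", 11: "1.1 Yes Good", 13: "1.2 No Bad"}
print(round(pairwise_f1(a, b), 2))  # 0.67
```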
Second, we check how strongly the annotators agree on the risk judgment for each SQ using the prevalence- and bias-adjusted kappa (PABAK), κ_pabak, and compare it with the raw percent agreement. κ_pabak is a standard annotation reliability measure for many classification annotation tasks and is suitable for measuring reliability at the risk judgment level. It is an extension of Cohen's kappa (κ) that accounts for prevalence and bias in the agreement. We interpret both IAA measures as shown in Table 2 [McH12, Coh60, Bia93, Lan77].
Table 2. The table details the interpretation of the pairwise F1-measure (left), κ_pabak (middle), and observed or raw agreement (right).

F1-measure (%)             κ_pabak                        Raw agreement
Poor: 0-0.99               No agreement: <=0              None: 0
Slight: 1-20.99            None to slight: 0.01-0.20      Very low: 1-10%
Fair: 21-40.99             Minimal: 0.21-0.39             Low: 11-30%
Good: 41-60.99             Weak: 0.40-0.59                Moderate: 31-50%
Substantial: 61-80.99      Moderate: 0.60-0.79            High: 51-70%
Almost perfect: 81-99.99   Strong: 0.80-0.90              Very high: 71-90%
Perfect: 100               Almost perfect: >=0.90         Perfect: >90%
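As a brief aside not given in the original methods text, the standard prevalence- and bias-adjusted formulation relates κ_pabak directly to the observed agreement p_o for two raters and k judgment categories:

\[ \kappa_{\text{pabak}} = \frac{k \, p_o - 1}{k - 1}, \qquad \text{which reduces to } 2 p_o - 1 \text{ for two categories.} \]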
Large Language Model Evaluation
Our annotation guidelines and annotations were adapted for benchmarking supervised machine
learning approaches and not LLMs. So even though we were annotating PDFs, we had to restrict a
lot of annotations based on the assumption that PDF wi