Procedures for Performing
Systematic Reviews
Barbara Kitchenham
Joint Technical Report
Software Engineering Group
Department of Computer Science
Keele University
Keele, Staffs
Keele University Technical Report TR/SE-0401
Empirical Software Engineering
National ICT Australia Ltd.
Bay 15 Locomotive Workshop
Australian Technology Park
Garden Street, Eversleigh
NSW 1430, Australia
NICTA Technical Report 0400011T.1
July, 2004
© Kitchenham, 2004
0. Document Control Section
0.1 Contents
0. Document Control Section
0.1 Contents
0.2 Document Version Control
0.3 Executive Summary
1. Introduction
2. Systematic Reviews
2.1 Reasons for Performing Systematic Reviews
2.2 The Importance of Systematic Reviews
2.3 Advantages and disadvantages
2.4 Features of Systematic Reviews
3. The Review Process
4. Planning
4.1 The need for a systematic review
4.2 Development of a Review Protocol
4.2.1 The Research Question
    Question Types
    Question Structure
        Population
        Intervention
        Outcomes
        Experimental designs
4.2.2 Protocol Review
5. Conducting the review
5.1 Identification of Research
5.1.1 Generating a search strategy
5.1.2 Publication Bias
5.1.3 Bibliography Management and Document Retrieval
5.1.4 Documenting the Search
5.2 Study Selection
5.2.1 Study selection criteria
5.2.2 Study selection process
5.2.3 Reliability of inclusion decisions
5.3 Study Quality Assessment
5.3.1 Quality Thresholds
5.3.2 Development of Quality Instruments
5.3.3 Using the Quality Instrument
5.3.4 Limitations of Quality Assessment
5.4 Data Extraction
5.4.1 Design of Data Extraction Forms
5.4.2 Contents of Data Collection Forms
5.4.3 Data extraction procedures
5.4.4 Multiple publications of the same data
5.4.5 Unpublished data, missing data and data requiring manipulation
5.5 Data Synthesis
5.5.1 Descriptive synthesis
5.5.2 Quantitative Synthesis
5.5.3 Presentation of Quantitative Results
5.5.4 Sensitivity analysis
5.5.5 Publication bias
6. Reporting the review
6.1 Structure for systematic review
6.2 Peer Review
7. Final remarks
8. References
Appendix 1 Steps in a systematic review
0.2 Document Version Control
Status     Version  Date            Changes from previous version
Draft      0.1      1 April 2004    None
Published  1.0      29 June 2004    Correction of typos; additional discussion
                                    of problems of assessing evidence;
                                    Section 7 "Final Remarks"
Revision   1.1      17 August 2005  Corrections of typos
0.3 Executive Summary
The objective of this report is to propose a guideline for systematic reviews
appropriate for software engineering researchers, including PhD students. A
systematic review is a means of evaluating and interpreting all available research
relevant to a particular research question, topic area, or phenomenon of interest.
Systematic reviews aim to present a fair evaluation of a research topic by using a
trustworthy, rigorous, and auditable methodology.
The guideline presented in this report was derived from three existing guidelines used
by medical researchers. The guideline has been adapted to reflect the specific
problems of software engineering research.
The guideline covers three phases of a systematic review: planning the review,
conducting the review and reporting the review. It is at a relatively high level. It does
not consider the impact of question type on the review procedures, nor does it specify
in detail mechanisms needed to undertake meta-analysis.
1. Introduction
This document presents a general guideline for undertaking systematic reviews. The
goal of this document is to introduce the concept of rigorous reviews of current
empirical evidence to the software engineering community. It is aimed at software
engineering researchers including PhD students. It does not cover details of meta-
analysis (a statistical procedure for synthesising quantitative results from different
studies), nor does it discuss the implications that different types of systematic review
questions have on systematic review procedures.
The document is based on a review of three existing guidelines for systematic reviews:
1. The Cochrane Reviewer’s Handbook [4].
2. Guidelines prepared by the Australian National Health and Medical Research
Council [1] and [2].
3. CRD Guidelines for those carrying out or commissioning reviews [12].
In particular the structure of this document owes much to the CRD Guidelines.
All these guidelines are intended to aid medical researchers. This document attempts
to adapt the medical guidelines to the needs of software engineering researchers. It
discusses a number of issues where software engineering research differs from
medical research. In particular, software engineering research has relatively little
empirical research compared with the large quantities of research available on
medical issues, and research methods used by software engineers are not as rigorous
as those used by medical researchers.
The structure of the report is as follows:
1. Section 2 provides an introduction to systematic reviews as a significant
research method.
2. Section 3 specifies the stages in a systematic review.
3. Section 4 discusses the planning stages of a systematic review.
4. Section 5 discusses the stages involved in conducting a systematic review.
5. Section 6 discusses reporting a systematic review.
2. Systematic Reviews
A systematic literature review is a means of identifying, evaluating and interpreting
all available research relevant to a particular research question, or topic area, or
phenomenon of interest. Individual studies contributing to a systematic review are
called primary studies; a systematic review is a form of secondary study.
2.1 Reasons for Performing Systematic Reviews
There are many reasons for undertaking a systematic review. The most common
reasons are:
To summarise the existing evidence concerning a treatment or technology e.g. to
summarise the empirical evidence of the benefits and limitations of a specific
agile method.
To identify any gaps in current research in order to suggest areas for further investigation.
To provide a framework/background in order to appropriately position new
research activities.
However, systematic reviews can also be undertaken to examine the extent to which
empirical evidence supports/contradicts theoretical hypotheses, or even to assist the
generation of new hypotheses (see for example [10]).
2.2 The Importance of Systematic Reviews
Most research starts with a literature review of some sort. However, unless a literature
review is thorough and fair, it is of little scientific value. This is the main rationale for
undertaking systematic reviews. A systematic review synthesises existing work in a
manner that is fair and seen to be fair. For example, systematic reviews must be
undertaken in accordance with a predefined search strategy. The search strategy must
allow the completeness of the search to be assessed. In particular, researchers
performing a systematic review must make every effort to identify and report research
that does not support their preferred research hypothesis as well as identifying and
reporting research that supports it.
2.3 Advantages and disadvantages
Systematic reviews require considerably more effort than traditional reviews. Their
major advantage is that they provide information about the effects of some
phenomenon across a wide range of settings and empirical methods. If studies give
consistent results, systematic reviews provide evidence that the phenomenon is robust
and transferable. If the studies give inconsistent results, sources of variation can be studied.
A second advantage, in the case of quantitative studies, is that it is possible to
combine data using meta-analytic techniques. This increases the likelihood of
detecting real effects that individual smaller studies are unable to detect. However,
increased power can also be a disadvantage, since it is possible to detect small biases
as well as true effects.
2.4 Features of Systematic Reviews
Some of the features that differentiate a systematic review from a conventional
literature review are:
Systematic reviews start by defining a review protocol that specifies the research
question being addressed and the methods that will be used to perform the review.
Systematic reviews are based on a defined search strategy that aims to detect as
much of the relevant literature as possible.
Systematic reviews document their search strategy so that readers can assess its
rigour and completeness.
Systematic reviews require explicit inclusion and exclusion criteria to assess each
potential primary study.
Systematic reviews specify the information to be obtained from each primary
study including quality criteria by which to evaluate each primary study.
A systematic review is a prerequisite for quantitative meta-analysis.
3. The Review Process
A systematic review involves several discrete activities. Existing guidelines for
systematic reviews have different suggestions about the number and order of activities
(see Appendix 1). This document summarises the stages in a systematic review into
three main phases: Planning the Review, Conducting the Review, Reporting the Review.
The stages associated with planning the review are:
1. Identification of the need for a review
2. Development of a review protocol.
The stages associated with conducting the review are:
1. Identification of research
2. Selection of primary studies
3. Study quality assessment
4. Data extraction & monitoring
5. Data synthesis.
Reporting the review is a single stage phase.
Each phase is discussed in detail in the following sections. Other activities identified
in the guidelines discussed in Appendix 1 are outside the scope of this document.
The stages listed above may appear to be sequential, but it is important to recognise
that many of the stages involve iteration. In particular, many activities are initiated
during the protocol development stage, and refined when the review proper takes
place. For example:
The selection of primary studies is governed by inclusion and exclusion criteria.
These criteria are initially specified when the protocol is defined but may be
refined after quality criteria are defined.
Data extraction forms initially prepared during construction of the protocol will
be amended when quality criteria are agreed.
Data synthesis methods defined in the protocol may be amended once data has
been collected.
The systematic reviews road map prepared by the Systematic Reviews Group at
Berkeley demonstrates the iterative nature of the systematic review process very
clearly [15].
4. Planning
4.1 The need for a systematic review
The need for a systematic review arises from the requirement of researchers to
summarise all existing information about some phenomenon in a thorough and
unbiased manner. This may be in order to draw more general conclusions about some
phenomenon than is possible from individual studies, or as a prelude to further
research activities.
Prior to undertaking a systematic review, researchers should ensure that a systematic
review is necessary. In particular, researchers should identify and review any existing
systematic reviews of the phenomenon of interest against appropriate evaluation
criteria. CRD [12] suggests the following checklist:
What are the review’s objectives?
What sources were searched to identify primary studies? Were there any restrictions?
What were the inclusion/exclusion criteria and how were they applied?
What criteria were used to assess the quality of primary studies and how were
they applied?
How were the data extracted from the primary studies?
How were the data synthesised? How were differences between studies
investigated? How were the data combined? Was it reasonable to combine the
studies? Do the conclusions flow from the evidence?
From a more general viewpoint, Greenhalgh [9] suggests the following questions:
Can you find an important clinical question, which the review addressed?
(Clearly, in software engineering, this should be adapted to refer to an important
software engineering question.)
Was a thorough search done of the appropriate databases and were other
potentially important sources explored?
Was methodological quality assessed and the trials weighted accordingly?
How sensitive are the results to the way that the review has been done?
Have numerical results been interpreted with common sense and due regard to the
broader aspects of the problem?
4.2 Development of a Review Protocol
A review protocol specifies the methods that will be used to undertake a specific
systematic review. A pre-defined protocol is necessary to reduce the possibility of
researcher bias. For example, without a protocol, it is possible that the selection of
individual studies or the analysis may be driven by researcher expectations. In
medicine, review protocols are usually submitted to peer review.
The components of a protocol include all the elements of the review plus some
additional planning information:
Background. The rationale for the survey.
The research questions that the review is intended to answer.
The strategy that will be used to search for primary studies including search terms
and resources to be searched; resources include databases, specific journals, and
conference proceedings. An initial scoping study can help determine an
appropriate strategy.
Study selection criteria and procedures. Study selection criteria determine which
studies are included in, or excluded from, the systematic review. It is usually
helpful to pilot the selection criteria on a subset of primary studies. The protocol
should describe how the criteria will be applied e.g. how many assessors will
evaluate each prospective primary study, and how disagreements among assessors
will be resolved.
Study quality assessment checklists and procedures. The researchers should
develop quality checklists to assess the individual studies. The purpose of the
quality assessment will guide the development of checklists.
Data extraction strategy. This should define how the information required from
each primary study will be obtained. If the data require manipulation or
assumptions and inferences to be made, the protocol should specify an
appropriate validation process.
Synthesis of the extracted data. This should define the synthesis strategy. This
should clarify whether or not a formal meta-analysis is intended and if so what
techniques will be used.
Project timetable. This should define the review plan.
4.2.1 The Research Question
 Question Types
The most important activity during protocol development is to formulate the research question. The
Australian NHMR Guidelines [1] identify six types of health care questions that can
be addressed by systematic reviews:
1. Assessing the effect of an intervention.
2. Assessing the frequency or rate of a condition or disease.
3. Determining the performance of a diagnostic test.
4. Identifying aetiology and risk factors.
5. Identifying whether a condition can be predicted.
6. Assessing the economic value of an intervention or procedure.
In software engineering, it is not clear what the equivalent of a diagnostic test would
be, but the other questions can be adapted to software engineering issues as follows:
Assessing the effect of a software engineering technology.
Assessing the frequency or rate of a project development factor such as the
adoption of a technology, or the frequency or rate of project success or failure.
Identifying cost and risk factors associated with a technology.
Identifying the impact of technologies on reliability, performance and cost
Cost benefit analysis of software technologies.
Medical guidelines often provide different advice and procedures for different
types of question. This document does not go to this level of detail.
The critical issue in any systematic review is to ask the right question. In this context,
the right question is usually one that:
Is meaningful and important to practitioners as well as researchers. For example,
researchers might be interested in whether a specific analysis technique leads to a
significantly more accurate estimate of remaining defects after design inspections.
However, a practitioner might want to know whether adopting a specific analysis
technique to predict remaining defects is more effective than expert opinion at
identifying design documents that require re-inspection.
Will lead either to changes in current software engineering practice or to
increased confidence in the value of current practice. For example, researchers
and practitioners would like to know under what conditions a project can safely
adopt agile technologies and under what conditions it should not.
Identifies discrepancies between commonly held beliefs and reality.
Nonetheless, there are systematic reviews that ask questions that are primarily of
interest to researchers. Such reviews ask questions that identify and/or scope future
research activities. For example, a systematic review in a PhD thesis should identify
the existing basis for the research student’s work and make it clear where the
proposed research fits into the current body of knowledge.
 Question Structure
Medical guidelines recommend considering a question from three viewpoints:
The population, i.e. the people affected by the intervention.
The interventions, usually a comparison between two or more alternative treatments.
The outcomes, i.e. the clinical and economic factors that will be used to compare
the interventions.
In addition, study designs appropriate to answering the review questions may be identified.
 Population
In software engineering experiments, the populations might be any of the following:
A specific software engineering role e.g. testers, managers.
A category of software engineer, e.g. a novice or experienced engineer.
An application area e.g. IT systems, command and control systems.
A question may refer to very specific population groups e.g. novice testers, or
experienced software architects working on IT systems. In medicine the populations
are defined in order to reduce the number of prospective primary studies. In software
engineering far fewer primary studies are undertaken, thus, we may need to avoid any
restriction on the population until we come to consider the practical implications of
the systematic review.
 Intervention
Interventions will be software technologies that address specific issues, for example,
technologies to perform specific tasks such as requirements specification, system
testing, or software cost estimation.
 Outcomes
Outcomes should relate to factors of importance to practitioners such as improved
reliability, reduced production costs, and reduced time to market. All relevant
outcomes should be specified. For example, in some cases we require interventions
that improve some aspect of software production without affecting another e.g.
improved reliability with no increase in cost.
A particular problem for software engineering experiments is the use of surrogate
measures for example, defects found during system testing as a surrogate for quality,
or coupling measures for design quality. Studies that use surrogate measures may be
misleading and conclusions based on such studies may be less robust.
 Experimental designs
In medical studies, researchers may be able to restrict systematic reviews to primary
studies of one particular type. For example, Cochrane reviews are usually restricted to
randomised controlled trials (RCTs). In other circumstances, the nature of the
question and the central issue being addressed may suggest that certain study designs
are more appropriate than others. However, this approach can only be taken in a
discipline where the large number of research papers is a major problem. In software
engineering, the paucity of primary studies is more likely to be the problem for
systematic reviews and we are more likely to need protocols for aggregating
information from studies of widely different types. A starting point for such
aggregation is the ranking of primary studies of different types; this is discussed in
Section 5.3.1.
4.2.2 Protocol Review
The protocol is a critical element of any systematic review. Researchers must agree a
procedure for reviewing the protocol. If appropriate funding is available, a group of
independent experts should be asked to review the protocol. The same experts can
later be asked to review the final report.
PhD students should present their protocol to their supervisors for review and criticism.
5. Conducting the review
Once the protocol has been agreed, the review proper can start. This involves:
1. Identification of research
2. Selection of studies
3. Study quality assessment
4. Data extraction and monitoring progress
5. Data synthesis
Each of these stages will be discussed in this section. Although some stages must
proceed sequentially, some stages can be undertaken simultaneously.
5.1 Identification of Research
The aim of a systematic review is to find as many primary studies relating to the
research question as possible using an unbiased search strategy. For example, it is
necessary to avoid language bias. The rigour of the search process is one factor that
distinguishes systematic reviews from traditional reviews.
5.1.1 Generating a search strategy
It is necessary to determine and follow a search strategy. This should be developed in
consultation with librarians. Search strategies are usually iterative and benefit from:
Preliminary searches aimed at both identifying existing systematic reviews and
assessing the volume of potentially relevant studies.
Trial searches using various combinations of search terms derived from the
research question
Reviews of research results
Consultations with experts in the field
A general approach is to break down the question into individual facets, i.e.
population, intervention, outcomes, study designs. Then draw up a list of synonyms,
abbreviations, and alternative spellings. Other terms can be obtained by considering
subject headings used in journals and databases. Sophisticated search strings can then
be constructed using Boolean ANDs and ORs.
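The facet-and-synonym approach can be sketched mechanically. The facets and terms below are invented examples for illustration only, not a recommended search:

```python
# Facets of a (hypothetical) research question, each expanded into synonyms,
# abbreviations, and wildcard spellings.
facets = {
    "population": ["software engineer*", "developer*", "programmer*"],
    "intervention": ["pair programming", "collaborative programming"],
    "outcome": ["defect*", "fault*", "error*", "quality"],
}

def build_search_string(facets):
    """OR the synonyms within each facet, then AND the facets together."""
    groups = ["(" + " OR ".join(terms) + ")" for terms in facets.values()]
    return " AND ".join(groups)

print(build_search_string(facets))
# → (software engineer* OR developer* OR programmer*) AND
#   (pair programming OR collaborative programming) AND
#   (defect* OR fault* OR error* OR quality)
```

The resulting string would still need adapting to the query syntax of each individual database.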
Initial searches for primary studies can be undertaken using electronic databases,
but this is not sufficient. Other sources of evidence must also be searched
(sometimes manually) including:
Reference lists from relevant primary studies and review articles
Journals (including company journals such as the IBM Journal of Research and
Development), grey literature (i.e. technical reports, work in progress) and
conference proceedings
Research registers
The Internet.
It is also important to identify specific researchers to approach directly for advice on
appropriate source material.
Medical researchers have developed pre-packaged search strategies. Software
engineering researchers need to develop and publish such strategies, including
identification of relevant electronic databases.
5.1.2 Publication Bias
Publication bias refers to the problem that positive results are more likely to be
published than negative results. The concept of positive or negative results sometimes
depends on the viewpoint of the researcher. (For example, evidence that full
mastectomies were not always required for breast cancer was actually an extremely
positive result for breast cancer sufferers). However, publication bias remains a
problem particularly for formal experiments, where failure to reject the null
hypothesis is considered less interesting than an experiment that is able to reject the
null hypothesis.
Publication bias can lead to systematic bias in systematic reviews unless special
efforts are made to address this problem. Many of the standard search strategies
identified above are used to address this issue including:
Scanning the grey literature
Scanning conference proceedings
Contacting experts and researchers working in the area and asking them if they
know of any unpublished results.
In addition, statistical analysis techniques can be used to identify the potential
significance of publication bias (see Section 5.5.5).
5.1.3 Bibliography Management and Document Retrieval
Bibliographic packages such as Reference Manager or EndNote are very useful for
managing the large number of references that can be obtained from a thorough
literature search.
Once reference lists have been finalised the full articles of potentially useful studies
will need to be obtained. A logging system is needed to make sure all relevant studies
are obtained.
5.1.4 Documenting the Search
The process of performing a systematic review must be transparent and replicable:
The review must be documented in sufficient detail for readers to be able to
assess the thoroughness of the search.
The search should be documented as it occurs and changes noted and justified.
The unfiltered search results should be saved and retained for possible reanalysis.
Procedures for documenting the search process are given in Table 1.
Table 1 Search process documentation

Data Source             Documentation
Electronic database     Name of database
                        Search strategy for each database
                        Date of search
                        Years covered by search
Journal hand searches   Name of journal
                        Years searched
                        Any issues not searched
Conference proceedings  Title of proceedings
                        Name of conference (if different)
                        Title translation (if necessary)
                        Journal name (if published as part of a journal)
Efforts to identify     Research groups and researchers contacted
unpublished studies     (names and contact details)
                        Research web sites searched (date and URL)
Other sources           Date searched/contacted
                        Any specific conditions pertaining to the search
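The Table 1 items for an electronic database search can be captured as a structured record, so that the unfiltered results are retained for possible reanalysis. The field names and values below are illustrative, not prescribed by this report:

```python
from dataclasses import dataclass

@dataclass
class DatabaseSearch:
    """One Table 1 entry documenting a search of an electronic database."""
    database: str        # name of database
    search_string: str   # search strategy used for this database
    date_of_search: str  # date the search was run
    years_covered: str   # years covered by the search
    hits: int = 0        # unfiltered result count, saved for reanalysis

log = DatabaseSearch(
    database="IEEE Xplore",  # hypothetical example
    search_string="(inspection OR review) AND defect*",
    date_of_search="2004-07-01",
    years_covered="1990-2004",
    hits=148,
)
print(log.database, log.hits)
```

Keeping one such record per database makes the search transparent and replicable, as Section 5.1.4 requires.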
5.2 Study Selection
Once the potentially relevant primary studies have been obtained, they need to be
assessed for their actual relevance.
5.2.1 Study selection criteria
Study selection criteria are intended to identify those primary studies that provide
direct evidence about the research question. In order to reduce the likelihood of bias,
selection criteria should be decided during the protocol definition.
Inclusion and exclusion criteria should be based on the research question. They
should be piloted to ensure that they can be reliably interpreted and that they classify
studies correctly.
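Piloting the criteria can be as simple as encoding them as a predicate and running it over a trial set of candidate studies. The criteria and record fields below are hypothetical, purely for illustration:

```python
# Hypothetical inclusion criteria encoded as a predicate for piloting.
def include(study):
    """True if a candidate study meets the (illustrative) inclusion criteria."""
    return (
        study["year"] >= 1990                              # within the review period
        and study["type"] in {"experiment", "case study"}  # empirical studies only
    )

candidates = [
    {"id": "S1", "year": 1995, "type": "experiment"},
    {"id": "S2", "year": 1985, "type": "experiment"},  # excluded: too early
    {"id": "S3", "year": 2001, "type": "opinion"},     # excluded: not empirical
]
selected = [s["id"] for s in candidates if include(s)]
print(selected)  # → ['S1']
```

Running the predicate over a pilot set quickly exposes criteria that are ambiguous or that misclassify known studies.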
It is important to avoid, as far as possible, exclusions based on the language of the
primary study. It is often possible to cope with French or German abstracts, but
Japanese or Chinese papers are often difficult to access unless they have a well-structured English abstract.
It is possible that inclusion decisions could be affected by knowledge of the
authors, institutions, journals or year of publication. Some medical researchers
have suggested reviews should be done after such information has been removed.
However, it takes time to do this and experimental evidence suggests that
masking the origin of primary studies does not improve reviews [3].
5.2.2 Study selection process
Study selection is a multistage process. Initially, selection criteria should be
interpreted liberally, so that unless studies identified by the electronic and hand
searches can be clearly excluded based on titles and abstracts, full copies should be obtained.
Final inclusion/exclusion decisions should be made after the full texts have been
retrieved. It is useful to maintain a list of excluded studies identifying the reason for exclusion.
5.2.3 Reliability of inclusion decisions
When two or more researchers assess each paper, agreement between researchers can
be measured using the Cohen Kappa statistic [6]. Each disagreement must be
discussed and resolved. This may be a matter of referring back to the protocol or may
involve writing to the authors for additional information. Uncertainty about the
inclusion/exclusion of some studies should be investigated by sensitivity analysis.
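The Kappa calculation itself is straightforward: observed agreement is compared with the agreement expected by chance from each rater's marginal frequencies. A minimal sketch, with invented decision data, is:

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters' paired categorical judgements."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: proportion of items both raters labelled the same.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# Two reviewers' include/exclude decisions on ten candidate papers (invented).
a = ["inc", "inc", "exc", "inc", "exc", "exc", "inc", "inc", "exc", "inc"]
b = ["inc", "inc", "exc", "exc", "exc", "exc", "inc", "inc", "exc", "inc"]
print(round(cohen_kappa(a, b), 2))  # → 0.8
```

A Kappa of 1 indicates perfect agreement and 0 indicates agreement no better than chance; low values signal that the selection criteria need discussion or refinement.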
A single researcher (such as a PhD student) should consider discussing included and
excluded papers with an expert panel.
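As an illustration of how such agreement might be computed, the following sketch calculates the Cohen Kappa statistic from two reviewers' include/exclude decisions (the data and function names are invented for the example):

```python
from collections import Counter

def cohen_kappa(decisions_a, decisions_b):
    """Cohen's kappa for two raters making categorical decisions."""
    assert len(decisions_a) == len(decisions_b)
    n = len(decisions_a)
    # Observed proportion of papers on which the two reviewers agree
    observed = sum(a == b for a, b in zip(decisions_a, decisions_b)) / n
    # Expected chance agreement, from each reviewer's marginal proportions
    freq_a = Counter(decisions_a)
    freq_b = Counter(decisions_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two reviewers screening ten candidate papers (I = include, E = exclude)
reviewer1 = ["I", "I", "E", "E", "I", "E", "I", "E", "E", "I"]
reviewer2 = ["I", "E", "E", "E", "I", "E", "I", "E", "I", "I"]
kappa = cohen_kappa(reviewer1, reviewer2)  # 8/10 observed agreement -> kappa 0.6
```

Values of kappa near 1 indicate agreement well beyond chance; disagreements behind a low kappa should be discussed and resolved as described above.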
5.3 Study Quality Assessment
In addition to general inclusion/exclusion criteria, it is generally considered important
to assess the “quality” of primary studies:
To provide still more detailed inclusion/exclusion criteria.
To investigate whether quality differences provide an explanation for differences
in study results.
As a means of weighting the importance of individual studies when results are
being synthesised.
To guide the interpretation of findings and determine the strength of inferences.
To guide recommendations for further research.
An initial difficulty is that there is no agreed definition of study “quality”. However,
the CRD Guidelines [12] and the Cochrane Reviewers’ Handbook [4] both suggest
that quality relates to the extent to which the study minimises bias and maximises
internal and external validity (see Table 2).
Table 2 Quality concept definitions
Bias (synonym: systematic error): A tendency to produce results that depart
systematically from the ‘true’ results. Unbiased results are internally valid.
Internal validity (synonym: validity): The extent to which the design and conduct of
the study are likely to prevent systematic error. Internal validity is a prerequisite
for external validity.
External validity (synonyms: generalisability, applicability): The extent to which the
effects observed in the study are applicable outside of the study.
5.3.1 Quality Thresholds
The CRD Guidelines [12] suggest using an assessment of study design to guarantee a
minimum level of quality. The Australian National Health and Medical Research
Council guidelines [2] suggest that study design is considered during assessment of
evidence rather than during the appraisal and selection of studies. Both groups
however suggest a hierarchy of study designs (see Table 3 and Table 4).
Table 3 CRD Hierarchy of evidence
Level Description
1 Experimental studies (i.e. RCT with concealed allocation)
2 Quasi-experimental studies (i.e. studies without randomisation)
3 Controlled observational studies
3a Cohort studies
3b Case control studies
4 Observational studies without control groups
5 Expert opinion based on theory, laboratory research or consensus
Table 4 Australian NHMRC Study design hierarchy
Level I Evidence obtained from a systematic review of all relevant randomised trials
Level II Evidence obtained from at least one properly-designed randomised controlled trial
Level III-1 Evidence obtained from well-designed pseudo-randomised controlled trials (i.e. non-
random allocation to treatment)
Level III-2 Evidence obtained from comparative studies with concurrent controls and allocation
not randomised, cohort studies, case-control studies or interrupted time series with a
control group.
Level III-3 Evidence obtained from comparative studies with historical control, two or more
single arm studies, or interrupted time series without a parallel control group
Level IV Evidence obtained from case series, either post-test or pretest/post-test
In order to understand Table 3 and Table 4, some additional definitions of study
types are given in Table 5. Experimental studies are those in which some conditions,
particularly the allocation of participants to different treatment groups, are under the
control of the investigator; observational studies are those in which uncontrolled
variation in treatment or exposure among study participants is investigated.
Although the definitions given in Table 5 appear appropriate to software engineering
studies (replacing the word disease with condition), it is important to note one critical
difference between medical experiments and software engineering experiments. Most
experiments performed in academic settings cannot be equated to randomised
controlled trials (RCTs) in medicine.
Table 5 Definition of study designs
Randomised controlled trial (RCT) (synonym: clinical trial; experiment): An
experiment in which investigators randomly allocate eligible people into
intervention groups.
Quasi-randomised controlled trial (experiment): A study in which the allocation of
participants to different intervention groups is controlled by the investigator but
the method falls short of genuine randomisation and allocation concealment.
Cohort study (synonym: follow-up study; observation): An observational study in
which a defined group of people (the cohort) is followed over time. The outcomes
of people in subsets are compared, to examine for example people who were
exposed or not exposed (or exposed at different levels) to a particular intervention.
Prospective cohort study (observation): A study where a cohort is assembled in the
present and followed into the future [12].
Retrospective (historical) cohort study (observation): A study where a cohort is
identified from past records and followed from that time to the present.
Case-control study (observation): Subjects with the outcome or disease and an
appropriate group of controls without the outcome or disease are selected and
information is obtained about the previous exposure to the treatment or other
factor being studied.
Historical control study (observation): Outcomes for a prospectively collected group
of subjects exposed to a new treatment/intervention are compared with either a
previously published series or previously treated subjects at the same institution.
Interrupted time series (observation): Trends in the outcomes or diseases are
compared over multiple time points before and after introduction of the
treatment/intervention or other factor being studied.
Cross-sectional study (observation): Examination of relationships between diseases
and other variables of interest as they exist in a defined population at one
particular time.
Case series (observation): A group of subjects are exposed to the treatment or
intervention [2].
Post-test only case series (observation): A case series where only outcomes after the
intervention are recorded, so no comparisons can be made.
Pre-test/post-test case series (synonym: before-and-after study; observation): A case
series where outcomes are measured in subjects before and after exposure to the
treatment/intervention.
RCTs involve real patients with real diseases receiving a new treatment to manage
their condition. That is, RCTs are trials of a treatment under its actual conditions of use.
The majority of academic experiments in Software Engineering involve students
doing constrained tasks in artificial environments. Thus, the major issue for software
engineering study hierarchies is whether small-scale experiments are considered the
equivalent of laboratory experiments and evaluated at the lowest level of evidence, or
whether they should be ranked higher. In my opinion, they should be ranked higher
than expert opinion. I would consider them equivalent in value to case series or
observational studies without controls. Two other issues that need to be resolved are:
Whether or not systematic reviews are included in the hierarchy.
Whether or not expert opinion is included in the hierarchy.
The inclusion of systematic reviews depends on whether you are classifying
individual studies or assessing the level of evidence. For assessing individual primary
studies, systematic reviews are, of course, excluded. For assessing the level of
evidence, systematic reviews should be considered the highest level of evidence.
However, in contrast to the implication in the Australian Hierarchy in Table 4, I
believe software engineers must consider systematic reviews of many types of
primary study, not only randomised controlled trials.
The Australian NHMRC guidelines [2] do not include expert opinion in their
hierarchy. The authors remark that the exclusion is a result of studies identifying the
fallibility of expert opinion. In software engineering we may have little empirical
evidence, so may have to rely more on expert opinion than medical researchers.
However, we need to recognise the weakness of such evidence.
Table 6 Study design hierarchy for Software Engineering
1 Evidence obtained from at least one properly-designed randomised controlled trial
2 Evidence obtained from well-designed pseudo-randomised controlled trials (i.e. non-
random allocation to treatment)
3-1 Evidence obtained from comparative studies with concurrent controls and allocation
not randomised, cohort studies, case-control studies or interrupted time series with a
control group.
3-2 Evidence obtained from comparative studies with historical control, two or more
single arm studies, or interrupted time series without a parallel control group
4-1 Evidence obtained from a randomised experiment performed in an artificial setting
4-2 Evidence obtained from case series, either post-test or pre-test/post-test
4-3 Evidence obtained from a quasi-random experiment performed in an artificial setting
5 Evidence obtained from expert opinion based on theory or consensus
These considerations lead to the hierarchy shown in Table 6 for Software
Engineering studies. This table includes reference to randomised controlled trials
although I am aware of only one software engineering experiment that comes
anywhere close to a randomised controlled trial in the sense that it undertakes an
experiment in a real-life situation [11]. In this study, Jørgensen and Carelius requested
a bid for a real project from a large number of commercial software companies in
Norway. Companies were selected using stratified random sampling. Once the full
sample was obtained, companies were randomly assigned to two groups. One group
of companies were involved in pre-study phase and the bidding phase, the other
companies were only involved in the bidding phase. The treatment in this case, was
the pre-study activity, which involved companies providing an initial non-binding
preliminary bid. One aspect that is not consistent with an RCT is that companies were
paid for their time in order to compensate them for providing additional information
to the experimenters. In addition, the study was not aimed at formal hypothesis
testing, so the outcome was a possible explanatory theory rather than a statement of
expected treatment effect.
Normally, primary study hierarchies are used to set a minimum requirement on the
type of study included in the systematic review. In software engineering, we will
usually accept all levels of evidence. The only threshold that might be viable would
be to exclude level 5 evidence when there are a reasonable number of primary studies
at a greater level (where a reasonable number must be decided by the researchers, but
should be more than 2).
Categorising evidence into hierarchies does not by itself solve the problem of how to
accumulate evidence from studies in different categories. We discuss some fairly
simple ideas for presenting evidence in Section 5.5.4, but we may need to identify
new methods of accumulating evidence from different types of study. For example,
Hardman and Ayton discuss a system that allows the accumulation of qualitative as well
as quantitative evidence in the form of arguments for or against a proposition [13].
In addition, we need to better understand the strength of evidence from different types
of study. However, this is difficult. For example, there is no agreement among medical
practitioners about the extent to which results from observational studies can really be
trusted. Some medical researchers are critical of the reliance on RCTs and report
cases where observational studies produced almost identical results to RCTs [8].
Concato and Horowitz suggest that improvements in reporting clinical conditions (i.e.
collecting more information about individual patients and the reasons for assigning
the patient to a particular treatment) would make observational studies as reliable as
RCTs [7]. In contrast, Lawlor et al. discuss an example where the results of an RCT
showed that observational studies were incorrect [14]. Specifically, beneficial effects of
vitamins in giving protection against heart disease found in two observational studies
could not be detected in a randomised controlled trial. They suggest that better
identification and adjustment for possible confounding factors would improve the
reliability of observational studies. In addition, Vandenbroucke suggests that
observational studies are appropriate for detecting negative side-effects, but not
positive side-effects of treatments [17].
Observational studies and experiments in software engineering often have more in
common with studies in the social sciences than medicine. For example, both social
science and software engineering struggle with the problems both of defining and
measuring constructs of interest, and of understanding the impact of experimental
context on study results. From the viewpoint of social science, Shadish et al. provide a
useful discussion of the study design and analysis methods that can improve the
validity of experiments and quasi-experiments [16]. They emphasise the importance
of identifying and either measuring or controlling confounding factors. They also
discuss threats to validity across all elements of a study i.e. subjects, treatments,
observations and settings.
5.3.2 Development of Quality Instruments
Once the primary studies have been selected, a more detailed quality assessment needs
to be made. This allows researchers to assess differences in the execution of studies
within design categories. This information is important for data synthesis and
interpretation of results. Detailed quality assessments are usually based on “quality
instruments”, which are checklists of factors that need to be assessed for each study. If
quality items within a checklist are assigned numerical scales, numerical assessments
of quality can be obtained.
Checklists are usually derived from a consideration of factors that could bias study
results. The CRD Guidelines [12], the Australian National Health and Medical
Research Council Guidelines [1], and the Cochrane Reviewers’ Handbook [4] all refer
to four types of bias shown in Table 7. (I have amended the definitions (slightly) and
protection mechanisms (considerably) to address software engineering rather than
medicine.) In particular, medical researchers rely on “blinding” subjects and
experimenters (i.e. making sure that neither the subject nor the researcher knows
which treatment a subject is assigned to) to address performance and measurement
bias. However, that protocol is often impossible for software engineering experiments.
Table 7 Types of Bias
Selection bias (synonym: allocation bias): A systematic difference between comparison
groups with respect to their baseline characteristics. Protection mechanism:
randomisation of a large number of subjects, with concealment of the allocation
method (e.g. allocation by computer program, not experimenter choice).
Performance bias: A systematic difference in the conduct of the comparison groups
apart from the treatment being evaluated. Protection mechanism: replication of the
studies using different experimenters; use of experimenters with no personal
interest in either treatment.
Measurement bias (synonym: detection bias): A systematic difference between the
groups in how outcomes are ascertained. Protection mechanism: blinding outcome
assessors to the treatments is sometimes possible.
Attrition bias (synonym: exclusion bias): Systematic differences between comparison
groups in terms of withdrawals or exclusions of participants from the study.
Protection mechanism: reporting of the reasons for all withdrawals; sensitivity
analysis including all excluded participants.
The factors identified in Table 7 are refined into a quality instrument by considering:
Generic items that relate to features of particular study designs such as lack of
appropriate blinding, unreliable measurement techniques, inappropriate selection
of subjects, and inappropriate statistical analysis.
Specific items that relate to the review’s subject area such as use of outcome
measures inappropriate for answering the research question.
More detailed discussion of bias (or threats to validity) from the viewpoint of the
social sciences rather than medicine can be found in Shadish et al. [16].
Examples of generic quality criteria for several types of study design are shown in
Table 8. The items were derived from lists in [2] and [12].
If required, researchers may construct a measurement scale for each item. Whatever
form the quality instrument takes, it should be assessed for reliability and usability in
a pilot project before being applied to all the selected studies.
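As a sketch of how a quality instrument might be represented and scored, the following hypothetical example assigns yes/partial/no scores to checklist items of the kind shown in Table 8 (the item wording, scoring scheme and all names are illustrative, not prescribed by the guidelines):

```python
# Illustrative only: a minimal quality checklist where each item is scored
# yes = 1, partial = 0.5, no = 0, turning the checklist into a numerical
# quality assessment. Item texts paraphrase the cohort-study criteria.
COHORT_CHECKLIST = [
    "Subjects for the intervention chosen appropriately?",
    "Comparison/control group selected appropriately?",
    "Drop-out rates similar across groups?",
    "Confounding variables controlled in design or analysis?",
    "Outcome measurement blinded/comparable across groups?",
]

SCORES = {"yes": 1.0, "partial": 0.5, "no": 0.0}

def quality_score(answers):
    """Sum item scores and normalise to the range [0, 1]."""
    return sum(SCORES[a] for a in answers) / len(answers)

study_answers = ["yes", "yes", "partial", "no", "yes"]
score = quality_score(study_answers)  # 3.5 / 5 = 0.7
```

Any such scale should, as noted above, be assessed for reliability and usability in a pilot before being applied to all selected studies.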
5.3.3 Using the Quality Instrument
Quality appraisal of each primary study allows researchers to group studies by quality
prior to any synthesis of results. Researchers can then investigate whether there are
systematic differences between primary studies in different quality groups.
Some researchers have suggested weighting results using quality scores. This idea is
not recommended by any of the medical guidelines.
5.3.4 Limitations of Quality Assessment
Primary studies are often poorly reported, so it may not be possible to determine how
to assess a quality criterion. It is tempting to assume that because something was not
reported, it was not done, but this assumption may be incorrect. Researchers should
attempt to obtain more information from the authors of the study.
Table 8 Example of Quality Criteria
Study type Quality criteria
Cohort studies How were subjects chosen for the new intervention?
How were subjects selected for the comparison or control?
Were drop-out rates and reasons for drop-out similar across intervention and
unexposed groups?
Does the study adequately control for demographic characteristics, and other
potential confounding variables in the design or analysis?
Was the measurement of outcomes unbiased (i.e. blinded to treatment group and
comparable across groups)?
Were there exclusions from the analysis?
Case-control studies How were cases defined and selected?
How were controls defined and selected? (I.e. were they randomly selected from
the source population of the cases)
How comparable are the cases and the controls with respect to potential
confounding factors?
Does the study adequately control for demographic characteristics, and other
potential confounding variables in the design or analysis?
Was measurement of the exposure to the factor of interest adequate and kept
blinded to the case/control status?
Were all selected subjects included in the analysis?
Were interventions and other exposures assessed in the same way for cases and
controls?
Was an appropriate statistical analysis used (i.e. matched or unmatched)?
Case series Is the study based on a representative sample from a relevant population?
Are criteria for inclusion explicit?
Were outcomes assessed using objective criteria?
There is limited evidence of relationships between factors that are thought to affect
validity and actual study outcomes. Evidence suggests that inadequate concealment of
allocation and lack of double-blinding result in over-estimates of treatment effects,
but the impact of other quality factors is not supported by empirical evidence.
It is possible to identify inadequate or inappropriate statistical analysis, but without
access to the original data it is not possible to correct the analysis. Very often software
data is confidential and cannot therefore be made available to researchers. In some
cases, software engineers may refuse to make their data available to other researchers
because they want to continue publishing analyses of the data.
5.4 Data Extraction
The objective of this stage is to design data extraction forms to accurately record the
information researchers obtain from the primary studies. To reduce the opportunity
for bias, data extraction forms should be defined and piloted when the study protocol
is defined.
5.4.1 Design of Data Extraction Forms
The data extraction forms must be designed to collect all the information needed to
address the review questions and the study quality criteria. They must also collect all
data items specified in the review synthesis strategy section of the protocol.
In most cases, data extraction will define a set of numerical values that should be
extracted for each study (e.g. number of subjects, treatment effect, confidence
intervals, etc.). Numerical data are important for any attempt to summarise the results
of a set of primary studies and are a prerequisite for meta-analysis (i.e. statistical
techniques aimed at integrating the results of the primary studies).
Data extraction forms need to be piloted on a sample of primary studies. If several
researchers will use the forms, several researchers should take part in the pilot. The
pilot studies are intended to assess both technical issues such as the completeness of
the forms and usability issues such as the clarity of user instructions and the ordering
of questions.
Electronic forms are useful and can facilitate subsequent analysis.
5.4.2 Contents of Data Collection Forms
In addition to including all the questions needed to answer the review question and
quality evaluation criteria, data collection forms should provide standard information
including:
Name of review
Date of data extraction
Title, authors, journal, publication details
Space for additional notes
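One possible realisation of an electronic extraction form is a simple record with a fixed set of fields written to a CSV file for later analysis; the field names below are illustrative, not prescribed:

```python
import csv
import io

# Hypothetical field set: the standard information listed above plus
# review-specific data items (sample size, effect size, ...).
FIELDS = ["review_name", "extraction_date", "title", "authors",
          "publication_details", "sample_size", "effect_size", "notes"]

def new_record(**values):
    """Create a record with every field present, defaulting to empty."""
    unknown = set(values) - set(FIELDS)
    if unknown:
        raise ValueError(f"unexpected fields: {unknown}")
    return {f: values.get(f, "") for f in FIELDS}

record = new_record(review_name="Example review",
                    extraction_date="2004-07-01",
                    title="A primary study",
                    sample_size="24")

# Writing records to CSV facilitates the subsequent analysis mentioned above
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(record)
```

Keeping every field present (even if empty) makes gaps in extraction visible when records from different researchers are compared.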
5.4.3 Data extraction procedures
Whenever feasible, data extraction should be performed independently by two or
more researchers. Data from the researchers must be compared and disagreements
resolved either by consensus among researchers or arbitration by an additional
independent researcher. Uncertainties about any primary sources for which agreement
cannot be reached should be investigated as part of any sensitivity analyses. A
separate form must be used to mark and correct errors or disagreements.
If several researchers each review different primary studies because time or resource
constraints prevent all primary papers being assessed by at least two researchers, it is
important to employ some method of checking that researchers extract data in
a consistent manner. For example, some papers should be reviewed by all researchers
(e.g. a random sample of primary studies), so that inter-researcher consistency can be
assessed.
For single researchers such as PhD students, other checking techniques must be used;
for example, supervisors should be asked to perform data extraction on a random
sample of the primary studies and their results cross-checked with those of the
student.
5.4.4 Multiple publications of the same data
It is important to avoid including multiple publications of the same data in a
systematic review synthesis because duplicate reports would seriously bias any
results. It may be necessary to contact the authors to confirm whether or not reports
refer to the same study. When there are duplicate publications, the most recent should
be used.
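A minimal sketch of this rule, assuming each report has been tagged with an identifier for its underlying study (identifiers and records are invented; in practice the grouping may require contacting the authors):

```python
# Keep only the most recent report per underlying study, so duplicate
# publications of the same data do not bias the synthesis.
reports = [
    {"study_id": "S1", "year": 2001, "title": "Early report"},
    {"study_id": "S1", "year": 2003, "title": "Extended journal version"},
    {"study_id": "S2", "year": 2002, "title": "Only report"},
]

latest = {}
for r in reports:
    if r["study_id"] not in latest or r["year"] > latest[r["study_id"]]["year"]:
        latest[r["study_id"]] = r

selected = sorted(latest.values(), key=lambda r: r["study_id"])
```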
5.4.5 Unpublished data, missing data and data requiring manipulation
If information is available from studies in progress, it should be included providing
appropriate quality information about the study can be obtained and written
permission is available from the researchers.
Reports do not always include all relevant data. They may also be poorly written and
ambiguous. Again the authors should be contacted to obtain the required information.
Sometimes primary studies do not report all the required data directly, but it is possible
to recreate the required data by manipulating the published data. If any such
manipulations are required, the data should first be reported in the form in which they
were originally published. Data obtained by manipulation should be subject to
sensitivity analysis.
5.5 Data Synthesis
Data synthesis involves collating and summarising the results of the included primary
studies. Synthesis can be descriptive (non-quantitative). However, it is sometimes
possible to complement a descriptive synthesis with a quantitative summary. Using
statistical techniques to obtain a quantitative synthesis is referred to as meta-analysis.
Description of meta-analysis methods is beyond the scope of this document, although
techniques for displaying quantitative results will be described. (To learn more about
meta-analysis see [4].)
The data synthesis activities should be specified in the review protocol. However,
some issues cannot be resolved until the data are actually analysed; for example, subset
analysis to investigate heterogeneity is not required if the results show no evidence of
heterogeneity.
5.5.1 Descriptive synthesis
Extracted information about the studies (i.e. intervention, population, context, sample
sizes, outcomes, study quality) should be tabulated in a manner consistent with the
review question. Tables should be structured to highlight similarities and differences
between study outcomes.
It is important to identify whether results from studies are consistent with one another
(i.e. homogeneous) or inconsistent (i.e. heterogeneous). Results may be tabulated to
display the impact of potential sources of heterogeneity, e.g. study type, study quality,
and sample size.
Quantitative data should also be presented in tabular form, including:
Sample size for each intervention.
Estimated effect size for each intervention, with standard errors for each effect.
Difference between the mean values for each intervention, and the confidence
interval for the difference.
Units used for measuring the effect.
5.5.2 Quantitative Synthesis
To synthesise quantitative results from different studies, study outcomes must be
presented in a comparable way. Medical guidelines suggest different effect measures
for different types of outcome.
Binary outcomes (Yes/No, Success/Failure) can be measured in several different
ways:
Odds. The ratio of the number of subjects in a group with an event to the number
without an event. Thus, if 20 projects in a group of 100 projects failed to achieve
budgetary targets, the odds would be 20/80 or 0.25.
Risk (proportion, probability, rate). The proportion of subjects in a group observed
to have an event. Thus, if 20 out of 100 projects failed to achieve budgetary
targets, the risk would be 20/100 or 0.20.
Odds ratio (OR). The ratio of the odds of an event in the experimental (or
intervention) group to the odds of an event in the control group. An OR equal to
one indicates no difference between the control and the intervention group. For
an undesirable outcome, a value less than one indicates that the intervention was
successful in reducing risk; for a desirable outcome, a value greater than one
indicates that the intervention was successful.
Relative risk (RR) (risk ratio, rate ratio). The ratio of risk in the intervention
group to the risk in the control group. An RR of one indicates no difference
between comparison groups. For undesirable events an RR less than one indicates
the intervention was successful, for desirable events an RR greater than one
indicates the intervention was successful.
Absolute risk reduction (ARR) (risk difference, rate difference). The absolute
difference in the event rate between the comparison groups. A difference of zero
indicates no difference between the groups. For an undesirable outcome an ARR
less than zero indicates a successful intervention, for a desirable outcome an ARR
greater than zero indicates a successful intervention.
Each of these measures has advantages and disadvantages. For example, odds and
odds ratios are criticised for not being well understood by non-statisticians (other than
gamblers), whereas risk measures are generally easier to understand. However,
statisticians prefer odds ratios because they have some mathematically desirable
properties. Another issue is that relative measures are generally more consistent than
absolute measures for statistical analysis, but decision makers need absolute values in
order to assess the real benefit of an intervention.
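The definitions above can be sketched in code. The calculation extends the worked example of 20 failed projects out of 100; the control-group figures (40 out of 100) are invented for illustration:

```python
def binary_effect_measures(events_t, n_t, events_c, n_c):
    """Odds, risk, OR, RR and ARR for a treated and a control group."""
    odds_t = events_t / (n_t - events_t)   # e.g. 20/80 = 0.25
    odds_c = events_c / (n_c - events_c)
    risk_t = events_t / n_t                # e.g. 20/100 = 0.20
    risk_c = events_c / n_c
    return {
        "odds_treated": odds_t,
        "risk_treated": risk_t,
        "odds_ratio": odds_t / odds_c,
        "relative_risk": risk_t / risk_c,
        # Signed intervention-minus-control difference: for an undesirable
        # outcome, a negative ARR indicates a successful intervention.
        "absolute_risk_reduction": risk_t - risk_c,
    }

# 20/100 intervention projects missed budget targets vs 40/100 controls
m = binary_effect_measures(20, 100, 40, 100)
```

Here the OR (0.375) and RR (0.5) are both below one and the ARR is negative, all indicating that the hypothetical intervention reduced the undesirable outcome.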
Effect measures for continuous data include:
Mean difference. The difference between the means of each group (control and
intervention group).
Weighted mean difference (WMD). When studies have measured the difference
on the same scale, the weight given to each study is usually the inverse of the
variance of the study's estimated effect.
Standardised mean difference (SMD). A common problem when summarising
outcomes is that outcomes are often measured in different ways, for example,
productivity might be measured in function points per hour, or lines of code per
day. Quality might be measured as the probability of exhibiting one or more faults
or the number of faults observed. When studies use different scales, the mean
difference may be divided by an estimate of the within-groups standard deviation
to produce a standardised value without any units. However, SMDs are only valid
if differences in the standard deviations reflect differences in the measurement
scale, not real differences among trial populations.
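A sketch of the mean difference and standardised mean difference calculations, using invented productivity scores; the pooled within-groups standard deviation is used as the standardising estimate (one common choice among several):

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def pooled_sd(xs, ys):
    """Pooled within-groups standard deviation (unbiased sample variances)."""
    def var(v):
        m = mean(v)
        return sum((x - m) ** 2 for x in v) / (len(v) - 1)
    n1, n2 = len(xs), len(ys)
    return math.sqrt(((n1 - 1) * var(xs) + (n2 - 1) * var(ys)) / (n1 + n2 - 2))

def standardised_mean_difference(treatment, control):
    """Mean difference divided by the pooled within-groups SD (unitless)."""
    return (mean(treatment) - mean(control)) / pooled_sd(treatment, control)

# Hypothetical productivity scores measured on a common scale
treat = [12.0, 14.0, 13.0, 15.0]
ctrl = [10.0, 11.0, 9.0, 12.0]
md = mean(treat) - mean(ctrl)                      # raw mean difference
smd = standardised_mean_difference(treat, ctrl)    # unitless effect size
```

Because the SMD is unitless, it allows studies reporting, say, function points per hour and lines of code per day to be placed on a common scale, subject to the caveat noted above.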
5.5.3 Presentation of Quantitative Results
The most common mechanism for presenting quantitative results is a forest plot, as
shown in Figure 1. A forest plot presents the means and variance for the difference for
each study. The line represents the standard error of the difference, the box represents
the mean difference and its size is proportional to the number of subjects in the study.
A forest plot may also be annotated with the numerical information indicating the
number of subjects in each group, the mean difference and the confidence interval on
the mean. If a formal meta-analysis is undertaken, the bottom entry in a forest plot
will be the summary estimate of the treatment difference and confidence interval for
the summary difference.
Figure 1 represents the ideal result of a quantitative summary: the results of the
studies basically agree. There is clearly a genuine treatment effect, and a single overall
summary statistic would be a good estimate of that effect. If effects were very
different from study to study, the results would suggest heterogeneity, and a single
overall summary statistic would probably be of little value. The systematic review
should then continue with an investigation of the reasons for heterogeneity. To avoid
the problems of post-hoc analysis, researchers should identify possible sources of
heterogeneity when they construct the review protocol.
Figure 1 Example of a forest plot (three studies plotted against a horizontal axis of
treatment difference from -0.2 to 0.2; points to the left of zero favour the control,
points to the right favour the intervention)
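If a formal meta-analysis is performed, the summary entry at the bottom of the forest plot is typically obtained by inverse-variance (fixed-effect) pooling. A minimal sketch with invented study data follows; under heterogeneity a random-effects model would be needed instead:

```python
import math

def fixed_effect_summary(effects, std_errors):
    """Inverse-variance weighted summary effect with its 95% CI."""
    weights = [1.0 / se ** 2 for se in std_errors]   # weight = 1 / variance
    total = sum(weights)
    summary = sum(w * e for w, e in zip(weights, effects)) / total
    se_summary = math.sqrt(1.0 / total)
    return summary, (summary - 1.96 * se_summary, summary + 1.96 * se_summary)

# Invented mean differences and standard errors for three studies
effects = [0.10, 0.05, 0.08]
ses = [0.05, 0.04, 0.08]
summary, ci = fixed_effect_summary(effects, ses)
```

Precise studies (small standard errors) dominate the summary, which mirrors the forest-plot convention of drawing larger boxes for larger studies.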
5.5.4 Sensitivity analysis
Sensitivity analysis is much more important when a full meta-analysis is performed
than when no formal meta-analysis is performed. Meta-analysis is used to provide an
overall estimate of the treatment effect and its variability. In such cases, the results of
the analysis should be repeated on various subsets of primary studies to determine
whether the results are robust. The types of subsets selected would be:
High quality primary studies only.
Primary studies of particular types.
Primary studies for which data extraction presented no difficulties (i.e. excluding
any studies where there was some residual disagreement about the data extracted).
When a formal meta-analysis is not undertaken, forest plots can be annotated to
identify high quality primary studies, the studies can be presented in decreasing order
of quality or in decreasing study type hierarchy order. Primary studies where there are
queries about the data extracted can also be explicitly identified on the forest plot,
for example by using grey colouring for less reliable studies and black colouring for
reliable studies.
5.5.5 Publication bias
Funnel plots are used to assess whether or not a systematic review is likely to be
vulnerable to publication bias. Funnel plots plot the treatment effect (i.e. mean
difference between intervention group and control) against the inverse of the variance
or the sample size. A systematic review that exhibited the funnel shape shown in
Figure 2 would be assumed not to be exhibiting evidence of publication bias. It would
be consistent with studies based on small samples showing more variability in
outcome than studies based on large samples. If, however, the points shown as filled-
in black dots were not present, the plot would be asymmetric and it would suggest the
presence of publication bias. This would suggest that the results of the systematic
review must be treated with caution.
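As a rough illustration (not a formal method), funnel-plot coordinates can be built as (effect, precision) pairs, and a crude indication of asymmetry obtained by comparing the mean effect of the smallest and largest studies; in practice a formal procedure such as Egger's regression test would be used. All data below are invented:

```python
def funnel_points(effects, std_errors):
    """(effect, precision) pairs, where precision = 1 / standard error."""
    return [(e, 1.0 / se) for e, se in zip(effects, std_errors)]

def crude_asymmetry(effects, std_errors):
    """Mean effect of the least precise half minus that of the most precise
    half. A clearly positive value hints that small studies report larger
    effects, as expected under publication bias. Heuristic only."""
    pts = sorted(funnel_points(effects, std_errors), key=lambda p: p[1])
    half = len(pts) // 2
    small = [e for e, _ in pts[:half]]              # least precise studies
    large = [e for e, _ in pts[len(pts) - half:]]   # most precise studies
    return sum(small) / len(small) - sum(large) / len(large)

# Invented effects/standard errors: small studies show inflated effects
effects = [0.30, 0.25, 0.12, 0.10, 0.09, 0.11]
ses     = [0.20, 0.18, 0.05, 0.04, 0.03, 0.05]
gap = crude_asymmetry(effects, ses)  # positive: suggestive of asymmetry
```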
Figure 2 An example of a funnel plot (treatment effect plotted against sample size or
the inverse of the variance)
6. Reporting the review
It is important to communicate the results of a systematic review effectively. Usually
systematic reviews will be reported in at least two formats:
1. In a technical report or in a section of a PhD thesis.
2. In a journal or conference paper.
A journal or conference paper will normally have a size restriction. In order to ensure
that readers are able to properly evaluate the rigour and validity of a systematic
review, journal papers should reference a technical report or thesis that contains all
the details.
In addition, systematic reviews with important practical results may be summarised in
non-technical articles in practitioner magazines, in press releases and in Web pages.
6.1 Structure for systematic review
The structure and contents of reports suggested in [12] are presented in Table 9. This
structure is appropriate for technical reports and journals. For PhD theses, the entries
marked with an asterisk are not likely to be relevant.
6.2 Peer Review
Journal articles will be peer reviewed as a matter of course. Experts review PhD
theses as part of the examination process. In contrast, technical reports are not usually
subjected to peer review. However, if systematic reviews are made available on the
Web so that results reach researchers and practitioners quickly, it is worth organising
a peer review. If an expert panel were assembled to review the study protocol, the
same panel would be appropriate to undertake peer review of the systematic review
report.
Table 9 Structure and contents of reports of systematic reviews
Section Subsection Scope Comments
Title* The title should be short but informative. It should be based on the question
being asked. In journal papers, it should indicate that the study is a
systematic review.
Authorship* When research is done collaboratively, criteria for determining both who
should be credited as an author, and the order of author’s names should be
defined in advance. The contribution of workers not credited as authors
should be noted in the Acknowledgements section.
The importance of the research
questions addressed by the review
The questions addressed by the
systematic review
Methods Data Sources, Study selection, Quality
Assessment and Data extraction
Results Main finding including any meta-
analysis results and sensitivity
Executive summary
or Structured
Conclusions Implications for practice and future
A structured summary or abstract allows readers to assess quickly the
relevance, quality and generality of a systematic review.
Background Justification of the need for the review.
Summary of previous reviews Description of the software engineering technique being investigated and its
potential importance
Review questions Each review question should be
specified Identify primary and secondary review questions. Note this section may be
included in the background section.
Data sources and search
Study selection
Study quality assessment
Data extraction
Review Methods
Data synthesis
This should be based on the research protocol. Any changes to the original
protocol should be reported.
Included and
excluded studies Inclusion and exclusion criteria
List of excluded studies with rationale
for exclusion
Study inclusion and exclusion criteria can sometimes best be represented as a
flow diagram because studies will be excluded at different stages in the
review for different reasons.
Findings Description of primary studies
Results of any quantitative summaries
Details of any meta-analysis
Sensitivity analysis
Non-quantitative summaries should be provided to summarise each of the
studies and presented in tabular form.
Quantitative summary results should be presented in tables and graphs
Discussion Principal findings These must correspond to the findings discussed in the results section
Strengths and Weaknesses Strength and weaknesses of the
evidence included in the review
Relation to other reviews, particularly
considering any differences in quality
and results.
A discussion of the validity of the evidence considering bias in the
systematic review allows a reader to assess the reliance that may be placed
on the collected evidence.
Meaning of findings Direction and magnitude of effect
observed in summarised studies
Applicability (generalisability) of the
Make clear to what extent the result imply causality by discussing the level
of evidence.
Discuss all benefits, adverse effects and risks.
Discuss variations in effects and their reasons (for example are the treatment
effects larger on larger projects).
Practical implications for software
development What are the implications of the results for practitioners? Conclusions Recommendations
Unanswered questions and implications
for future research
Acknowledgements* All persons who contributed to the
research but did fulfil authorship
Conflict of Interest Any secondary interest on the part of the researchers (e.g. a financial interest
in the technology being evaluated) should be declared.
References and
Appendices Appendices can be used to list studies included and excluded from the study,
to document search strategy details, and to list raw data from the included
7. Final remarks
This report has presented a set of guidelines for planning, conducting and reporting
systematic reviews. The guidelines are based on guidelines used in medical research.
However, it is important to recognise that software engineering research is not the
same as medical research. We do not undertake randomised clinical trials, nor can we
use blinding as a means to reduce distortions due to experimenter and subject
expectations. Thus, software engineering research studies usually provide only weak
evidence compared with RCTs.
We need to consider mechanisms to aggregate evidence from studies of different
types and to understand the extent to which we can rely on such evidence. At present,
these guidelines merely suggest that data from primary studies should be
accompanied by information about the type of primary study and its quality. As yet,
there is no definitive method for accumulating evidence from studies of different
types. Furthermore, there is disagreement among medical researchers about how
much reliance can be placed on evidence from studies other than RCTs. However, the
limited number of primary studies in software engineering implies that it is critical
to consider evidence from all types of primary study, including laboratory/academic
experiments, as well as evidence obtained from experts.
Finally, these guidelines are intended to assist PhD students as well as larger research
groups. However, many of the steps in a systematic review assume that it will be
undertaken by a large group of researchers. In the case of a single researcher (such
as a PhD student), we suggest the most important steps to undertake are:
Developing a protocol.
Defining the research question.
Specifying what will be done to address the problem of a single researcher
applying inclusion/exclusion criteria and undertaking all the data extraction.
Defining the search strategy.
Defining the data to be extracted from each primary study including quality data.
Maintaining lists of included and excluded studies.
Using the data synthesis guidelines.
Using the reporting guidelines.
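The bookkeeping in these steps, maintaining auditable lists of included and excluded studies, can be sketched as below. The study identifiers and rationale strings are hypothetical; the point is that recording an explicit decision and rationale per candidate study lets the required lists fall out automatically.

```python
# Hypothetical sketch of study-selection bookkeeping for a single reviewer:
# every candidate study gets an explicit include/exclude decision with a
# recorded rationale, so the lists of included and excluded studies
# required by the guidelines can be generated directly.

decisions = [
    {"id": "candidate-01", "include": True,  "rationale": "controlled experiment, meets criteria"},
    {"id": "candidate-02", "include": False, "rationale": "position paper, no empirical data"},
    {"id": "candidate-03", "include": True,  "rationale": "industrial case study, meets criteria"},
    {"id": "candidate-04", "include": False, "rationale": "duplicate report of candidate-01"},
]

included = [d["id"] for d in decisions if d["include"]]
excluded = [(d["id"], d["rationale"]) for d in decisions if not d["include"]]

print("Included:", included)
print("Excluded with rationale:")
for study_id, why in excluded:
    print(f"  {study_id}: {why}")
```

Keeping the rationale alongside each exclusion also supports the flow-diagram style of reporting suggested for Table 9's "Included and excluded studies" section.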
8. References
[1] Australian National Health and Medical Research Council. How to review the
evidence: systematic identification and review of the scientific literature, 2000.
ISBN 1864960329.
[2] Australian National Health and Medical Research Council. How to use the
evidence: assessment and application of scientific evidence. February 2000,
ISBN 0 642 43295 2.
[3] Berlin, J.A., Miles, C.G., Crigliano, M.D. Does blinding of readers affect the
results of meta-analysis? Online J. Curr. Clin. Trials, 1997: Doc No 205.
[4] Cochrane Collaboration. Cochrane Reviewers' Handbook, Version 4.2.1,
December 2003.
[5] Cochrane Collaboration. The Cochrane Reviewers' Handbook Glossary,
Version 4.1.5, December 2003.
[6] Cohen, J. Weighted kappa: nominal scale agreement with provision for scaled
disagreement or partial credit. Psychol. Bull., 70, 1968, pp. 213-220.
[7] Concato, John and Horowitz, Ralph I. Beyond randomised versus observational
studies. The Lancet, vol. 363, issue 9422, 22 May 2004.
[8] Feinstein, A.R. and Horowitz, R.I. Problems with the "evidence" of "evidence-
based medicine". Am. J. Med., 1997, vol. 103, pp. 529-535.
[9] Greenhalgh, Trisha. How to read a paper: Papers that summarise other papers
(systematic reviews and meta-analyses). BMJ, 315, 1997, pp. 672-675.
[10] Jasperson, Jon (Sean), Butler, Brian S., Carte, Traci A., Croes, Henry J.P.,
Saunders, Carol S. and Zheng, Weijun. Review: Power and Information
Technology Research: A Metatriangulation Review. MIS Quarterly, 26(4): 397-
459, December 2002.
[11] Jørgensen, Magne and Carelius, Gunnar J. An Empirical Study of Software
Project Bidding. Submitted to IEEE TSE, 2004 (major revision required).
[12] Khan, Khalid S., ter Riet, Gerben, Glanville, Julia, Sowden, Amanda J. and
Kleijnen, Jos (eds). Undertaking Systematic Reviews of Research on
Effectiveness. CRD's Guidance for those Carrying Out or Commissioning
Reviews. CRD Report Number 4 (2nd Edition), NHS Centre for Reviews and
Dissemination, University of York, ISBN 1 900640 20 1, March 2001.
[13] Hardman, David K. and Ayton, Peter. Arguments for qualitative risk
assessment: the StAR risk advisor. Expert Systems, vol. 14, no. 1, 1997, pp. 24-
[14] Lawlor, Debbie A., Davey Smith, George, Bruckdorfer, K. Richard, Kundu,
Devi and Ebrahim, Shah. Those confounded vitamins: what can we learn from the
differences between observational versus randomised trial evidence? The
Lancet, vol. 363, issue 9422, 22 May 2004.
[15] Pai, Madhukar, McCulloch, Michael and Colford, Jack. Systematic Review: A
Road Map, Version 2.2. Systematic Reviews Group, UC Berkeley, 2002.
[V2.2.pdf, viewed 20 June 2004].
[16] Shadish, W.R., Cook, Thomas D. and Campbell, Donald T. Experimental and
Quasi-experimental Designs for Generalized Causal Inference. Houghton
Mifflin Company, 2002.
[17] Vandenbroucke, Jan P. When are observational studies as credible as
randomised trials? The Lancet, vol. 363, issue 9422, 22 May 2004.
Appendix 1 Steps in a systematic review
Guidelines for systematic reviews in the medical domain take different views of the
process steps needed in a systematic review. The Systematic Reviews Group (UC
Berkeley) presents a very detailed process model [15]; other sources present a coarser
process. These process steps are summarised in Table 10, which also attempts to
collate the different processes.
Pai et al. [15] have specified the review process steps at a more detailed level of
granularity than the other systematic review guidelines. In particular, they have made
explicit the iterative nature of the start of a systematic review process. This start-up
problem is not discussed in any of the other guidelines. However, it is clear that it is
difficult to determine the review protocol without any idea of the nature of the
research question, and vice versa.
Table 10 Systematic review process proposed in different guidelines

1. Question formulation and protocol development
  Systematic Reviews Group ([15]): Define the question and develop a draft protocol.
  Identify a few relevant studies and do a pilot study; specify inclusion/exclusion
  criteria, test forms and refine the protocol.
  Australian National Health and Medical Research Council ([1]): Question formulation.
  Cochrane Reviewers' Handbook ([4]): Formulating the problem; developing a protocol.
  CRD Guidance ([12]): Identification of the need for a review; preparation of a
  proposal for a systematic review; development of a review protocol.

2. Searching
  Systematic Reviews Group ([15]): Identify appropriate databases and sources. Run
  searches on all relevant databases and sources. Save all citations
  (titles/abstracts) in a reference manager. Document the search strategy.
  Australian National Health and Medical Research Council ([1]): Finding studies.
  Cochrane Reviewers' Handbook ([4]): Locating and selecting studies for review.
  CRD Guidance ([12]): Identification of research.

3. Study selection
  Systematic Reviews Group ([15]): Researchers (at least two) screen titles and
  abstracts. Researchers meet and resolve disagreements. Get full texts of all
  remaining articles. Researchers do a second screen. The articles remaining after
  the second screen are the final set for inclusion.
  CRD Guidance ([12]): Selection of studies.

4. Study quality assessment and data extraction
  Systematic Reviews Group ([15]): Researchers extract data, including quality data.
  Researchers meet to resolve disagreements on data. Compute inter-rater
  reliability. Enter data into database management software.
  Australian National Health and Medical Research Council ([1]): Appraisal and
  selection of studies.
  Cochrane Reviewers' Handbook ([4]): Assessment of study quality; collecting data.
  CRD Guidance ([12]): Study quality assessment; data extraction and monitoring
  progress.

5. Data synthesis
  Systematic Reviews Group ([15]): Import data into meta-analysis software and
  analyse. Pool data if appropriate. Look for heterogeneity.
  Australian National Health and Medical Research Council ([1]): Summary and
  synthesis of relevant studies.
  Cochrane Reviewers' Handbook ([4]): Analysing and presenting results.
  CRD Guidance ([12]): Data synthesis.

6. Interpretation and reporting
  Systematic Reviews Group ([15]): Interpret and present data. Discuss
  generalizability of conclusions and limitations of the review. Make
  recommendations for practice or policy, and for research.
  Australian National Health and Medical Research Council ([1]): Determining the
  applicability of results; reviewing and appraising the economics literature.
  Cochrane Reviewers' Handbook ([4]): Interpreting the results.
  CRD Guidance ([12]): The report and recommendations; getting evidence into
  practice.
... The analytic procedure thus entails finding, selecting, appraising or making sense of and synthesising data contained in documents. For the current study, content analysis was applied for reviewing literature and empirical studies reporting on previous studies in the creation and adoption of makerspaces in academic libraries, following the guidelines advanced by Kitchenham (2004). The review protocol was composed of the following elements: ...
... The inclusion criteria aim to identify studies that provide direct evidence about the research questions (Kitchenham, 2004). The literature review on the creation and adoption of makerspaces in the library context was conducted in major databases such as EBSCOhost, ScienceDirect, Springer, Emerald insight, Scopus, Web of Science and Google Scholar, to ensure inclusion of all relevant studies in content analysis. ...
Full-text available
Libraries of today are not just a place to consult books and other pedagogical materials but have completely transformed into a space where users can interact, create, and collaborate. Library and information centres are creating spaces called makerspaces in this digital transformation era, whereby researchers work together and share ideas in their various areas of specialisation. Makerspace are relatively new phenomena that create a collaborative and innovative environment for individuals to work on projects and learn about emerging technologies. Technology-centred makerspaces are increasingly being built in academic libraries, typically featuring high-tech machines and software that facilitate creation and design. This study investigated the creation and adoption of technology-centred makerspaces in academic libraries and the impact that makerspaces have on academic innovation. The study utilized literature review analyzed secondary data from articles, journals, periodicals, and publications to identify the need to design makerspaces, what is required in setting up a makerspace, and how academic libraries utilize makerspaces. The benefits accrued from makerspaces, barriers to effective adoption of these spaces, factors enabling adoption of makerspaces, and the state-of-the-art facilities offered by the library were also explored in this study. It is recommended that library management should not hesitate to establish makerspaces in their respective academic libraries, as this will aid in promoting knowledge-sharing, collaboration, creativity, and innovation.
... Uma revisão sistemática é um meio de identificar, avaliar e interpretar toda pesquisa disponível e relevante sobre uma questão de pesquisa, um tópico ou um fenômeno de interesse [Kitchenham 2004]. A condução de uma revisão sistemática supostamente apresenta uma avaliação justa do tópico de pesquisa à medida que utiliza uma metodologia de revisão rigorosa, confiável e passível de auditagem. ...
... Caso os resultados dos estudos sejam inconsistentes, as fontes de variação desses resultados podem ser estudadas. Maiores informações sobre revisões sistemáticas podem ser encontradas em [Biolchini et al. 2005] e [Kitchenham 2004]. ...
Conference Paper
A orientação a objetos (OO) alcançou considerável sucesso no desenvolvimento industrial. Entretanto, a condução de atividades de verificação e validação em software OO ainda é um desafio. Acreditamos que a utilização de técnicas de leitura em inspeções de software seja uma alternativa viável para garantir a qualidade do software OO. Em vista disso, o presente artigo descreve os resultados de uma revisão sistemática conduzida com o objetivo de identificar, analisar e avaliar técnicas de leitura aplicáveis na garantia de qualidade no desenvolvimento de sistemas OO, cujos resultados apontam para novos desafios de pesquisa nesta área.
... Foram considerados os termos presentes em qualquer tópico presente no artigo, sendo o título, resumo e ou palavras-chave. Conforme Kitchenham (2004) a revisão sistemática é um meio de identificar, avaliar e interpretar toda a pesquisa disponível relevante para uma questão de pesquisa específica, ou área temática ou fenômeno de interesse. Ainda segunda a autora, a revisão sistemática é indicada para identificar eventuais lacunas na pesquisa atual, a fim de sugerir áreas para futuras investigações. ...
... Buscando evitar vieses de análise por um único examinador, os artigos foram submetidos a análise de diferentes indivíduos. Conforme Kitchenham (2004) sempre que possível, a extração de dados deve ser realizada independentemente por dois ou mais pesquisadores, os dados dos pesquisadores devem ser comparados e os desacordos resolvidos por consenso entre pesquisadores ou por um outro pesquisador independente. ...
O presente artigo tem como objetivo identificar discussões contemporâneas e gaps teóricos na perspectiva contratual da Teoria da Agência. Para tal finalidade, é empregada uma revisão sistemática sobre artigos publicados e revisados por pares, entre os anos 2008 a 2017 que utilizaram a Teoria da Agência e o estudo sobre contratos. Os resultados apresentam de forma sistematizada: as pesquisas; os autores: a data de publicação; um resumo dos trabalhos; o objeto estudado; o tipo da pesquisa; a técnica de análise utilizada; os achados; e as lacunas que precisam de investigação. O trabalho fornece aos pesquisadores um catálogo das pesquisas sobre a temática no período analisado.
... Conforme os autores Zawacki-Richter et al. (2020) e Kitchenham (2004), a revisão sistemática de literatura é composta por um conjunto de pesquisas num tema específico, que resulta num contributo de uma sistematização de achados teóricos que se resumem apresentar evidências de uma temática, com identificação de possíveis lacunas e apresentação de sugestões para futuras investigações. Num universo de 80 mil artigos na temática de jogo sérios e segurança web, aplicada ao segmento sénior, a investigação focou-se em estudos que foram essenciais para o seu desenvolvimento, tanto teórico como prático. ...
Full-text available
Estudo de um caso orientado para públicos seniores The serious game Web Segura development: a case study for senior audiences Resumo O retrato de uma sociedade digital passa por esta reger o seu quotidiano conforme o acesso à Internet, em casa, no emprego e na vida social. Mas os seniores não sentem essa necessidade, apesar de esta situação se estar a alterar, uma vez que tudo ao seu redor se gere pelo digital. Estes aventuram-se pela navegação web sem consciência dos perigos que detém, como o roubo de dados pessoais, notícias falsas ou compras online fraudulentas. Assim, a investigação promove um jogo sério que expõe essas situações digitalmente inseguras e, através de desafios, um grupo de seniores da rede de Universidades Seniores altera os seus comportamentos. Estes jogam a Web Segura, um jogo educacional online, desenvolvido na plataforma WordPress com desafios do plugin H5P.
... The research method for this survey follows the formal systematic literature review methodology. In particular, this study is based on the guidelines proposed in [27] and [28]. As detailed below, we also took into account other surveys about related topics, such as text classification and embeddings. ...
Full-text available
Text classification results can be hindered when just the bag-of-words model is used for representing features, because it ignores word order and senses, which can vary with the context. Embeddings have recently emerged as a means to circumvent these limitations, allowing considerable performance gains. However, determining the best combinations of classification techniques and embeddings for classifying particular corpora can be challenging. This survey provides a comprehensive review of text classification approaches that employ embeddings. First, it analyzes past and recent advancements in feature representation for text classification. Then, it identifies the combinations of embedding-based feature representations and classification techniques that have provided the best performances for classifying text from distinct corpora, also providing links to the original articles, source code (when available) and data sets used in the performance evaluation. Finally, it discusses current challenges and promising directions for text classification research, such as cost-effectiveness, multi-label classification, and the potential of knowledge graphs and knowledge embeddings to enhance text classification.
... Finding answers to the questions asked requires a review of the literature on the subject. The systematic literature review was implemented according to the guidelines presented in [15]. ...
Malicious attacks are one of the main threats facing today's most used Android and Windows operating systems, as well as the Internet of Things (IoT) and web environments. Markov models and hidden Markov models have been used successfully over the past few decades to identify a variety of malicious activity, including as viruses, worms, Trojan horses, rootkits, ransomware, and phishing assaults. But they have their limits. One of their main limitations is that they are unable to detect subtle changes in malicious behaviour. This paper presents Markov models and hidden Markov models as a tool for detecting malicious attacks and briefly reviews different studies from the past five years that use these models as a detection tool. This review, based on publications drawn from three databases, outlines the continuing interest of security researchers in these models. Most of the chosen research papers show that these models are applied to create systems that have a detection accuracy of malicious attacks above 94%. This study can be helpful to beginners who are interested in starting their research in the field of detecting malicious attacks.
... The systematic literature review method is based on the methodology proposed by Barbara Kitchenham [24]. Figure 6 shows the systematic review process used in this work, which consists of three phases: ...
Full-text available
Automatic image description, also known as image captioning, aims to describe the elements included in an image and their relationships. This task involves two research fields: computer vision and natural language processing; thus, it has received much attention in computer science. In this review paper, we follow the Kitchenham review methodology to present the most relevant approaches to image description methodologies based on deep learning. We focused on works using convolutional neural networks (CNN) to extract the characteristics of images and recurrent neural networks (RNN) for automatic sentence generation. As a result, 53 research articles using the encoder-decoder approach were selected, focusing only on supervised learning. The main contributions of this systematic review are: (i) to describe the most relevant image description papers implementing an encoder-decoder approach from 2014 to 2022 and (ii) to determine the main architectures, datasets, and metrics that have been applied to image description.
... This paper will comprehensively present the methods and algorithms in the PHM for IMs to solve practical engineering problems by comparing their pros and cons. Fig. 2 Application of prognostics and health management framework to induction machines A systematic literature review (SLR) is performed to synthesize the application of PHM to IMs by gathering related papers systematically (Kitchenham, 2004;Reis et al, 2022). SLR is a formal method to identify scientific evidence about a research topic, which can review state-of-the-art PHM methods applied to IMs unbiasedly. ...
Full-text available
Induction machines (IMs) are utilized in different industrial sectors such as manufacturing, transportation, transmission, and energy due to their ruggedness, low cost, and high efficiency. If IMs fail without advanced warning, unscheduled maintenance needs to be performed, leading to downtime and maintenance costs for asset owners. To avoid these, conducting prognostics and health management (PHM) for IMs is indispensable. There are different PHM methods (expert knowledge, physics-based, and machine learning) to analyze the health and estimate the remaining useful life (RUL) of IMs. It is essential to select appropriate methods and algorithms to solve practical engineering problems by comparing their pros and cons. This paper will systematically summarize the application of the PHM framework to IMs and comprehensively present how to select appropriate general methods as well as specific algorithms Springer Nature 2021 L A T E X template 3 applied in the PHM for IMs to solve practical engineering problems, aiming to provide some guidance for future researchers and practitioners.
Context: In today’s health care, multi-modal image registration increasingly important role in medical analysis and diagnostics. Multi-modal image registration is a challenging task because of the different imaging conditions that changes from one imaging modality to another.Objective: The purpose of this work is to determine the current state of the art in the field of medical image registration shedding light on techniques that have been used to register medical image combinations from different modalities and the importance of combining different modalities in automatic way in the medical domain.Method: To fulfill this objective we chose a Systematic Literature Review (SLR) as method to follow. Which allows to collect and structure the information that exists in the field of multi-modal image registration.Results: Several automatic solutions based on different registration techniques were proposed according to each specific modality combination.Conclusion: The results provide the following conclusions: First, the machine learning in the recent years plays an important role in the automatic registration process. An important number of research propose a learning-based registration solution. Second, There few solutions in literature that tackle the automatic registration of histology - CT modality combination. Finally, the existing research work propose registration solutions for only combination of two modalities. A very few number of work suggest a tri-modality combining.KeywordsSystematic Literature ReviewSLRMedical image registrationMulti-modal imageRegistration
Full-text available
During the last years, a number of studies have experimented with applying process mining (PM) techniques to smart spaces data. The general goal has been to automatically model human routines as if they were business processes. However, applying process-oriented techniques to smart spaces data comes with its own set of challenges. This paper surveys existing approaches that apply PM to smart spaces and analyses how they deal with the following challenges identified in the literature: choosing a modelling formalism for human behaviour; bridging the abstraction gap between sensor and event logs; and segmenting logs in traces. The added value of this article lies in providing the research community with a common ground for some important challenges that exist in this field and their respective solutions, and to assist further research efforts by outlining opportunities for future work.
Full-text available
Introduction Remember the essays you used to write as a student? You would browse through the indexes of books and journals until you came across a paragraph that looked relevant, and copied it out. If anything you found did not fit in with the theory you were proposing, you left it out. This, more or less, constitutes the methodology of the journalistic review—an overview of primary studies which have not been identified or analysed in a systematic (standardised and objective) way. Summary points A systematic review is an overview of primary studies that used explicit and reproducible methods A meta-analysis is a mathematical synthesis of the results of two or more primary studies that addressed the same hypothesis in the same way Although meta-analysis can increase the precision of a result, it is important to ensure that the methods used for the review were valid and reliable In contrast, a systematic review is an overview of primary studies which contains an explicit statement of objectives, materials, and methods and has been conducted according to explicit and reproducible methodology (fig 1). View larger version:In a new windowDownload as PowerPoint SlideFig 1 Methodology for a systematic review of randomised controlled trials1 Some advantages of the systematic review are given in box. When a systematic review is undertaken, not only must the search for relevant articles be thorough and objective, but the criteria used to reject articles as “flawed” must be explicit and independent of the results of those trials. 
The most enduring and useful systematic reviews, notably those undertaken by the Cochrane Collaboration, are regularly updated to incorporate new evidence.2 Box 1: Advantages of systematic reviews3 Explicit methods limit bias in identifying and rejecting studiesConclusions are more reliable and accurate because of methods usedLarge amounts of information can be assimilated quickly by healthcare providers, researchers, and policymakersDelay between research discoveries and implementation of effective diagnostic and therapeutic strategies may be reducedResults of different studies can be formally compared to establish generalisability of findings and consistency (lack of heterogeneity) of resultsReasons for heterogeneity (inconsistency in results across studies) can be identified and new hypotheses generated about particular subgroupsQuantitative systematic reviews (meta-analyses) increase the precision of the overall result RETURN TO TEXT Many, if not most, medical review articles are still written in narrative or journalistic form. Professor Paul Knipschild has described how Nobel prize winning biochemist Linus Pauling used selective quotes from the medical literature to “prove” his theory that vitamin C helps you live longer and feel better.3 4 When Knipschild and his colleagues searched the literature systematically for evidence for and against this hypothesis they found that, although one or two trials did strongly suggest that vitamin C could prevent the onset of the common cold, there were far more studies which did not show any beneficial effect. 
Experts, who have been steeped in a subject for years and know what the answer “ought” to be, are less able to produce an objective review of the literature in their subject than non-experts.5 6 This would be of little consequence if experts' opinions could be relied on to be congruent with the results of independent systematic reviews, but they cannot.7 Evaluating systematic reviews Question 1: Can you find an important clinical question which the review addressed? The question addressed by a systematic review needs to be defined very precisely, since the reviewer must make a dichotomous (yes/no) decision as to whether each potentially relevant paper will be included or, alternatively, rejected as “irrelevant.” Thus, for example, the clinical question “Do anticoagulants prevent strokes in patients with atrial fibrillation?” should be refined as an objective: “To assess the effectiveness and safety of warfarin-type anticoagulant therapy in secondary prevention (that is, following a previous stroke or transient ischaemic attack) in patients with non-rheumatic atrial fibrillation: comparison with placebo.”8 Question 2: Was a thorough search done of the appropriate databases and were other potentially important sources explored? Even the best Medline search will miss important papers, for which the reviewer must approach other sources.9 Looking up references of references often yields useful articles not identified in the initial search,10 and an exploration of “grey literature” (box) may be particularly important for subjects outside the medical mainstream, such as physiotherapy or alternative medicine.11 Finally, particularly where a statistical synthesis of results (meta-analysis) is contemplated, it may be necessary to write and ask the authors of the primary studies for raw data on individual patients which was never included in the published review. 
Box 2: Checklist of data sources for a systematic review

- Medline database
- Cochrane controlled clinical trials register
- Other medical and paramedical databases
- Foreign language literature
- “Grey literature” (theses, internal reports, non-peer reviewed journals, pharmaceutical industry files)
- References (and references of references, etc) listed in primary sources
- Other unpublished sources known to experts in the field (seek by personal communication)
- Raw data from published trials (seek by personal communication)

Question 3: Was methodological quality assessed and the trials weighted accordingly?

One of the tasks of a systematic reviewer is to draw up a list of criteria, including both generic (common to all research studies) and particular (specific to the field) aspects of quality, against which to judge each trial (see box 3). However, care should be taken in developing such scores, since there is no gold standard for the “true” methodological quality of a trial12 and composite quality scores are often neither valid nor reliable in practice.13 14 The various Cochrane collaborative review groups are developing topic-specific methodology for assigning quality scores to research studies.15

Box 3: Assigning weight to trials in a systematic review

Each trial should be evaluated in terms of its:

- Methodological quality—the extent to which the design and conduct are likely to have prevented systematic errors (bias)
- Precision—a measure of the likelihood of random errors (usually depicted as the width of the confidence interval around the result)
- External validity—the extent to which the results are generalisable or applicable to a particular target population

Question 4: How sensitive are the results to the way the review has been done?
Carl Counsell and colleagues “proved” (in the Christmas 1994 issue of the BMJ) an entirely spurious relationship between the result of shaking a dice and the outcome of an acute stroke.16 They reported a series of artificial dice rolling experiments in which red, white, and green dice represented different therapies for acute stroke. Overall, the “trials” showed no significant benefit from the three therapies. However, the simulation of a number of perfectly plausible events in the process of meta-analysis—such as the exclusion of several of the “negative” trials through publication bias, a subgroup analysis which excluded data on red dice therapy (since, on looking back at the results, red dice appeared to be harmful), and other, essentially arbitrary, exclusions on the grounds of “methodological quality”—led to an apparently highly significant benefit of “dice therapy” in acute stroke.

If these simulated results pertained to a genuine medical controversy, how would you spot these subtle biases? You need to work through the “what ifs”. What if the authors of the systematic review had changed the inclusion criteria? What if they had excluded unpublished studies? What if their “quality weightings” had been assigned differently? What if trials of lower methodological quality had been included (or excluded)? What if all the patients unaccounted for in a trial were assumed to have died (or been cured)?

An exploration of what ifs is known as a sensitivity analysis. If you find that fiddling with the data in various ways makes little or no difference to the review's overall results, you can assume that the review's conclusions are relatively robust. If, however, the key findings disappear when any of the what ifs changes, the conclusions should be expressed far more cautiously and you should hesitate before changing your practice in the light of them.
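One common “what if” can be explored mechanically. As a minimal illustrative sketch (using entirely invented effect estimates, not data from any trial discussed here), a leave-one-out sensitivity analysis re-pools a fixed-effect meta-analysis with each trial excluded in turn; if the pooled odds ratio barely moves, the result is robust to any single study:

```python
import math

# Hypothetical trials: (log odds ratio, standard error).
# Illustrative values only; not taken from any study discussed in the text.
trials = {
    "A": (-0.40, 0.25),
    "B": (-0.10, 0.20),
    "C": ( 0.05, 0.30),
    "D": (-0.60, 0.35),
    "E": ( 0.20, 0.28),
}

def pooled_or(subset):
    """Fixed-effect (inverse-variance) pooled odds ratio for a dict of trials."""
    weights = {k: 1.0 / se**2 for k, (_, se) in subset.items()}
    total = sum(weights.values())
    log_or = sum(weights[k] * subset[k][0] for k in subset) / total
    return math.exp(log_or)

overall = pooled_or(trials)
print(f"All trials: OR = {overall:.2f}")

# Leave-one-out: re-pool with each trial excluded in turn and compare.
for name in trials:
    subset = {k: v for k, v in trials.items() if k != name}
    print(f"Without {name}: OR = {pooled_or(subset):.2f}")
```

The same loop generalises to other “what ifs”: excluding trials below a quality threshold, or unpublished trials, is simply a different choice of subset.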
Question 5: Have the numerical results been interpreted with common sense and due regard to the broader aspects of the problem?

Any numerical result, however precise, accurate, “significant,” or otherwise incontrovertible, must be placed in the context of the painfully simple and often frustratingly general question which the review addressed. The clinician must decide how (if at all) this numerical result, whether significant or not, should influence the care of an individual patient. A particularly important feature to consider when undertaking or appraising a systematic review is the external validity or relevance of the trials that are included.

Meta-analysis for the non-statistician

A good meta-analysis is often easier for the non-statistician to understand than the stack of primary research papers from which it was derived. In addition to synthesising the numerical data, part of the meta-analyst's job is to tabulate relevant information on the inclusion criteria, sample size, baseline patient characteristics, withdrawal rate, and results of primary and secondary end points of all the studies included. Although such tables are often visually daunting, they save you having to plough through the methods sections of each paper and compare one author's tabulated results with another author's pie chart or histogram. These days, the results of meta-analyses tend to be presented in a fairly standard form, such as is produced by the computer software MetaView. Figure 2 is a pictorial representation (colloquially known as a “forest plot”) of the pooled odds ratios of eight randomised controlled trials which each compared coronary artery bypass grafting with percutaneous coronary angioplasty in the treatment of severe angina.17 The primary (main) outcome in this meta-analysis was death or heart attack within one year.
Fig 2 Pooled odds ratios of eight randomised controlled trials of coronary artery bypass grafting against percutaneous coronary angioplasty, shown in MetaView format. Reproduced with authors' permission17

The horizontal line corresponding to each of the eight trials shows the relative risk of death or heart attack at one year in patients randomised to coronary angioplasty compared to patients randomised to bypass surgery. The “blob” in the middle of each line is the point estimate of the difference between the groups (the best single estimate of the benefit in lives saved by offering bypass surgery rather than coronary angioplasty), and the width of the line represents the 95% confidence interval of this estimate. The black line down the middle of the picture is known as the “line of no effect,” and in this case is associated with a relative risk of 1.0. If the confidence interval of the result (the horizontal line) crosses the line of no effect (the vertical line), that can mean either that there is no significant difference between the treatments or that the sample size was too small to allow us to be confident where the true result lies.

The various individual studies give point estimates of the relative risk of coronary angioplasty compared with bypass surgery of between about 0.5 and 5.0, and the confidence intervals of some studies are so wide that they do not even fit on the graph. Now look at the tiny diamond below all the horizontal lines. This represents the pooled data from all eight trials (overall relative risk of coronary angioplasty compared with bypass surgery=1.08), with a new, much narrower, confidence interval of this relative risk (0.79 to 1.50). Since the diamond firmly overlaps the line of no effect, we can say that there is probably little to choose between the two treatments in terms of the primary end point (death or heart attack in the first year).
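The pooled estimate and its narrower confidence interval come from inverse-variance weighting of the individual trial results on the log odds scale. The sketch below, using hypothetical 2×2 tables rather than the eight trials in the figure, shows the standard fixed-effect calculation and the “line of no effect” check described above:

```python
import math

# Illustrative data, not the actual trials in the figure: each entry is
# (events_a, total_a, events_b, total_b) for the two treatment arms.
trials = [
    (12, 100, 10, 100),
    ( 8, 150,  9, 148),
    (20, 300, 18, 295),
]

def log_or_and_se(ea, na, eb, nb):
    """Log odds ratio and its standard error from a 2x2 table."""
    a, b = ea, na - ea          # arm A: events, non-events
    c, d = eb, nb - eb          # arm B: events, non-events
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)
    return log_or, se

# Fixed-effect (inverse-variance) pooling on the log scale.
num = den = 0.0
for t in trials:
    lo, se = log_or_and_se(*t)
    w = 1.0 / se**2             # weight = 1 / variance
    num += w * lo
    den += w

pooled = num / den
se_pooled = math.sqrt(1.0 / den)
low = math.exp(pooled - 1.96 * se_pooled)
high = math.exp(pooled + 1.96 * se_pooled)
print(f"Pooled OR = {math.exp(pooled):.2f} (95% CI {low:.2f} to {high:.2f})")
if low <= 1.0 <= high:
    print("CI crosses the line of no effect: no significant difference shown.")
```

Because the pooled variance is the reciprocal of the summed weights, the pooled confidence interval is always narrower than that of any single contributing trial, which is why the diamond in a forest plot is narrower than the individual horizontal lines.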
Now, in this example, every one of the eight trials also suggested a non-significant effect, but in none of them was the sample size large enough for us to be confident in that negative result. Note, however, that this neat little diamond does not mean that you might as well offer coronary angioplasty rather than bypass surgery to every patient with angina. It has a much more limited meaning—that the average patient in the trials presented in this meta-analysis is equally likely to have met the primary outcome (death or myocardial infarction within a year), whichever of these two treatments they were randomised to receive. If you read the paper by Pocock and colleagues17 you would find important differences in the groups in terms of prevalence of angina and requirement for further operative intervention after the initial procedure.

Explaining heterogeneity

In the language of meta-analysis, homogeneity means that the results of each individual trial are mathematically compatible with the results of any of the others. Homogeneity can be estimated at a glance once the trial results have been presented in the format illustrated in figures 3 and 4. In figure 3 the lower confidence limit of every trial is below the upper confidence limit of all the others (that is, the horizontal lines all overlap to some extent). Statistically speaking, the trials are homogeneous. Conversely, in figure 4 some lines do not overlap at all. These trials may be said to be heterogeneous.

Fig 3 Reduction in risk of heart disease by strategies for lowering cholesterol. Reproduced with permission from Chalmers and Altman18

The definitive test for heterogeneity involves a slightly more sophisticated statistical manoeuvre than holding a ruler up against the forest plot.
The one most commonly used is a variant of the χ2 (chi square) test, since the question addressed is whether there is greater variation between the results of the trials than is compatible with the play of chance. Thompson18 offers the following rule of thumb: a χ2 statistic has, on average, a value equal to its degrees of freedom (in this case, the number of trials in the meta-analysis minus one), so a χ2 of 7.0 for a set of eight trials would provide no evidence of statistical heterogeneity. Note that showing statistical heterogeneity is a mathematical exercise and is the job of the statistician, but explaining this heterogeneity (looking for, and accounting for, clinical heterogeneity) is an interpretive exercise and requires imagination, common sense, and hands-on clinical or research experience.

Figure 4 shows the results of ten trials of cholesterol lowering strategies. The results are expressed as the percentage reduction in risk of heart disease associated with each reduction of 0.6 mmol/l in serum cholesterol concentration. From the horizontal lines which represent the 95% confidence intervals of each result it is clear, even without knowing the χ2 statistic of 127, that the trials are highly heterogeneous. Correcting the data for the age of the trial subjects reduced this value to 45. In other words, much of the “incompatibility” in the results of these trials can be explained by the fact that embarking on a strategy which successfully reduces your cholesterol level will be substantially more likely to prevent a heart attack if you are 45 than if you are 85.
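The χ2 statistic referred to here (often called Cochran's Q) is the weighted sum of squared deviations of each trial's result from the pooled estimate. A minimal sketch with invented effect estimates shows the calculation and Thompson's rule of thumb that, under homogeneity, Q is on average about equal to its degrees of freedom:

```python
import math

# Illustrative effect estimates (log relative risks) with standard errors;
# hypothetical numbers, not the cholesterol trials in the figure.
effects = [(-0.25, 0.10), (-0.30, 0.12), (0.05, 0.15),
           (-0.60, 0.11), (-0.10, 0.20), (-0.45, 0.09)]

# Cochran's Q: weighted squared deviations from the fixed-effect pooled estimate.
weights = [1.0 / se**2 for _, se in effects]
pooled = sum(w * e for w, (e, _) in zip(weights, effects)) / sum(weights)
q = sum(w * (e - pooled)**2 for w, (e, _) in zip(weights, effects))

df = len(effects) - 1   # degrees of freedom = number of trials minus one
print(f"Q = {q:.1f} on {df} degrees of freedom")

# Rule of thumb: Q well above its degrees of freedom suggests that the
# variation between trials exceeds the play of chance.
if q > 2 * df:
    print("Q substantially exceeds df: evidence of statistical heterogeneity.")
```

A formal p value would come from comparing Q against a χ2 distribution with df degrees of freedom; the rule of thumb is simply a quick reading of the same comparison.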
Clinical heterogeneity, essentially, is the grievance of Professor Hans Eysenck, who has constructed a vigorous and entertaining critique of the science of meta-analysis.19 In a world of lumpers and splitters, Eysenck is a splitter, and it offends his sense of the qualitative and the particular to combine the results of studies which were done on different populations in different places at different times and for different reasons.

The articles in this series are excerpts from How to read a paper: the basics of evidence based medicine. The book includes chapters on searching the literature and implementing evidence based findings. It can be ordered from the BMJ Publishing Group: tel 0171 383 6185/6245; fax 0171 383 6662. Price £13.95 UK members, £14.95 non-members.

Eysenck's reservations about meta-analysis are borne out in the infamously discredited meta-analysis which showed (wrongly) that giving intravenous magnesium to people who had had heart attacks was beneficial. A subsequent megatrial involving 58 000 patients (ISIS-4) failed to find any benefit, and the meta-analysts' misleading conclusions were subsequently explained in terms of publication bias, methodological weaknesses in the smaller trials, and clinical heterogeneity.20 21

Acknowledgments

Thanks to Professor Iain Chalmers for advice on this chapter.

References

1. The Cochrane Centre. Cochrane Collaboration handbook [updated 9 December 1996]. The Cochrane Collaboration; issue 1. Oxford: Update Software, 1997.
2. Bero L, Rennie D. The Cochrane Collaboration: preparing, maintaining, and disseminating systematic reviews of the effects of health care. JAMA 1995;274:1935–8.
3. Chalmers I, Altman DG, eds. Systematic reviews. London: BMJ Publishing Group, 1995.
4. Pauling L. How to live longer and feel better. New York: Freeman, 1986.
5. Oxman AD, Guyatt GH. The science of reviewing research. Ann NY Acad Sci 1993;703:125–31.
6. Mulrow C. The medical review article: state of the science. Ann Intern Med 1987;106:485–8.
7. Antman EM, Lau J, Kupelnick B, Mosteller F, Chalmers TC. A comparison of results of meta-analyses of randomised controlled trials and recommendations of clinical experts. JAMA 1992;268:240–8.
8. Koudstaal P. Secondary prevention following stroke or TIA in patients with non-rheumatic atrial fibrillation: anticoagulant therapy versus control. Cochrane Database of Systematic Reviews. Oxford: Cochrane Collaboration, 1995. (Updated 14 February 1995.)
9. Greenhalgh T. Searching the literature. In: How to read a paper. London: BMJ Publishing Group, 1997:13–33.
10. Knipschild P. Some examples of systematic reviews. In: Chalmers I, Altman DG, eds. Systematic reviews. London: BMJ Publishing Group, 1995:9–16.
11. Knipschild P. Searching for alternatives: loser pays. Lancet 1993;341:1135–6.
12. Oxman A, ed. Preparing and maintaining systematic reviews. In: Cochrane Collaboration handbook, section VI. Oxford: Cochrane Collaboration, 1995. (Updated 14 July 1995.)
13. Emerson JD, Burdick E, Hoaglin DC, Mosteller F, Chalmers TC. An empirical study of the possible relation of treatment differences to quality scores in controlled randomized clinical trials. Controlled Clin Trials 1990;11:339–52.
14. Moher D, Jadad AR, Tugwell P. Assessing the quality of randomized controlled trials: current issues and future directions. Int J Health Technol Assess 1996;12:195–208.
15. Garner P, Hetherington J. Establishing and supporting collaborative review groups. In: Cochrane Collaboration handbook, section II. Oxford: Cochrane Collaboration, 1995. (Updated 14 July 1995.)
16. Counsell CE, Clarke MJ, Slattery J, Sandercock PAG. The miracle of DICE therapy for acute stroke: fact or fictional product of subgroup analysis? BMJ 1994;309:1677–81.
17. Pocock SJ, Henderson RA, Rickards AF, Hampton JR, King SB III, Hamm CW, et al. Meta-analysis of randomised trials comparing coronary angioplasty with bypass surgery. Lancet 1995;346:1184–9.
18. Thompson SG. Why sources of heterogeneity in meta-analysis should be investigated. In: Chalmers I, Altman DG, eds. Systematic reviews. London: BMJ Publishing Group, 1995:48–63.
19. Eysenck HJ. Problems with meta-analysis. In: Chalmers I, Altman DG, eds. Systematic reviews. London: BMJ Publishing Group, 1995:64–74.
20. Magnesium, myocardial infarction, meta-analysis and mega-trials. Drug Ther Bull 1995;33:25–7.
21. Egger M, Davey Smith G. Misleading meta-analysis: lessons from “an effective, safe, simple” intervention that wasn't. BMJ 1995;310:752–4.