A Microhomology-Mediated Break-Induced Replication
Model for the Origin of Human Copy Number Variation
P. J. Hastings1*, Grzegorz Ira1, James R. Lupski1,2,3
1Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America, 2Department of Pediatrics, Baylor College of
Medicine, Houston, Texas, United States of America, 3Texas Children’s Hospital, Houston, Texas, United States of America
Abstract: Chromosome structural changes with nonre-
current endpoints associated with genomic disorders
offer windows into the mechanism of origin of copy
number variation (CNV). A recent report of nonrecurrent
duplications associated with Pelizaeus-Merzbacher dis-
ease identified three distinctive characteristics. First, the
majority of events can be seen to be complex, showing
discontinuous duplications mixed with deletions, inverted
duplications, and triplications. Second, junctions at end-
points show microhomology of 2–5 base pairs (bp). Third,
endpoints occur near pre-existing low copy repeats
(LCRs). Using these observations and evidence from
DNA repair in other organisms, we derive a model of
microhomology-mediated break-induced replication
(MMBIR) for the origin of CNV and, ultimately, of LCRs.
We propose that breakage of replication forks in stressed
cells that are deficient in homologous recombination
induces an aberrant repair process with features of break-
induced replication (BIR). Under these circumstances,
single-strand 3′ tails from broken replication forks will
anneal with microhomology on any single-stranded DNA
nearby, priming low-processivity polymerization with
multiple template switches generating complex rearrangements,
and eventual re-establishment of processive replication.
In the past few years, we have learnt that a major component of
the differences between individuals is variation in the number of
copies of segments of the genome, and of genes included in these
segments (copy number variation or CNV) (for definition of
abbreviations, see Table 1). A considerable portion of the genome
is involved in CNV [1–11], with estimates of up to 12%.
CNV can arise meiotically and also somatically, as shown by the
finding that identical twins can differ in CNV. CNV has been
a significant component of primate evolution [13–16]. Here we
draw on evidence on the mechanism of DNA transactions in
Escherichia coli, yeast, Drosophila, mammals, and human cancer to
derive a model for the origin of CNV based on the mechanism of
BIR occurring at sites of microhomology (microhomology-
mediated BIR or MMBIR).
Although we can see that considerable variation in copy
number is tolerated or is advantageous to its carrier, some genes
are dosage-sensitive, and duplication or deletion involving these
genes gives rise to human clinical phenotypes collectively referred
to as genomic disorders . This has allowed the ascertainment
of structural changes and thus the study of the origin of CNV. For
recurrent rearrangements, much CNV stems from homologous
recombination between segments that already occur as two or
more copies. When this happens, sequences that lie between the
repeats that recombine will be either duplicated or deleted, thus
changing the copy number. This process is referred to as nonallelic
homologous recombination, or NAHR . The repeated
sequences that recombine might occasionally be highly repetitive
sequences that occur widely in the human genome  but are
usually sequences that occur only twice or a few times (i.e., low-
copy repeats, LCRs, or segmental duplications, SDs). The LCRs
tend to occur in clusters in highly complex regions of the genome.
These repeated segments might be short (about 10 kilobases (kb)),
or up to several hundreds of kb in length, and they occur in either
orientation. Some examples of genomic complex regions are
shown in Figure 1.
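The copy-number consequence of NAHR described above can be illustrated with a minimal sketch. It assumes a toy chromatid written as a string with two direct copies of a repeat; the function name `nahr_products` and the sequences `LCR` and `GENE` are hypothetical and chosen only for illustration. An unequal exchange between the second repeat copy of one chromatid and the first repeat copy of its partner yields one product with the intervening segment duplicated and a reciprocal product with it deleted.

```python
def nahr_products(chromatid: str, repeat: str) -> tuple[str, str]:
    """Return (duplication, deletion) products of an unequal exchange
    between two direct copies of `repeat` on misaligned chromatids."""
    first = chromatid.find(repeat)
    second = chromatid.find(repeat, first + len(repeat)) if first != -1 else -1
    if first == -1 or second == -1:
        raise ValueError("need two direct copies of the repeat")
    end_first = first + len(repeat)
    end_second = second + len(repeat)
    # Chromatid 1 up to the end of its second repeat copy, joined to
    # chromatid 2 from the end of its first repeat copy: a duplication.
    duplication = chromatid[:end_second] + chromatid[end_first:]
    # The reciprocal product loses the segment between the repeats.
    deletion = chromatid[:end_first] + chromatid[end_second:]
    return duplication, deletion


# Hypothetical toy sequences, not real genomic data.
LCR = "CCGGTTAACC"
GENE = "TTTTTT"
parent = "AAAA" + LCR + GENE + LCR + "GGGG"
dup, dele = nahr_products(parent, LCR)
print(parent.count(GENE), dup.count(GENE), dele.count(GENE))  # prints 1 2 0
```

Because the exchange occurs within the repeats, every such event has its endpoints in the same places, which is why NAHR products are recurrent.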
The endpoints of CNVs that arose by NAHR occur in a few
positions where there is sufficient homology for homologous
recombination. Although many genomic disorders arise by NAHR,
some rearrangements have endpoints in many different
positions. These CNVs arose de novo by rearrangements at sites
that lack extensive homology. Recent evidence on the distribution
of nonpathological CNVs in two individuals suggests that most
differences in copy number from the reference sequence arose by
nonrecurrent events. Thus, nonrecurrent chromosomal changes
arise quite frequently. Because the nonrecurrent events
presumably reflect the origin of most genome complexity, the
study of them is important to the understanding of genomic
disorders, genetic variability due to CNV, and human evolution.
Pelizaeus-Merzbacher disease (PMD; Online Mendelian Inheritance
in Man (OMIM) accession code 312080; http://www.ncbi.nlm.nih.gov/omim/)
is a recessive X-linked genomic disorder
affecting the central nervous system that arises by nonrecurrent
chromosomal changes. The changes involve duplication, triplica-
tion, or deletion of the PLP1 gene. The clinical phenotype allows
identification of individuals showing nonrecurrent chromosomal
changes in the PLP region. In a study of the structural variation in
the genomes of patients with PMD, Lee et al.  describe some
aspects of the fine structure of newly arising CNVs with
nonrecurrent endpoints and report three striking properties of
their structure that help us to understand the origin of CNVs.
Citation: Hastings PJ, Ira G, Lupski JR (2009) A Microhomology-Mediated Break-Induced
Replication Model for the Origin of Human Copy Number Variation. PLoS
Genet 5(1): e1000327. doi:10.1371/journal.pgen.1000327
Published January 30, 2009
Copyright: © 2009 Hastings et al. This is an open-access article distributed
under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the
original author and source are credited.
Funding: This work was supported by grants from the National Institutes of
Health; R01 GM64022 to PJH and R01 GM80600 to GI.
* E-mail: firstname.lastname@example.org
Editor: Ivan Matic, Université Paris Descartes, INSERM U571, France
PLoS Genetics | www.plosgenetics.org | January 2009 | Volume 5 | Issue 1 | e1000327
First, the authors report that the novel junctions form at sites of
microhomology, i.e., lengths of homology 2 to 5 nucleotides long
that are too short to support homologous recombination. Such
junctions have been reported previously in cases of nonrecurrent
endpoints of deletions and duplications [19,23,24]. Second, they
observed that the new structures are complex, showing duplication
and deletion interspersed with nonduplicated or with triplicated
lengths, and showing duplicated segments in either orientation.
These characteristics were reported previously [25–31]. Third,
although these events did not arise by NAHR, the novel junctions
tend to occur in close proximity to LCRs [32–34]. Figures 2 and 3
illustrate examples of these complex non-recurrent events.
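To make the notion of junction microhomology concrete, the following is a minimal sketch, assuming the two reference sequences that were joined and the breakpoint coordinates on each are already known. The function name `junction_microhomology` and the toy sequences are hypothetical. The microhomology is the run of bases at the junction that is identical in both references, so that the exact breakpoint cannot be placed within it.

```python
def junction_microhomology(ref_a: str, i: int, ref_b: str, j: int) -> str:
    """Microhomology at a junction formed as ref_a[:i] + ref_b[j:].

    Bases are counted leftward and rightward from the nominal breakpoint
    for as long as the two references agree."""
    left = 0
    while (i - 1 - left >= 0 and j - 1 - left >= 0
           and ref_a[i - 1 - left] == ref_b[j - 1 - left]):
        left += 1
    right = 0
    while (i + right < len(ref_a) and j + right < len(ref_b)
           and ref_a[i + right] == ref_b[j + right]):
        right += 1
    return ref_a[i - left:i + right]


# Hypothetical toy references sharing the trinucleotide AGG.
ref_a = "ACGTTGCAGGTCAAT"
ref_b = "TTCTAGGAGCTA"
# The same junction sequence can be written with the breakpoint placed
# after or before the shared bases; both give the same microhomology.
assert ref_a[:10] + ref_b[7:] == ref_a[:7] + ref_b[4:]
print(junction_microhomology(ref_a, 10, ref_b, 7))  # prints AGG
print(junction_microhomology(ref_a, 7, ref_b, 4))   # prints AGG
```

A result of 2 to 5 bases corresponds to the microhomology range reported at these junctions, whereas an empty string would indicate a blunt join.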
Nonrecurrent rearrangements had previously been attributed to
a mechanism of nonhomologous end joining (NHEJ)
[19,20,24,33]. However, the characteristics of microhomology
junctions and structural complexity in these new structures, as
revealed by nucleotide sequencing and high-resolution array
comparative genomic hybridization, led Lee et al.  to propose
that the rearrangements arose through a replication-based
mechanism termed FoSTeS (fork stalling and template switching),
a mechanism proposed previously for amplification in E. coli .
Replication-based models have also been proposed to explain the
origin of gross chromosomal rearrangements seen in a low
proportion of patients with cystic fibrosis and hemophilia A.
Analysis of deletions of the genes involved reveals complex
structures similar to those described for PLP1 [28,29,31].
Genome Rearrangements in Cancer
The amount of structural variation in cancer cells is sometimes
so extreme  that it is not possible to determine which changes
occurred within the same event. However, it can be seen that
duplications are often discontinuous, and junction regions include
insertions of nearby, unlinked, and unknown sequences, and
deletions and inversions , showing that rearrangement events
in cancer cells are complex. Many studies report microhomology
at junctions of a large proportion of the structural variation (e.g.,
[37–39]). Studies of translocation endpoints in leukemia and other
cancers find that many junctions have microhomology and are
associated with insertions and deletions of various lengths [40–42].
These observations are compatible with at least some of the
genomic instability seen in tumor formation and progression
having stemmed from the same underlying mechanism as the
formation of nonrecurrent duplications in genomic disorders.
Involvement of Replication in Chromosomal Structural Change
In the Lac assay system in E. coli , amplification of the lac
operon to 20–100 copies occurs in response to the stress of
starvation [44,45]. The novel junctions of the amplified segments
(amplicons) show that endpoints occurred at sites of microhomol-
ogy of 2–15 bp [35,46]. Some of the amplicons are complex,
containing both direct and inverted repeats. Many others cannot
Table 1. Abbreviations Used in the Text.
BIR     Break-induced replication, a recombination-based mechanism for restarting broken replication forks.
CNV     Copy number variation, variation within a population of the number of copies of a gene or length of genome.
DSB     Double-strand break, a break in both strands of a DNA molecule.
FoSTeS  Fork stalling and template switching, a replicative mechanism for changing chromosome structure.
LCR     Low copy repeat, a length of genome that occurs twice or a few times.
MMBIR   Microhomology-mediated break-induced replication, a replication-based mechanism of recombination between sequences with very little base identity, proposed here.
NAHR    Nonallelic homologous recombination, homologous recombination occurring between low copy repeats.
NHEJ    Nonhomologous end joining, a mechanism for repair of DNA double-strand breaks that does not require homology.
SD      Segmental duplication, a repetition of a length of genome.
Figure 1. In silico analyses revealed complex genomic architecture in regions of nonrecurrent rearrangement. (A) The ~3 Mb
surrounding the PLP1 gene and (B) the ~4 Mb surrounding the MECP2 gene on the X chromosome contain numerous LCRs in various orientations
[33,106]. LCRs are represented by the colored block arrows, and like LCR copies are designated by color and letter for a given sequence. Orientation is
depicted by the direction of the block arrow.
be identified by outward-facing polymerase chain reaction (an
observation also encountered frequently for PLP1 duplication
junction analysis ), which would reveal the junctions of simple
tandem repeats, and so are presumed to be complex, rather than
simple tandem repeats [35,46,47]. By these criteria, about 25% of
amplicons are complex. Thus, with respect to microhomology and
complexity, the chromosomal structural changes in this system
resemble those found in nonrecurrent events in human genomic disorders.
Homologous recombination requires RecA protein (Rad51 in
eukaryotes) (reviewed in ). Microhomology-mediated deletion
formation in E. coli (less than 25 nucleotides of homology) has long
been known to be RecA-independent [49–52]. RecA-independent
short homology-mediated deletions (25–50 nucleotides) have
previously been attributed to template switching within a
replication fork during DNA replication (reviewed in ). The
evidence for this is, first, that mutations in genes encoding
replication functions affect the formation of these events; second,
that mutations affecting post-replicational mismatch repair affect
them, placing the event very near to the replication fork; third,
that mutation of 3′ exonucleases has an effect that is consistent
with the ends being used to prime DNA synthesis; and fourth, that
it is very difficult to obtain mutations affecting the process by
transposon mutagenesis, suggesting essential functions.
In the E. coli Lac system, study of genetic requirements of stress-
induced amplification has revealed some details of the mechanism.
First, the events involve 3′ DNA ends. This is seen by an increase in
amplification when a 3′ exonuclease gene (xonA) is deleted, and a
decrease when the 3′ exonuclease is over-expressed. Similar
manipulation of 5′ exonuclease has no effect. This suggests
that amplification results from free 3′ ends in the cell, most of which
are normally removed by exonuclease. As above, the involvement of
3′ ends but not 5′ ends is consistent with priming of DNA synthesis.
Second, lagging-strand processing at replication forks is
implicated by a requirement for the 5′ exonuclease domain of
DNA polymerase I (Pol I) [35,45]. Pol I is involved in lagging-
Figure 2. Complex rearrangements involving PLP1 detected by junction analysis (A) and oligonucleotide array comparative
genomic hybridization analysis (B) . (A) A complex duplication of the PLP1 region detected by outward facing polymerase chain reaction.
Panel (i) shows the PLP1 region with the positions of the outward facing primers. The structure of the duplicated region is shown in (ii), with an
enlargement of the complex junction region in (iii). Two or three bp of microhomology, shown by the letters A, C, G and T, were found at the
breakpoint junctions (open arrows). (B) Deletion and duplications found in two patients with Pelizaeus-Merzbacher disease and their carrier mother,
shown by comparative genomic hybridization. A ~190-kb deletion is followed by a ~9-kb segment with no copy-number change, and an
interrupted ~190-kb duplication was detected (i). Panel (ii) shows enlargement of the array revealing interruption of the ~190-kb duplication. In
each horizontal yellow box above, blue lines represent an average of the data points. Red data points indicate copy-number gains, green data points
indicate losses, and black data points indicate no copy-number change. The y-axes show relative hybridization; genomic position is on the x-axis.
Panel (iii) summarizes the structure based on comparative genomic hybridization where a green box shows the region deleted, red boxes show the
regions duplicated, and black lines show regions of no change.
strand replication, base excision repair, and nucleotide excision
repair, but these excision repair processes are not involved in
amplification , so lagging strands at replication forks are
implicated in amplification.
Third, there is a requirement for the proteins of double-strand
break (DSB) repair by homologous recombination  (the
RecBC system, reviewed in ). That this is actually a
requirement for DSB repair (not just the proteins) is shown by
the discovery that in vivo double-strand cleavage of DNA near lac
enhances amplification rates .
Taken together, these observations suggest a model for
amplification in the Lac system in E. coli in which replication is
restarted at sites of repair of DNA double-strand ends . The
hypothesis proposed was that template switching occurs during
replication restart at stalled replication forks. Because the distances
involved exceed the lengths that are expected to be exposed as
single-stranded at a single replication fork, it was proposed that the
switches occurred between different replication forks .
The idea that chromosomal structural changes originate from
DNA replication has received support from a study of micro-
homology-mediated SD formation in yeast . These authors
support the idea that the mechanism of SD formation involves
replication by showing that its frequency is enhanced by treatment
with camptothecin and is dependent on Pol32, a component of
Polδ (discussed below). Camptothecin is a topoisomerase I
inhibitor that leaves nicks in DNA. These nicks are believed to
become collapsed forks when a replication fork reaches them.
Thus, increasing the frequency of fork collapse increases the
frequency of duplication formation. These authors also report that
situations that lead to fork stalling rather than collapse have little
effect on the frequency of duplication formation . Thus, it
appears that the substrate for duplication is a single double-strand
end at a collapsed replication fork.
This long-distance template-switch model was also used by Lee
et al.  to explain the observations of nonrecurrent chromo-
somal changes seen in Pelizaeus-Merzbacher disease discussed
above and the juxtaposition of multiple genomic sequences
normally separated by large genomic distances [22,56]. Experi-
ments on the integration of nonhomologous DNA into mamma-
lian cells revealed microhomology junctions and insertion of
sequence from other parts of the genome at the junctions. These
observations were interpreted in terms of a similar model of
repeated copying and switching to another template .
A more specific model for restarting replication at collapsed
(broken) replication forks, BIR , has been developed for yeast,
and a similar mechanism was proposed to explain telomere
maintenance in yeast and human cell lines that have lost
telomerase activity (reviewed in ). Recent evidence [60,61]
suggests that the BIR mechanism can be modified to explain the
complexity of chromosomal structural changes described above for
human and E. coli. Figure 4 illustrates the mechanism of BIR.
When the replicative helicase encounters a nick on the template
strand (Figure 4A), one arm of a replication fork breaks off
(Figure 4B). There is no second end to be involved in the
mechanisms of DSB repair that are available at a DSB consisting
of two double-strand ends: homologous recombination or
nonhomologous end-joining. The 5′ end of the broken arm is
resected by an exonuclease to leave a 3′ overhang (Figure 4C).
This 3′ tail invades a homologous sequence, normally the sister
chromatid from which it came. This invasion is mediated by
RecA/Rad51 protein (Figure 4D). The 3′ end primes DNA
synthesis and establishes a replication fork consisting of both
leading and lagging strand synthesis  (Figure 4E). This
replication is of low processivity, and the extended arm is
separated from the sister chromatid (Figure 4F). Such separation
might be achieved by migration of the Holliday junction shown in
Figure 4D and 4E. The 3′ end reinvades and the process is
repeated (Figure 4G and 4H). After a few cycles of invasion,
extension, and separation, the replication fork becomes more
processive, and replication continues to the end of the chromo-
some arm or to the end of the replicon. The change from low
processivity to highly processive replication can be attributed to a
switch in the DNA polymerases involved . Initial extension
from a double-strand end was shown to require the primase
Figure 3. Complex genomic rearrangements at PLP1 seen in patients with Pelizaeus-Merzbacher disease, illustrating long-range as
well as short-range complexity. Duplications are shown in red, deletions in green, triplications in blue, and no copy number change in black. The
figure is not drawn to scale. Approximate positions are given relative to PLP1.
complex and Polδ, notably the nonessential Pol32 subunit,
whereas the more processive Polε was required for the 30-kb
extension to the telomere. Figure 4I shows the completed pair of
chromatids with the new material segregating conservatively as
suggested for E. coli . This would result if the Holliday junction
followed the replication fork. Another possibility is that the
Holliday junction is resolved so that there will be semi-
conservative segregation of old and new DNA strands ,
(reviewed in ). Evidence for conservative segregation of new
DNA strands in BIR, suggesting that the Holliday junction was not
resolved, was reported for E. coli .
The repeated extension and separation have been interpreted as
repeated attempts to find the other side of a break consisting of two
double-strand ends. When, eventually, none is found because this
is a collapsed fork rather than a two-ended DSB, the remainder of
the chromosome is replaced by replication [60,63]. The pattern of
repeated rounds of template switching followed by a long length of
replication is supported by observations of BIR in yeast. BIR can
be induced experimentally by transforming a chromosomal
fragment into a yeast cell . Using such a system, Smith et al.
 placed a chromosomal fragment with a centromere and one
telomere-forming sequence into a diploid yeast cell. The fragment
had homology to both homologues of chromosome III. These
homologues were differentially marked. Selection for a marker on
the fragment selected for cells in which the fragment had acquired
a second telomere. These authors found that most fragments had
completed the replication of 50 kb to the end of the chromosome
to which the fragment had homology. The striking result was that
many of the chromosomes recovered had switched from one
homologue to the other. In some cases, more than one switch was
seen. The switches were confined to the first 10 kb, after which a
single homologue was copied. In a few percent of cases, the switch
was to a different chromosome at sites of repeated homology
consisting of the long terminal repeat of a retrotransposon. Thus,
BIR was demonstrated to produce complexity of the sorts reported
above for E. coli amplification and for nonrecurrent end-points in
human genomic disorders.
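The pattern described above, in which template switches are confined to an initial low-processivity phase and are followed by uninterrupted copying of a single template, can be sketched as a toy simulation. This is an illustration under stated assumptions and not a quantitative model: the 50-kb tract and the 10-kb switch window are taken from the yeast observations above, whereas the per-kilobase switch probability `switch_prob`, the 1-kb step, and the names `simulate_bir` and `switch_positions` are hypothetical.

```python
import random


def simulate_bir(total_kb=50, switch_window_kb=10, switch_prob=0.3, seed=0):
    """Return the template copied at each kilobase of a BIR tract.

    Within the initial window the extending end may dissociate and
    reinvade the other homologue; beyond it, replication is treated as
    fully processive. `switch_prob` is an assumed, illustrative value."""
    rng = random.Random(seed)
    templates = ("homologue_1", "homologue_2")
    current = 0
    copied = []
    for kb in range(total_kb):
        in_window = 0 < kb < switch_window_kb
        if in_window and rng.random() < switch_prob:
            current = 1 - current  # dissociate and reinvade the other template
        copied.append(templates[current])
    return copied


def switch_positions(copied):
    """Kilobase positions at which the copied template changes."""
    return [i for i in range(1, len(copied)) if copied[i] != copied[i - 1]]


tract = simulate_bir(seed=1)
print(switch_positions(tract))
```

Whatever the seed, every switch falls within the first 10 kb and the remaining 40 kb derive from one template, which mirrors the mosaic-then-uniform structure of the recovered chromosomes.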
BIR has been suggested as the mechanism that underlies SD
and other structural changes in yeast, e.g., [55,65,66], and human,
e.g., [31,67]. As discussed below, BIR is strongly RecA/Rad51-
dependent and homology-dependent, and so cannot account for
the observations of microhomology associated with complex
rearrangements without substantial change.
Microhomology-Mediated BIR (MMBIR)
BIR, as described above, is usually an accurate process, because
the repeated invasions are RecA/Rad51-mediated and involve long
lengths of homology between DNA sequences. Invasion catalyzed
by RecA/Rad51 requires extensive homology of about 50 bp in E.
coli  and more in eukaryotes [69,70]. This does not fit with the
microhomology junctions described above. We therefore suggest
that in these systems, replication forks are reestablished in a RecA/
Rad51-independent manner. Rad51-independent BIR occurs in
yeast at a much lower efficiency than the Rad51-dependent BIR
[71,72], though its frequency is very much enhanced, at the expense
of fidelity, by the presence of unusual structures such as an inverted
repeat . However, telomere recombination in the absence of
telomerase is proficient in the absence of Rad51 and is mediated by
very short homologies [73,74] (reviewed in ). The fact that
telomere recombination occurs by BIR is supported by the finding
that it requires the same set of enzymes as BIR that is initiated in the
middle of a chromosome. Absence or shortage of RecA/Rad51
might arise because the cells are stressed, as described below. That
microhomology-mediated SD formation occurs in yeast by a BIR
mechanism is supported by the finding that, like homology-
mediated BIR , it requires Pol32 .
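The contrast drawn above between the extensive homology needed for RecA/Rad51-catalyzed invasion and the very short homologies seen at junctions can be illustrated with a small sketch. It assumes that a 3′ tail can prime synthesis wherever its terminal k bases can base-pair with exposed single-stranded DNA; the function names `reverse_complement` and `annealing_sites` and the toy sequences are hypothetical. The point is only that short terminal matches offer many candidate sites, while longer ones quickly become rare.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def annealing_sites(tail: str, exposed_ssdna: str, k: int) -> list[int]:
    """Start positions on `exposed_ssdna` (5' to 3') where the last k
    bases at the 3' end of `tail` could base-pair."""
    target = reverse_complement(tail[-k:])
    sites = []
    start = exposed_ssdna.find(target)
    while start != -1:
        sites.append(start)
        start = exposed_ssdna.find(target, start + 1)
    return sites


# Hypothetical toy sequences, not real genomic data.
tail = "GGCATTACGT"
exposed = "TTACGGAACGTT"
for k in (3, 4, 5):
    print(k, annealing_sites(tail, exposed, k))
```

In this toy case the number of candidate sites falls from two to one to none as k increases from 3 to 5, which is the sense in which microhomology permits annealing to almost any nearby single-stranded DNA while extensive homology restricts invasion to the true homologue.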
In mammalian cells, there is a surprisingly efficient micro-
homology-mediated DSB repair pathway. Most, if not all,
experimental research on microhomology-mediated DSB repair
has been performed with nuclease-induced breaks. This recently
Figure 4. Repair of a collapsed replication fork by BIR. When a replication fork encounters a nick in a template strand (A) (arrowhead), one arm
of the fork breaks off (red), producing a collapsed fork (B). At the single double-strand end, the 5′ strand is resected, giving a 3′ overhang (C). The 3′
single-strand end invades the sister molecule (blue), forming a D-loop (D), which subsequently becomes a replication fork with both leading and
lagging strand replication (E). There is a Holliday junction at the site of the D-loop. Migration of the Holliday junction, or some other helicase activity,
separates the extended double-strand end from its templates (F). The separated end is again processed to give a 3′ single-strand end, which again
invades the sister, and forms a replication fork (G). Eventually the replication fork becomes fully processive, and continues replication to the
chromosome end (H and I). This process is shown here with the Holliday junction following the fork so that newly formed strands are segregated
together (conservative segregation) (H). Each line represents a DNA nucleotide chain (strand). Polarity is indicated by half arrows on 3′ end. New DNA
synthesis is shown by dashed lines. The publications on which this model is based are cited in the text.