Distinct patterns of somatic alterations in a lymphoblastoid and a tumor genome derived from the same individual

Pedro A. F. Galante^1, Raphael B. Parmigiani^1, Qi Zhao^2,3, Otávia L. Caballero^3, Jorge E. de Souza^1, Fábio C. P. Navarro^1, Alexandra L. Gerber^4, Marisa F. Nicolás^4, Anna Christina M. Salim^1, Ana Paula M. Silva^1, Lee Edsall^5, Sylvie Devalle^3, Luiz G. Almeida^4, Zhen Ye^5, Samantha Kuan^5, Daniel G. Pinheiro^6, Israel Tojal^6, Renato G. Pedigoni^6, Rodrigo G. M. A. de Sousa^6, Thiago Y. K. Oliveira^6, Marcelo G. de Paula^6, Lucila Ohno-Machado^7, Ewen F. Kirkness^2, Samuel Levy^2, Wilson A. da Silva Jr^6, Ana Tereza R. Vasconcelos^4, Bing Ren^5, Marco Antonio Zago^8, Robert L. Strausberg^2,3, Andrew J. G. Simpson^3, Sandro J. de Souza^1 and Anamaria A. Camargo^1,*

^1 Ludwig Institute for Cancer Research, São Paulo Branch at Hospital Alemão Oswaldo Cruz, São Paulo, 01323-903 SP, Brazil
^2 J. Craig Venter Institute, Rockville, 20850 MD, USA
^3 Ludwig Collaborative Group, Department of Neurosurgery, Johns Hopkins University, Baltimore, 20850, MD, USA
^4 Laboratório Nacional de Computação Científica, Laboratório de Bioinformática, Petrópolis, 25651-075 RJ, Brazil
^5 Ludwig Institute for Cancer Research, San Diego Branch, San Diego, 92093 CA, USA
^6 Departamento de Genética, Faculdade de Medicina de Ribeirão Preto, Universidade de São Paulo, Ribeirão Preto, 14049-900 SP, Brazil
^7 Division of Biomedical Informatics, University of California, San Diego, 92093 CA, USA
^8 Departamento de Clínica Médica, Centro de Terapia Celular e Banco de Sangue, Universidade de São Paulo, Ribeirão Preto, 14051-140 SP, Brazil
Received November 28, 2010; Revised March 22, 2011; Accepted March 25, 2011
ABSTRACT
Although patterns of somatic alterations have been reported for tumor genomes, little is known about how they compare with alterations present in non-tumor genomes. A comparison of the two would be crucial to better characterize the genetic alterations driving tumorigenesis. We sequenced the genomes of a lymphoblastoid (HCC1954BL) and a breast tumor (HCC1954) cell line derived from the same patient and compared the somatic alterations present in both. The lymphoblastoid genome presents a comparable number and a similar spectrum of nucleotide substitutions to those found in the tumor genome. However, a significant difference in the ratio of non-synonymous to synonymous substitutions was observed between the two genomes (P = 0.031). Protein–protein interaction analysis revealed that mutations in the tumor genome preferentially affect hub-genes (P = 0.0017) and are co-selected to present synergistic functions (P < 0.0001). KEGG analysis showed that in the tumor genome most mutated genes were organized into signaling pathways related to tumorigenesis. No such organization or synergy was observed in the lymphoblastoid genome. Our results indicate that endogenous mutagens and replication errors can generate the overall number of mutations required to drive tumorigenesis and that it is the combination rather than the frequency of mutations that is crucial to complete tumorigenic transformation.
*To whom correspondence should be addressed. Tel: +55 11 3388.3248; Fax: +55 11 3141.1325; Email: anamaria@compbio.ludwig.org.br
6056–6068 Nucleic Acids Research, 2011, Vol. 39, No. 14 Published online 14 April 2011
doi:10.1093/nar/gkr221
© The Author(s) 2011. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
INTRODUCTION
Somatic genetic alterations accumulate in the genome of all dividing cells as a result of DNA replication errors or exposure to mutagens. Some somatic alterations, known as driver mutations, confer a selective growth advantage to the cells and can cause cancer. Conversely, passenger alterations are biologically neutral and are expected to occur in normal genomes (1,2). Recent advances in sequencing technologies have allowed a comprehensive characterization of the genetic alterations occurring in individual tumors, including copy number variations (CNVs), chromosomal rearrangements, point mutations and small insertions and deletions (Indels) (3–11). However, the frequency and characteristics of the genetic alterations driving tumorigenesis are not currently well-defined.
Although the sequences of six matched tumor and non-tumor genomes have been published (4–8,11), these studies were limited to the identification of somatic alterations present in the tumor genomes and there has been no attempt to define the somatic alterations present in the matched non-tumor genome. A comparison of the two would be crucial to better characterize the genetic changes that drive tumorigenesis, since we expect most, if not all, alterations present in the non-tumor genome to be passenger mutations.
Here, we have sequenced the genome of a breast tumor cell line (HCC1954) and a lymphoblastoid cell line (HCC1954BL) derived from the same patient and compared the genetic variations occurring in both. HCC1954 is an immortal, pseudo-tetraploid (12), publicly available tumor cell line derived from a hormone receptor-negative, ERBB2-positive primary breast tumor from a 61-year-old female patient. HCC1954BL is an Epstein-Barr virus (EBV)-transformed lymphoblastoid cell line derived from the same patient. Both cell lines received similar treatments in terms of the timing of establishment and in vitro propagation (36 passages) (13), and the sequencing data revealed both lines to be clonal, permitting the detection of somatic changes in both.
Using a combined sequencing approach, we were able to characterize chromosomal rearrangements and single nucleotide variations (SNVs) in the protein-coding regions of the HCC1954 and HCC1954BL genomes. By comparing the sets of rearrangements and SNVs present in both genomes, we were able to exclude those that are common to both genomes and correspond to inherited variations, and to identify those that are exclusive to each genome and likely correspond to somatic alterations that have independently accumulated in the genomes of these cell lines during their lifespan. Thus, in the present work, in addition to identifying the somatic mutations present in the tumor genome, we have also identified those present in the matching lymphoblastoid genome and used a systems biology approach to better characterize the functional differences between the sets of altered genes present in the two genomes.
We found that the HCC1954BL genome contains very few somatically acquired chromosomal rearrangements but, surprisingly, presents a comparable number of somatic point mutations and a similar spectrum of nucleotide substitutions to those found in the HCC1954 genome. We also observed that, unlike in the HCC1954BL genome, non-synonymous mutations present in the HCC1954 genome were not randomly distributed and were co-selected to present synergistic tumor-promoting functions and to affect hub-genes in functional pathways related to tumorigenesis. Our results provide important insights into the normal mutational processes and into the functional implications of the accumulation of somatic mutations in a tumor and a matching lymphoblastoid genome.
MATERIALS AND METHODS
Cell lines and DNA extraction
HCC1954 and HCC1954BL cell lines were obtained from
American Type Culture Collection (ATCC) and were
maintained in RPMI medium containing 10% fetal
bovine serum (FBS) and non-essential amino acids. Both
cell lines received similar treatments in terms of the timing
of establishment and in vitro propagation and present
similar proliferation rates (data not shown). Both of
them were received from ATCC at passage 25 after
in vitro establishment and were maintained in culture
until passage 36 when DNA was extracted for sequencing.
DNA was isolated using the DNeasy Blood & Tissue Kit (Qiagen, Valencia, CA, USA). Genomic DNA was treated with RNase Cocktail™ (Ambion, Austin, TX, USA), followed by phenol–chloroform extraction and precipitation of the aqueous phase in 1/10 volume 3 M sodium acetate, pH 5.2, and 100% ethanol.
Public genome data
The human reference genome sequence (NCBI build
36.1/hg18) was downloaded from the UCSC Genome
Browser (http://genome.ucsc.edu). Alternative haplotype
regions, including the immunoglobulin loci, were
excluded from the reference sequence because of their
highly polymorphic and rearranged structure. Human reference mRNA sequences (RefSeqs for coding mRNAs, ‘NM’, and non-coding mRNAs, ‘NR’) were downloaded
from NCBI (http://www.ncbi.nlm.nih.gov/RefSeq) and
mapped to the human genome as previously described
(14). Known SNPs were also downloaded from the
UCSC Genome Browser (dbSNP version 130) and
loaded into a MySQL database.
gDNA paired-end sequencing and mapping
A total of 5 µg of genomic DNA was randomly sheared using a Bioruptor according to the manufacturer’s instructions. The fragmented DNA was end-repaired using Klenow and T4 DNA polymerases and phosphorylated at the 5′-end with T4 polynucleotide kinase. A 3′ overhang was created using 3′–5′ exonuclease-deficient Klenow fragment and Illumina paired-end adaptor oligonucleotides were ligated to the created sticky ends. DNA fragments of 200 bp were size selected in 8% polyacrylamide gels and eluted from the gel overnight. Size-selected DNA was amplified by polymerase chain reaction (PCR) for 18 cycles to enrich for adapter-modified DNA fragments. A paired-end flow cell was prepared on the supplied cluster station according to the manufacturer’s protocol. Clusters of PCR colonies were then sequenced on the Illumina GAII sequencing platform using the
recommended protocols. Images from the instrument were processed to generate sequence files in FASTQ format. Sequences were aligned to the human genome reference sequence (NCBI build 36.1/hg18) using Bowtie version 0.12.2 (15), allowing two mismatches in mapping. Duplicated read pairs with identical coordinates were merged, and only unambiguously mapped reads were used for the structural variation analysis.
Exome capture, sequencing and mapping
A total of 34.1 Mb of the human genome corresponding to approximately 180 000 coding exons (28.4 Mb) and adjacent intergenic and intronic regions (5.7 Mb) was captured using the Nimblegen Sequence Capture 2.1 M Human Exome array v1.0. Briefly, genomic DNA samples were fragmented, ligated to adapters and hybridized to the exome capture array. Unbound fragments were washed away and the target-enriched DNA was eluted. Enriched samples were amplified by Ligation-Mediated PCR (LM-PCR). The adapters used during enrichment were designed for direct integration into the workflow of the 454 GS FLX instrument, eliminating the library construction process. We continued the protocol from the DNA library quantification and emulsion PCR steps according to the manufacturer’s protocol. Sequencing was performed using the 454 GS FLX Titanium platform. Sequences were aligned to the human genome reference sequence (NCBI build 36.1/hg18) using the BLAT program (parameters: -noTrimA -tileSize=12 -minScore=200 -minIdentity=92 -out=pslx) (16). Duplicated reads with identical coordinates were merged, and only unambiguously mapped reads were used for SNV calling and mutation detection.
Data deposition and availability
Quality scores and FASTA sequences generated for
HCC1954 and HCC1954BL were uploaded to the Short
Read Archive under the accession numbers ERA010917
and ERA011762 and are publicly available.
Structural variation analysis
Illumina paired-end reads that failed to align to the human genome reference sequence in the expected orientation or at the expected distance were used in the structural variation analysis after removing those that mapped to highly repetitive regions within 1 Mb of a centromeric or telomeric region. Reads for which the two ends aligned within the expected distance, but with one of the two ends in the incorrect orientation, were also excluded from the analysis because these reads are likely to be artifacts generated by mispriming of the Illumina sequencing oligonucleotide or by intramolecular rearrangements generated during library amplification. Structural variants were called from Bowtie alignments of genomic paired-end sequences, requiring at least five independent read pairs in HCC1954 and no read pairs representing the rearrangement in HCC1954BL, and vice versa. Interchromosomal rearrangements were called when each read from the same pair was mapped uniquely, but to distinct chromosomes. Intrachromosomal deletions were called when the read pairs mapped in the expected order and orientation, but at a distance greater than expected (average + 4 × SD). Intrachromosomal tandem duplications were called when read pairs mapped in the expected orientation, but in an unexpected order and at an unexpected distance. Intrachromosomal inversions were called when read pairs mapped in the expected order, but in an unexpected orientation and at an unexpected distance (for a review, see Ref. 17). Structural variants represented by read pairs that mapped to within 500 bp of a previously identified copy number polymorphism were removed from the final list of somatic chromosomal rearrangements.
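The calling rules described in this section can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the `ReadPair` record, the category labels and the strand convention (forward read upstream of reverse read for a concordant pair) are assumptions introduced for illustration, and clustering of read pairs into events is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPair:
    """Hypothetical record for one uniquely mapped read pair."""
    chrom1: str
    pos1: int
    strand1: str  # "+" or "-"
    chrom2: str
    pos2: int
    strand2: str


def classify_pair(pair, mean_insert, sd_insert, n_sd=4.0):
    """Assign one read pair to a category following the rules in the text."""
    if pair.chrom1 != pair.chrom2:
        return "interchromosomal"
    # Order the two mates by genomic coordinate.
    left, right = sorted([(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)])
    # "Greater than expected" is taken as average + 4 * SD of the insert size.
    too_far = (right[0] - left[0]) > mean_insert + n_sd * sd_insert
    if left[1] == right[1]:
        # Both mates on the same strand: unexpected orientation.
        # Within the expected distance this is treated as a library artifact.
        return "inversion" if too_far else "excluded_artifact"
    if left[1] == "+":
        # Expected order and orientation (assumed forward-reverse layout).
        return "deletion" if too_far else "concordant"
    # Opposite strands but mates in swapped order.
    return "tandem_duplication" if too_far else "unclassified"


def is_somatic_sv(pairs_in_line, pairs_in_other_line, min_pairs=5):
    """At least five supporting pairs in one cell line and none in the other."""
    return pairs_in_line >= min_pairs and pairs_in_other_line == 0
```

In this sketch the distance is measured between the mapped start positions of the two mates, which is a simplification of an insert-size calculation.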
SNVs and point mutation analysis
SNVs and somatic point mutations were independently called for each cell line from BLAT alignments of capture sequences to the human reference genome. SNVs supported by at least three reads with base quality ≥20 were called for the HCC1954 and HCC1954BL genomes. To assess the sensitivity of our SNV detection strategy, we compared our SNV calls to SNV calls extracted from genotyping array data available for both cell lines (GEO: GSE12019 and GSE13373), as previously described (5). SNVs common to both genomes and/or already described in dbSNP were excluded from the somatic point mutation analysis, since they likely correspond to inherited sequence variants. Somatic point mutations were then identified, requiring at least three high-quality independent reads (≥3 reads with Phred score ≥20 and ≥1 read with Phred score ≥30) reporting the variant in HCC1954 and no reads reporting the variant in HCC1954BL, and vice versa. Variant reads were required to represent at least 20% of the total number of reads covering the variant genomic position to filter out somatic mutations that might have arisen during in vitro propagation of the cell lines and be present in only a small subpopulation of the cells. A depth of at least 5× was required to assure sufficient coverage in both cell lines. We also excluded from the point mutation analysis false mutation calls residing in regions of loss-of-heterozygosity (LOH) in either of the cell lines. Briefly, the zygosity of approximately 250 000 known SNPs represented in the Affymetrix SNP array was determined for both cell lines using hybridization data available in public databases (GEO: GSE12019 and GSE13373). LOH regions (where SNPs were heterozygous in the normal but homozygous in the tumor and vice versa) were identified using Hidden Markov Model algorithms as previously described (7,8). Results were manually inspected and exome sequence data were used to confirm the SNP array analysis and to increase resolution in regions of low SNP density represented in the Affymetrix SNP array.
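The somatic point mutation filters listed above can be summarized in a short Python sketch. This is an illustration only: the `SiteEvidence` record and the function name are hypothetical, the Phred ≥30 read is assumed to count among the three high-quality reads, and the 20% variant fraction is assumed to be computed from all variant-reporting reads.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SiteEvidence:
    """Hypothetical per-position summary for one cell line."""
    variant_quals: List[int]  # Phred base qualities of reads reporting the variant
    depth: int                # total number of reads covering the position


def is_somatic_candidate(target, other, in_dbsnp=False, in_loh=False,
                         min_depth=5, min_fraction=0.20):
    """Apply the filters from the text to a variant seen in `target` only."""
    # Known SNPs and calls inside LOH regions are excluded.
    if in_dbsnp or in_loh:
        return False
    # Both cell lines must be covered at least 5x at this position.
    if target.depth < min_depth or other.depth < min_depth:
        return False
    # No read may report the variant in the other cell line.
    if other.variant_quals:
        return False
    # At least three reads with Phred >= 20, one of them with Phred >= 30.
    high_quality = [q for q in target.variant_quals if q >= 20]
    if len(high_quality) < 3 or not any(q >= 30 for q in high_quality):
        return False
    # Variant reads must be at least 20% of the reads covering the position.
    return len(target.variant_quals) / target.depth >= min_fraction
```

The same function would be applied in both directions, once with HCC1954 as `target` and once with HCC1954BL as `target`.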
Ratio of non-synonymous to synonymous substitutions
Monte Carlo simulations were used to determine if the ratios of non-synonymous to synonymous substitutions (dN/dS) observed for HCC1954 and HCC1954BL were significantly different from those expected by chance (null hypothesis). In these simulations, values expected by chance were obtained from 1000 random sets of 64 mutations for HCC1954 and 30 mutations for HCC1954BL occurring in the coding region of known human RefSeq genes. To calculate if the difference between the dN/dS ratios observed for HCC1954 and HCC1954BL was significant, we performed a χ² test using the HCC1954BL dN/dS values as expected values for HCC1954.
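One way to carry out a χ² test of this kind is sketched below in Python. It is a goodness-of-fit test with one degree of freedom in which the proportion of non-synonymous substitutions in a reference sample defines the expected counts. The function name and the choice of input counts are assumptions; with the coding counts from Table 2 (47 non-synonymous and 17 synonymous for HCC1954, 18 and 12 for HCC1954BL) the statistic comes out close to, though not identical to, the value reported in the Results, so the exact inputs used for the published figure should be treated as unknown here.

```python
import math


def chi_square_vs_reference(obs_nonsyn, obs_syn, ref_nonsyn, ref_syn):
    """Chi-square goodness-of-fit (df = 1) of observed counts against a reference ratio."""
    total = obs_nonsyn + obs_syn
    # Expected counts if the observed sample had the reference proportions.
    ref_fraction = ref_nonsyn / (ref_nonsyn + ref_syn)
    exp_nonsyn = total * ref_fraction
    exp_syn = total - exp_nonsyn
    chi2 = ((obs_nonsyn - exp_nonsyn) ** 2 / exp_nonsyn
            + (obs_syn - exp_syn) ** 2 / exp_syn)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Coding counts taken from Table 2 (an assumption about the test inputs).
chi2, p_value = chi_square_vs_reference(47, 17, 18, 12)
print(f"dN/dS HCC1954 = {47 / 17:.2f}, HCC1954BL = {18 / 12:.2f}")
print(f"chi2 = {chi2:.2f}, P = {p_value:.3f}")
```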
PCR and Sanger sequencing validation
Primers for validation were designed to target regions immediately flanking point mutations using Primer3 software (http://frodo.wi.mit.ed/primer3/). PCR amplification was performed on HCC1954 and HCC1954BL genomic DNA using Taq Platinum Hi-Fidelity Polymerase (Invitrogen), following standard protocols. PCR product purity and size were assessed on 2% agarose gels stained with GelRed (Biotium). Sanger sequencing was performed using an ABI3130 Capillary DNA Analyzer. Sequence trace files were manually analyzed for point mutations. All confirmations were performed using both HCC1954 and HCC1954BL DNA to determine whether the variants were somatic or germline.
Validated non-synonymous mutations were classified as non-conservative if they resulted in changes in amino acid charge and polarity. Functional protein domains were determined using Pfam and the HMMER package (E < 0.001). A Perl script was used to cross-reference the positions of non-synonymous mutations and functional domains. Evolutionarily conserved amino acid residues were obtained from the UCSC Genome Browser Conservation Track (http://genome.ucsc.edu). Amino acid conservation between 10 different species, Pan troglodytes (chimp), Macaca mulatta (rhesus), Mus musculus (mouse), Rattus norvegicus (rat), Canis lupus familiaris (dog), Loxodonta africana (elephant), Monodelphis domestica (opossum), Gallus gallus (chicken) and Takifugu rubripes (fugu), was manually determined.
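As an illustration of how a non-conservative change might be flagged, the Python sketch below groups the 20 standard amino acids into four side-chain classes and calls a substitution non-conservative when the class changes. The grouping is a common textbook scheme assumed here for illustration; the exact charge and polarity criteria used in this study are not specified beyond the sentence above, and nonsense changes are not handled.

```python
# Side-chain classes at physiological pH (assumed textbook grouping).
AA_CLASS = {
    **dict.fromkeys("GAVLIMFWP", "nonpolar"),
    **dict.fromkeys("STCYNQ", "polar_uncharged"),
    **dict.fromkeys("KRH", "positive"),
    **dict.fromkeys("DE", "negative"),
}


def is_non_conservative(ref_aa, alt_aa):
    """True if a missense change moves between side-chain classes.

    Both arguments are one-letter amino acid codes.
    """
    return AA_CLASS[ref_aa.upper()] != AA_CLASS[alt_aa.upper()]
```

For example, a glutamate to lysine change (E to K) switches from a negatively to a positively charged side chain and would be flagged, whereas leucine to isoleucine would not.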
KEGG and protein–protein interactions analysis
The list of genes carrying non-synonymous mutations validated by Sanger sequencing for each cell line was uploaded to the Kyoto Encyclopedia of Genes and Genomes (KEGG) database for the pathway enrichment analysis. Genes carrying non-synonymous mutations were used since these mutations are more likely to have a functional impact than synonymous mutations. Monte Carlo simulations were performed in which 1000 random sets of 45 genes and 12 genes were evaluated regarding KEGG pathway enrichment to address whether the obtained results were significantly different (P < 0.05) from those expected by chance. Simulations were performed using all known coding genes and KEGG pathways (200 annotated pathways).
For the protein–protein interaction (PPI) analysis, data from the following databases were merged: MINT (December 2007 version), BIOGRID (2.0.37), INTACT (January 2008 version), HPRD (September 2007 version), BIND (May 2006 version) and DIP (January 2008 version). These databases include both high-throughput experiments and low-throughput ones curated from the literature. When a field was present for the technique used to discover the interaction, records with an entry that referred to one of the several mass spectrometry-based methods were excluded to avoid including indirect interactions. Exclusively functional interactions, for example, the numerous exclusively genetic interactions present in BIOGRID, were also excluded. Only genes carrying non-synonymous mutations validated by Sanger sequencing were used in the PPI analysis. Monte Carlo simulations were performed in which 10 000 random sets of 25 and 8 genes with known PPI were evaluated regarding the degree of interaction and the number of common interaction partners among mutated proteins to address whether the obtained results were significantly different (P < 0.05) from those expected by chance.
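The Monte Carlo comparison against random gene sets can be sketched in Python as follows. This is a simplified illustration and not the code used in this study: the test statistic (mean interaction degree of the gene set), the `degree` dictionary and the function names are assumptions, and the analysis of common interaction partners is not shown.

```python
import random
from typing import Dict, Sequence


def mean_degree(genes: Sequence[str], degree: Dict[str, int]) -> float:
    """Average number of interaction partners over a gene set."""
    return sum(degree[g] for g in genes) / len(genes)


def degree_enrichment_p(mutated: Sequence[str], degree: Dict[str, int],
                        n_iter: int = 10_000, seed: int = 0) -> float:
    """Empirical P-value: fraction of random gene sets of the same size
    whose mean degree is at least as high as that of the mutated set."""
    rng = random.Random(seed)
    universe = sorted(degree)  # all genes with known interactions
    observed = mean_degree(mutated, degree)
    hits = 0
    for _ in range(n_iter):
        sample = rng.sample(universe, len(mutated))
        if mean_degree(sample, degree) >= observed:
            hits += 1
    return hits / n_iter
```

With 10 000 iterations the smallest non-zero P-value this estimator can return is 0.0001, so an outcome with no random set reaching the observed value would be reported as P < 0.0001.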
RESULTS
Sequencing strategy and genome coverage
A combined sequencing strategy was used to characterize chromosomal rearrangements and point mutations in protein-coding regions of the HCC1954 and HCC1954BL genomes (Figure 1). Approximately 381 million and 347 million 35–75 bp paired-end reads from 200 bp DNA inserts were generated for HCC1954 and HCC1954BL, respectively, using an Illumina GAII sequencing platform (Figure 1 and Table 1). For HCC1954, 254 million paired-end reads were unambiguously mapped to the reference genome, which, based on the average insert size of 200 bp (Supplementary Figure S1), generated 8× physical coverage. Similar numbers were generated for HCC1954BL (237 million mapped paired-end reads and 8× physical coverage). Paired-end reads that aligned discordantly with respect to each other on the reference genome were then used to detect chromosomal rearrangements in the HCC1954 and HCC1954BL genomes (Figure 1).
Figure 1. Sequencing strategy. Outline of the sequencing strategy and bioinformatics algorithms used for the identification of point mutations and structural chromosomal rearrangements in the HCC1954 and HCC1954BL genomes.
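The physical coverage figures quoted above can be checked with a short calculation, sketched here in Python. Two assumptions are made that are not stated in the text: the mapped read counts in Table 1 are taken to count individual reads (two per pair), and the genome size is taken as roughly 3.1 Gb. Under those assumptions the arithmetic gives values near 8× for both cell lines.

```python
def physical_coverage(mapped_reads, insert_size_bp, genome_size_bp, reads_per_pair=2):
    """Physical coverage = number of mapped pairs x insert size / genome size."""
    pairs = mapped_reads / reads_per_pair
    return pairs * insert_size_bp / genome_size_bp


GENOME_SIZE_BP = 3.1e9  # assumed approximate size of the human reference genome

# Mapped paired-end read counts from Table 1; 200 bp average insert size.
cov_tumor = physical_coverage(254_326_859, 200, GENOME_SIZE_BP)
cov_lcl = physical_coverage(237_886_727, 200, GENOME_SIZE_BP)
print(f"HCC1954: {cov_tumor:.1f}x, HCC1954BL: {cov_lcl:.1f}x")
```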
In addition, the NimbleGen Sequence Capture Human Exome Array was used to capture 34.1 Mb of the human genome corresponding to approximately 180 000 coding exons (28.4 Mb) and adjacent regions (5.7 Mb). Approximately 2.3 Gb of unambiguously mapped sequences were generated for each cell line and over 32% of the bases mapped to the targeted regions (Figure 1 and Table 1). The average fold-coverage was 22.8 for HCC1954 and 21.7 for HCC1954BL (Supplementary Figure S2). Captured sequences mapped to the reference genome were then used for the detection of SNVs and somatic point mutations in protein-coding exons and adjacent regions from both genomes (Figure 1). For HCC1954, 95.2% of targeted bases were covered at least once and 92% met our criteria for SNV calling. Similar numbers were obtained for HCC1954BL (95.7% of targeted bases covered at least once and 93.8% met our criteria for variant calling).
Chromosomal rearrangements
We used a paired-end sequencing strategy to identify structural chromosome variations present in the HCC1954 and HCC1954BL genomes (18). Paired-end reads that aligned discordantly with respect to each other on the reference human genome were identified for HCC1954 (76 407 paired-end reads) and HCC1954BL (55 967 paired-end reads). From this set of aligned reads, we excluded those that precisely duplicated other sequences derived from the same library, those that could not be unambiguously mapped to the human genome, and those that mapped within 1 Mb of a telomeric or centromeric sequence gap. Structural variants were called from the remaining paired-end sequences, requiring at least five independent read pairs exclusively reporting the variant in one of the cell lines. Structural variants represented by read pairs that mapped to within 500 bp of a known copy number polymorphism were removed from the final list of somatic chromosomal rearrangements (Figure 1).
A total of 94 structural rearrangements were detected in HCC1954, including 49 interchromosomal events, 30 deletions, 11 inversions and 4 duplications (Table 2 and Figure 2). Of the 49 interchromosomal events, 38 affected genic regions and 22 had already been described for HCC1954 (10,19). In contrast, no interchromosomal rearrangements were detected in the HCC1954BL genome and all four intrachromosomal rearrangements, including two deletions and two inversions, were located in intergenic or intronic regions (Table 2 and Figure 2).
Variant calling and somatic point mutations
Captured sequences (on-target and off-target) mapped to the reference human genome were used for variant calling and point mutation detection in coding regions and adjacent sequences. A total of 82 355 and 83 474 SNVs supported by at least three reads with base quality ≥20 were called for the HCC1954 and HCC1954BL genomes, respectively (Table 3). As expected, most of the SNVs were common to both genomes and the majority (92%) of these inherited variants have already been described in dbSNP. The rate of novel variant discovery (8%) for this individual is consistent with other published whole human genome sequences (20–24).

Table 1. Summary of sequence generation and mapping to the reference human genome sequence for the HCC1954 and HCC1954BL cell lines

                                  HCC1954                            HCC1954BL
                                  Capture          Paired-end        Capture          Paired-end
                                  sequencing       sequencing        sequencing       sequencing
Total number of reads             5 996 389        381 274 888       6 265 250        347 891 568
Mapped reads                      5 212 428        254 326 859       5 106 763        237 886 727
Percentage of mapped reads        86.9             66.7              81.5             68.4
Total number of nucleotides       3 143 589 263    19 392 752 128    3 252 428 887    15 693 171 704
Mapped nucleotides                2 257 027 363    13 432 965 012    2 175 120 803    11 166 288 816
Percentage of mapped nucleotides  71.8             69.3              66.7             71.1

Table 2. Somatic point mutations and structural variations in the HCC1954 and HCC1954BL genomes

Somatic variations          HCC1954, N (%)    HCC1954BL, N (%)
Point mutations             274 (100)         173 (100)
  Coding                    64 (23.36)        30 (17.3)
    Nonsense                2 (0.73)          3 (1.7)
    Missense                45 (16.42)        15 (8.7)
    Synonymous              17 (6.20)         12 (6.9)
  Non-coding                14 (5.11)         15 (8.7)
    UTR                     13 (4.74)         13 (7.5)
    ncRNA                   1 (0.36)          2 (1.2)
    miRNA                   0 (0)             0 (0)
  Intronic                  179 (65.33)       114 (65.9)
    Splice site             0 (0)             0 (0)
    Other intronic          179 (65.33)       114 (65.9)
  Intergenic                17 (6.20)         14 (8.1)
Structural variations       94 (100)          4 (100)
  Interchromosomal          49 (52.1)         0 (0)
  Intrachromosomal          45 (47.9)         4 (100)
    Deletions               30 (31.9)         2 (50.0)
    Inversions              11 (11.7)         2 (50.0)
    Duplications            4 (4.3)           0 (0)
UTR = untranslated region, ncRNA = non-coding RNA.

Figure 2. Circos plot representing somatic point mutations and structural variations in the (A) HCC1954 and (B) HCC1954BL genomes. Chromosome representations are shown around the outer ring and are oriented in a clockwise direction. Other tracks contain (from outside to inside) point mutations as dots (non-synonymous labeled in black and synonymous labeled in red), physical coverage of the genome by paired-end reads in green, interchromosomal rearrangements represented by colored lines linking two chromosomes (different colors representing interchromosomal rearrangements are determined by the first chromosome in the circos in the clockwise direction starting with chromosome 1), intrachromosomal deletions as blue lines, inversions as black lines and duplications as gray lines.
To assess the coverage depth and sensitivity of our SNV calling strategy, we compared our SNV calls to SNV calls extracted from genotyping array data available for both cell lines (GEO: GSE12019 and GSE13373), as previously described (5). In total, 93.7% and 97.8% of the heterozygous array calls located within the captured regions were sequenced at least once for HCC1954 and HCC1954BL, respectively, and 80.8% and 83.3% of the heterozygous array calls for HCC1954 and HCC1954BL, respectively, were correctly identified by our sequencing strategy and SNV calling criteria. The difference in SNV calling efficiency between the two cell lines is not statistically significant (P = 0.69, χ² = 0.16, df = 1), and together these results demonstrate that both genomes were sufficiently and equally covered for SNV calling and mutation detection.
To identify SNVs that are specific to each cell line and likely to correspond to somatic point mutations occurring in each genome, we excluded those already described as known SNPs in dbSNP and established a set of stringent filtering criteria (Figure 1). Briefly, variants had to be represented by at least three high-quality reads from one cell line and no reads from the other. Variant reads were required to correspond to at least 20% of the total number of reads covering the variant position to eliminate point mutations that might have arisen during in vitro culturing, and a depth of at least 5× for the variant base was also required to assure sufficient coverage in both cell lines. Variants for HCC1954BL were further filtered to remove those residing in regions of LOH in HCC1954 (see ‘Materials and Methods’ section).
A total of 274 point mutations were predicted in the
HCC1954 genome of which 64 (23.4%) occurred in
protein-coding regions, 14 (5.1%) in non-coding regions,
179 (65.3%) in intronic regions and 17 (6.2%) in
intergenic regions (Figure 2 and Table 2). Of the 64
point mutations occurring in coding regions, 47 (73.4%)
were predicted to cause amino acid changes
(non-synonymous), including 45 that were missense and
two that were nonsense.
The same set of criteria was used to predict mutations
present in the HCC1954BL genome. A total of 173 point
mutations were predicted for HCC1954BL, of which 30
(17.3%) occurred in protein-coding regions, 15 (8.7%) in
non-coding regions, 114 (65.9%) in intronic regions and
14 (8.1%) in intergenic regions (Figure 2 and Table 2). Of
the 30 point mutations occurring in coding regions, 18
(60%) were non-synonymous, including 15 that were
missense and 3 that were nonsense.
Non-synonymous substitutions and mutation spectrum
Since some of the mutations present in the HCC1954
genome could have arisen during normal breast develop-
ment as a result of the normal mutational process, we next
sought to determine if the set of mutations occurring in
the HCC1954 genome is enriched for driver mutations
and/or occur in genes that are functionally related to the
tumorigenesis. To address this issue, we first analyzed the
ratio of non-synonymous to synonymous substitutions
(dNs/dS) in the protein-coding regions of both genomes.
The dNs/dS ratio is commonly used to estimate the degree
of selection on non-synonymous changes, assuming that
most synonymous mutations are biologically neutral. This
ratio for HCC1954 is 2.8 (P= 0.37, Monte Carlo simula-
tion) and for HCC1954BL is 1.5 (P= 0.18, Monte Carlo
simulation), not significantly different from that expected
by chance. However, it is notable that the difference in the
non-synonymous/synonymous ratio between the cell lines
is statistically significant (P= 0.031;
2
= 4.68; df = 1),
indicating that non-synonymous mutations are more
frequent in HCC1954 than in HCC1954BL. We also
investigated the type of nucleotide changes present in
HCC1954 and HCC1954BL genomes. Both genomes pre-
sented a similar spectrum of nucleotide substitutions, with
a predominance of transitions, which change a purine to
purine (A↔G) or pyrimidine to pyrimidine (C↔T)
(Figure 3).
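As a rough check on the ratios quoted above, the following Python sketch recomputes them from the coding-region counts given in the text (64 coding mutations with 47 non-synonymous for HCC1954; 30 and 18 for HCC1954BL) and classifies a single substitution as a transition or a transversion. It assumes the reported values are plain count ratios without per-site normalization, which is an assumption made here rather than a stated method, and the function names are illustrative only.

```python
# Coding-region counts as reported in the text above.
CODING = {
    "HCC1954": {"coding": 64, "non_synonymous": 47},
    "HCC1954BL": {"coding": 30, "non_synonymous": 18},
}

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def nonsyn_to_syn_ratio(coding: int, non_synonymous: int) -> float:
    """Ratio of non-synonymous to synonymous coding substitutions (raw counts)."""
    synonymous = coding - non_synonymous
    return non_synonymous / synonymous


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


if __name__ == "__main__":
    for name, counts in CODING.items():
        print(name, round(nonsyn_to_syn_ratio(**counts), 1))
    print(is_transition("A", "G"), is_transition("G", "T"))
```

With these counts the sketch gives 2.8 for HCC1954 and 1.5 for HCC1954BL after rounding, in line with the values reported in the text.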
Sanger validation and KEGG analysis
To better characterize the genetic changes that drive
tumorigenesis, we validated by capillary sequencing all
non-synonymous point mutations present in the
HCC1954 and HCC1954BL genomes. Of the 47
non-synonymous mutations present in the HCC1954
genome, 33 (70.2%) have already been described in the
literature and 12 out of the 14 (85.7%) novel non-
synonymous mutations were confirmed by capillary
sequencing (Supplementary Table S1). Of the 18 non-
synonymous point mutations predicted for HCC1954BL,
12 (66.6%) were also detected by capillary sequencing
(Supplementary Table S2). Of the 45 non-synonymous
mutations identified for HCC1954, 29 (64.4%) result in
non-conservative amino acid changes, 19 (42.2%) occur
within functional protein domains and 42 (93.3%) occur
in evolutionarily conserved amino acid residues.
For HCC1954BL, 8 (66.6%) of the 12 non-synonymous
mutations result in non-conservative amino acid changes,
6 (50%) occur within functional protein domains and 11
(91.66%) occur in evolutionarily conserved amino acid
residues.
Table 3. Single nucleotide variations identified in the HCC1954 and
HCC1954BL genomes
                HCC1954             HCC1954BL
                N (% in dbSNP)      N (% in dbSNP)
Substitutions   82 355 (92.68)      83 474 (93.60)
  Coding        11 717 (90.92)      12 373 (93.84)
  Intronic      60 314 (92.53)      61 428 (93.77)
  UTR           3419 (92.57)        3570 (94.04)
  ncRNA         256 (96.87)         260 (96.92)
  Intergenic    6649 (91.84)        5843 (90.86)
Indels          689 (52.10)         587 (52.81)
  Coding        38 (50.00)          31 (51.61)
  Intronic      595 (52.43)         506 (54.15)
  UTR           30 (46.66)          26 (42.30)
  ncRNA         1 (100.00)          1 (0.00)
  Intergenic    25 (52.00)          23 (39.13)
UTR = untranslated region, ncRNA = non-coding RNA
6062 Nucleic Acids Research, 2011, Vol. 39, No. 14
We next compared the sets of genes with validated
non-synonymous mutations present in the two genomes.
First, we determined whether these sets were enriched for
specific signaling pathways related to tumorigenesis.
According to KEGG (25), the set of mutated genes in
HCC1954 was significantly enriched for genes related to
apoptosis (P= 0.017, Monte Carlo Simulation), MAPK
signaling (P= 0.023, Monte Carlo Simulation), axon
guidance and cell migration (P= 0.033, Monte Carlo
Simulation) and main pathways altered in cancer
(P= 0.037, Monte Carlo Simulation). In contrast, no en-
richment for any specific pathway related to cancer was
observed for the set of genes mutated in HCC1954BL
(Table 4).
Table 4. KEGG pathway analysis for genes with validated non-synonymous mutations present in the HCC1954 and
HCC1954BL genomes
KEGG ID   KEGG annotation                     Number of genes in the pathway  Gene name             P-value
HCC1954
hsa05222  Small cell lung cancer              3                               ITGA6, TP53, TRAF2    0.0003
hsa05410  Hypertrophic cardiomyopathy         2                               ITGA6, MYH7           0.0167
hsa04210  Apoptosis                           2                               TP53, TRAF2           0.0169
hsa05414  Dilated cardiomyopathy              2                               ITGA6, MYH7           0.0191
hsa04010  MAPK signaling pathway              3                               ARRB1, TP53, TRAF2    0.0237
hsa00770  Pantothenate and CoA biosynthesis   1                               DPYD                  0.0325
hsa04360  Axon guidance                       2                               CFL2, SEMA3A          0.0335
hsa04614  Renin-angiotensin system            1                               LNPEP                 0.0372
hsa05200  Pathways in cancer                  3                               ITGA6, TP53, TRAF2    0.0375
HCC1954BL
hsa03440  Homologous recombination            1                               EME1                  0.0234
hsa00310  Lysine degradation                  1                               SETD2                 0.0382
hsa04740  Olfactory transduction              2                               OR51E2, OR2D2         0.0421
Figure 3. Spectrum of nucleotide substitutions in the HCC1954 and HCC1954BL genomes. Frequency of point mutations in each of the six possible
nucleotide substitution classes (A >C|T >G, A >G|T >C, A >T|T >A, G >A|C >T, G >C|C >G, G >T|C >A) observed in the HCC1954 (blue)
and HCC1954BL (orange) genomes.
Protein–protein interaction and functional networks
We also examined known PPI to investigate the organization
of mutated genes into functional networks. We first
determined for each genome the percentage of genes with
validated non-synonymous mutations that had at least
one known protein interaction with any other protein.
Similar percentages were obtained for both the
HCC1954 (55.5%, 25/45) and HCC1954BL (66.7%,
8/12) cell lines, indicating that there is no difference in
terms of representation of the mutated genes for each cell
line in this PPI database (P = 0.729, χ² = 0.12, df = 1).
We then analyzed the average number of interactions
for each mutated protein because proteins with larger
numbers of interactions have been suggested to serve as
essential hubs of molecular pathways (26). Proteins
mutated in HCC1954 interact with a higher average
number of partners than proteins mutated in
HCC1954BL (avg 33.2 versus 5.1 proteins, Figure 4). To
exclude the possibility that the higher degree of interaction
observed for HCC1954 was due to the larger number of
mutated proteins in this cell line, a Monte Carlo simula-
tion was performed in which 10 000 random sets of 25
proteins with known PPI were evaluated regarding their
degree of interaction. Only 17 out of the 10 000 simulated
sets had a higher average degree than that observed for
mutated proteins in HCC1954 (P= 0.0017, Monte Carlo
simulation), indicating that the number of interactions
observed for proteins mutated in HCC1954 was signifi-
cantly different from that expected by chance. The same
strategy showed that the average degree of interaction for
proteins mutated in HCC1954BL was similar to that
expected by chance (P= 0.875, Monte Carlo
Simulation). To confirm that the differences in the
degree of interaction observed between both cell lines
were not influenced by the smaller number of mutated
proteins in HCC1954BL, a Monte Carlo simulation was
again performed in which 1000 random sets of 5 proteins
from HCC1954 carrying non-synonymous mutations and
presenting known PPI were evaluated regarding the
average number of interactions. The number obtained is
higher than that obtained for the five mutated proteins in
HCC1954BL (34.7 versus 5.1) and again different from
that expected by chance (P= 0.001, Monte Carlo
simulation).
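The Monte Carlo procedure described above can be sketched as follows. This is a minimal illustration and not the code used in the study: the degree table is synthetic, and the function name `monte_carlo_p`, its parameters and the decision to count ties are all assumptions.

```python
import random


def monte_carlo_p(degrees, observed_mean, set_size, n_sets=10_000, seed=0):
    """One-sided empirical P-value for an observed mean interaction degree.

    degrees: dict mapping protein name -> number of known interactions.
    Returns the fraction of random protein sets of size `set_size` whose mean
    degree is at least as high as `observed_mean` (ties are counted, which
    makes the estimate slightly conservative).
    """
    rng = random.Random(seed)
    population = list(degrees.values())
    hits = 0
    for _ in range(n_sets):
        sample = rng.sample(population, set_size)  # sampling without replacement
        if sum(sample) / set_size >= observed_mean:
            hits += 1
    return hits / n_sets


if __name__ == "__main__":
    # Synthetic, heavy-tailed degree table standing in for a real PPI database.
    toy_rng = random.Random(1)
    toy_degrees = {f"protein_{i}": int(toy_rng.paretovariate(1.2)) for i in range(5000)}
    # 25 proteins and a mean of 33.2 interactions are the figures quoted for HCC1954.
    print(monte_carlo_p(toy_degrees, observed_mean=33.2, set_size=25))
```

The same function covers the second control described in the text: passing only the HCC1954 mutated proteins as `degrees`, with `set_size=5` and `n_sets=1000`, mirrors the subsampling used to match the smaller HCC1954BL set.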
Interestingly, genes related to apoptosis (TP53, TRAF2,
SLC25A5), MAPK signaling (TP53, ARRB1, TRAF2),
cell adhesion (ITGA6), cytoskeleton organization
(PCNT, CLIP1) and cell cycle (RFC4, PCNT) were
among the HCC1954 mutated proteins displaying a
higher number of interactions (10 interactions,
Figure 4). To evaluate if the higher degree of interaction
observed for HCC1954 was just a consequence of the
proteins mutated in this cell line belonging to molecular
pathways with higher connectivity, we selected all proteins
from all KEGG pathways containing at least one mutated
protein in HCC1954 and HCC1954BL (e.g. these sets
include all proteins from MAPK pathway for HCC1954
and all proteins from Lysine degradation pathway for
HCC1954BL) and calculated the average number of
interactions for both sets of proteins. The set for HCC1954
contained 2311 proteins with an average of 18.4
interactions per protein. The set for HCC1954BL
contained 395 proteins with an average of 22.1 interactions
per protein. These values are not significantly different
from each other (P = 0.0921, t-test = 1.16875;
df = 503.76), indicating that the higher degree of
interaction observed for HCC1954 is not a direct consequence
of the mutated proteins belonging to pathways with higher
connectivity.
Figure 4. Protein–protein interaction networks for mutated genes in HCC1954 (A) and HCC1954BL (B). Proteins with validated non-synonymous
mutations are represented as red circles and each line represents a confident interaction. Interaction partners with mutated genes are represented in
green if they interact with three mutated proteins or in light blue if they interact with two mutated proteins.
We also looked for common interaction partners among
the mutated proteins because mutated proteins with
common partners would probably have synergistic
tumor-promoting functions (27). Approximately 68%
(17/25) of the mutated proteins in HCC1954 shared at
least one common partner, as opposed to none of the
mutated proteins in HCC1954BL (Figure 4). Simulations
to correct for the difference in the number of mutated
proteins between the two genomes revealed that the number of
mutated proteins sharing a common partner was signifi-
cantly different from that expected by chance for
HCC1954 (P<0.0001, Monte Carlo simulation) but not
for HCC1954BL (P= 0.855, Monte Carlo simulation).
To confirm that the differences in the number of
common interaction partners observed between the two cell
lines were not influenced by the smaller number of
mutated proteins in HCC1954BL, a Monte Carlo simulation
was again performed in which 1000 random sets of 5
proteins from HCC1954 carrying non-synonymous muta-
tions and presenting known PPI were evaluated regarding
the average number of common interaction partners. The
number obtained is higher than that obtained for the five
mutated proteins in HCC1954BL (3.3 versus 0) and again
different from that expected by chance (P= 0.0245,
Monte Carlo simulation).
A total of 64 common partners were identified for
proteins mutated in HCC1954, of which 51, 10 and 3
interact with 2, 3 and 4 mutated proteins in HCC1954,
respectively. Again, the number of common partners
observed for HCC1954 was significantly different from
that expected by chance (P < 0.0001, Monte Carlo
simulation). Key cancer genes such as BRCA1, CDC42,
CHECK1, MDM2, MAP3K1/3 and SMAD2/3 were
among the 64 common interaction partners (Figure 4).
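The common-partner counts reported above can be illustrated with a short Python sketch. It assumes the PPI data are available as a mapping from each protein to the set of its interaction partners; the protein names (`M1`, `X` and so on) are hypothetical placeholders rather than genes from the study, and `common_partners` is an illustrative helper, not the authors' implementation.

```python
from collections import Counter


def common_partners(interactions, mutated):
    """Find interaction partners shared by two or more mutated proteins.

    interactions: dict mapping protein -> set of interaction partners.
    mutated: iterable of mutated protein names.
    Returns (shared, sharing): `shared` maps each common partner to the number
    of mutated proteins it interacts with; `sharing` is the set of mutated
    proteins that have at least one common partner.
    """
    mutated = list(mutated)
    counts = Counter()
    for protein in mutated:
        for partner in interactions.get(protein, set()):
            counts[partner] += 1
    shared = {partner: n for partner, n in counts.items() if n >= 2}
    sharing = {
        protein
        for protein in mutated
        if any(partner in shared for partner in interactions.get(protein, set()))
    }
    return shared, sharing


if __name__ == "__main__":
    # Hypothetical toy network: M1..M4 are mutated proteins, X/Y/Z are partners.
    toy = {"M1": {"X", "Y"}, "M2": {"X"}, "M3": {"X", "Y"}, "M4": {"Z"}}
    shared, sharing = common_partners(toy, ["M1", "M2", "M3", "M4"])
    print(shared, sorted(sharing))
    # Tally of partners by how many mutated proteins they touch,
    # analogous to the 2/3/4 breakdown given in the text.
    print(Counter(shared.values()))
```

Feeding random protein sets of matching size through the same function would give the null distribution against which the observed counts are compared in the text.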
Finally, we also investigated the organization of
mutated proteins into functional networks in other
tumor genomes recently sequenced. Similar patterns of
synergy were observed for the set of genes carrying
non-synonymous point mutations in melanoma, glioblast-
oma, lung and breast-tumor genomes (Table 5). Although
the average number of interactions for mutated proteins
varied significantly between the different tumors analyzed
(8.1–32.5 interactions), the percentage of mutated genes
with common partners was similar among all tumors
analyzed (varying from 41% to 66%) and different from
that expected by chance (Table 5). Among the five tumors
analyzed, the estrogen receptor-positive metastatic lobular
breast cancer was the one presenting the most similar
functional organization to HCC1954 with an average of
32.5 interactions for each mutated protein, 44% of the
mutated proteins with common partners and 28
common partners among the mutated proteins. All these
values are significantly different from that expected by
chance (P= 0.0034, P= 0.0001 and P= 0.0013, respect-
ively, Monte Carlo Simulations).
DISCUSSION
In this study, we were able to identify, for the first time to
our knowledge, the somatically acquired genetic alter-
ations present in the genome of a lymphoblastoid and a
tumor cell derived from the same individual. By also
characterizing the somatic mutations present in the
lymphoblastoid genome, we were able to show how a
non-tumor somatic tissue evolves over the same timescale
and under similar environmental conditions when
compared to a tumor genome, providing important
insights into the normal mutational processes and into
the functional implications of the accumulation of
somatic alterations in a tumor and a matching
lymphoblastoid genome.
A highly complex pattern of chromosomal rearrange-
ments was exclusively observed in the tumor genome with
most of these rearrangements affecting genic regions. In
agreement with these observations, in a previous study in
which exome and transcriptome data from HCC1954 and
HCC1954BL were combined to identify genes with LOH
and allele-specific expression (ASE), large LOH regions,
harboring genes with ASE and known tumor suppressor
characteristics, were only detected in the tumor genome
(28). Together these observations support the concept that
chromosomal instability is a key feature of human cancer
and a driving force of tumorigenesis (29).
Table 5. Protein–protein interaction analysis for genes with non-synonymous mutations in other solid tumors
References                  Tumor type      Number of genes      Number of mutated  Average number of  Number of mutated   Number of common
                                            with non-synonymous  genes with PPI     interactions for   genes with common   partners (P-value)
                                            mutations            information (%)    mutated genes      partner (%)
                                                                                    (P-value)          (P-value)
Pleasance et al. (8)        Lung            90                   50 (56)            11.6 (0.2692)      33 (66) (0.0001)    42 (0.0870)
Pleasance et al. (7)        Melanoma        188                  100 (53)           8.3 (0.8344)       69 (69) (0.0001)    103 (0.3130)
Ding et al. (4)             Breast basal    29                   17 (59)            8.1 (0.2210)       7 (41) (0.0001)     7 (0.0132)
Shah et al. (9)             Breast lobular  32                   16 (50)            32.5 (0.0034)      7 (44) (0.0001)     28 (0.0011)
Clark et al. (3)            GBM             110                  40 (36)            12.9 (0.7269)      18 (45) (0.0001)    13 (0.1896)
Galante et al. (this study) Breast HCC1954  45                   25 (56)            33.2 (0.0017)      17 (68) (0.0001)    64 (0.0001)
Interestingly, the number of somatically acquired point
mutations and the spectrum of nucleotide substitutions
found in the lymphoblastoid genome were comparable
to those present in the tumor genome. It has been
proposed that normal point mutation rates are insufficient
to account for all point mutations observed in tumors and
that tumor cells must acquire a mutator phenotype, which
increases the accumulation of point mutations in the
tumor genome (30). The number of point mutations
present in both genomes analyzed is in agreement with
the estimated spontaneous mutation rate of normal
human cells (2 × 10⁻¹⁰/bp per cell division) (31) and the
difference in the total number of point mutations observed
for the HCC1954 and HCC1954BL genomes (274/
173 = 1.58) is probably due to the higher DNA content
of the tumor cell line rather than to the existence of a
mutator phenotype (30). Both cell lines also presented a
similar spectrum of nucleotide substitutions with a pre-
dominance of transitions. A similar frequency of
G>A|C >T transitions was observed for other recently
reported breast-tumor genomes (4,9) and a comparison to
the spectrum of nucleotide substitutions reported for a
melanoma and a lung cancer cell line indicates that
neither genome under study had signs of the influence of
external mutagens such as tobacco or ultraviolet light
(7,8). Together these results indicate that
endogenous mutagens and replication errors are sufficient
to generate the overall number of point mutations
required to drive tumorigenesis and that tumor cells do
not necessarily need to acquire a mutator phenotype to
increase the accumulation of point mutations in their
genomes.
Although we cannot completely discard the possibility
that some of the somatic point mutations detected in
HCC1954 and HCC1954BL genomes result from in vitro
culturing and EBV transformation (in the case of
HCC1954BL), we do not expect these mutations to be
representative or to influence our observations and
overall conclusions. Both cell lines received similar treat-
ments in terms of the timing of establishment and in vitro
propagation and have not been maintained in culture for a
long period (36 passages), reducing the probability of
introducing mutations during the culturing process. We
have also set up very stringent criteria for somatic point
mutation detection to filter most, if not all, non-clonal
mutations eventually introduced during in vitro culturing
of these cell lines (see ‘Materials and Methods’ section).
Moreover, it has been recently demonstrated that clonal
point mutations rarely arise during in vitro or in vivo ex-
perimental growth of tumor cells (32). Jones et al. have
analyzed 289 mutations present in 18 cell lines or xeno-
grafts, each derived from a different primary tumor.
Over 99% of these mutations were also present in the
primary tumors (29). Finally, several studies have
demonstrated a very strong correlation when DNA from
EBV-transformed lymphoblastoid cell lines was compared
to DNA from the corresponding blood samples (33),
indicating that EBV transformation does not substantially
alter the mutation rate or genetic stability of transformed
cells. Indeed, EBV-transformed lymphoblastoid
cell lines have been widely used as a source of DNA in
genetic screening studies and have also been used as a
source of normal DNA in whole genome studies on the
occurrence of somatic mutations in tumor genomes (7,8).
We also do not expect our observations and conclusions
to be significantly affected by sequencing errors. Since one
of our criteria for point mutation detection was to have at
least 20% of the reads reporting the mutated allele, this
represents an average of four reads considering the
average exome coverage of 22. If we use a false-positive
error rate of 2% for each position per read for the 454
sequencing platform (34), the probability that four reads
carry an error at the same position is 0.02^4 = 1.6 × 10⁻⁷,
so our error rate per position
would be 16 per 100 Mb. Since the exome captured
region comprises 31 Mb, the expected number of false
positive mutations per genome is between 4 and 5 muta-
tions. Nevertheless, it is important to emphasize that we
took a conservative approach and used for all downstream
analysis (KEGG and PPI) only those mutations that were
further validated by Sanger sequencing.
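The false-positive estimate in this paragraph can be reproduced with the short Python sketch below. It is one reading of the arithmetic under simplifying assumptions: read errors are treated as independent, the number of supporting reads is rounded down to four, and no correction is made for the different ways four erroneous reads could be drawn from the covering reads. The function and constant names are illustrative.

```python
MEAN_COVERAGE = 22            # average exome coverage quoted in the text
MIN_VARIANT_FRACTION = 0.20   # at least 20% of reads must report the mutated allele
PER_READ_ERROR = 0.02         # assumed per-position, per-read error rate (2%)
EXOME_BP = 31_000_000         # captured exome size (31 Mb)


def expected_false_positives(coverage, min_fraction, per_read_error, target_bp):
    """Return (supporting_reads, per_position_error, expected_false_calls)."""
    supporting_reads = int(coverage * min_fraction)   # 22 * 0.20 = 4.4, rounded down to 4
    per_position = per_read_error ** supporting_reads  # all supporting reads wrong at once
    return supporting_reads, per_position, per_position * target_bp


if __name__ == "__main__":
    reads, per_pos, total = expected_false_positives(
        MEAN_COVERAGE, MIN_VARIANT_FRACTION, PER_READ_ERROR, EXOME_BP
    )
    print(f"supporting reads: {reads}")
    print(f"errors per 100 Mb: {per_pos * 1e8:.0f}")
    print(f"expected false positives per exome: {total:.2f}")
```

Under these assumptions the sketch yields about 16 errors per 100 Mb and roughly 4.96 expected false positives across the 31 Mb exome, consistent with the range of 4 to 5 given in the text.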
Significant functional differences were observed
between the set of genes mutated in both cell lines,
indicating that mutations in the tumor genome are not
randomly distributed. We showed that non-synonymous
point mutations are more frequently found in the tumor
genome and that they preferentially affect hub-genes in
molecular pathways related to tumorigenesis. Moreover,
by looking for common interaction partners among
mutated proteins, we demonstrated, for the first
time, that mutations in the tumor genome are co-selected
to present synergistic tumor-promoting functions. This
observation was further extended to other individual
tumor genomes sequenced. Similar patterns of synergy
were observed for the set of genes carrying
non-synonymous point mutations in melanoma, glioblast-
oma, lung and breast-tumor genomes.
The functional differences observed between the sets of
genes mutated in the lymphoblastoid and tumor genomes
raise questions regarding the number and strength of the
driver genetic alterations required for tumorigenesis. If a
tumor cell were to require just a small number of ‘strong’
driver alterations, we would not expect to see the striking
functional association of the genes mutated in the tumor
because the majority of the mutations would be passen-
gers. Our results thus support the model in which the
tumor genome has a few ‘strong’ driver mutations and
several ‘weak’ driver mutations that act in a synergistic
fashion to disrupt molecular pathways related to
tumorigenesis (35). Although this model has been proposed in
the literature for some time, to our knowledge, this is the
first time the existence of strong and weak driver muta-
tions is evidenced by comparing genome-wide mutation
data generated from tumor and non-tumor tissues
derived from the same individual. We would expect
these ‘weak’ drivers to be infrequently mutated and insuf-
ficient to form tumors in the absence of the strong drivers.
Distinguishing ‘weak’ drivers from passenger mutations
will require a systems biology approach to dissect the in-
dividual roles of mutated genes and examine their inter-
actions within the networks and pathways related to
tumorigenesis. Moreover, this approach will require the
analysis of large numbers of normal genome sequences
to identify the permutations of mutated genes that do not
result in tumorigenesis.
ACCESSION NUMBERS
SRA, ERA010917, ERA011762.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
ACKNOWLEDGEMENTS
The authors acknowledge Daniel Ohara and Jose
Eduardo Kroll for technical support.
FUNDING
The Ludwig Institute for Cancer Research; The Conrad N
Hilton Foundation; Conselho Nacional de
Desenvolvimento Científico e Tecnológico (CNPq);
Fogarty International Center [D43TW007015]—National
Institutes of Health (to P.A.F.G., S.J.S., L.O.). Funding
for open access charge: Ludwig Institute for Cancer
Research.
Conflict of interest statement. None declared.
REFERENCES
1. Futreal,P.A., Coin,L., Marshall,M., Down,T., Hubbard,T.,
Wooster,R., Rahman,N. and Stratton,M.R. (2004) A census
of human cancer genes. Nat. Rev. Cancer,4, 177–183.
2. Greenman,C., Stephens,P., Smith,R., Dalgliesh,G.L., Hunter,C.,
Bignell,G., Davies,H., Teague,J., Butler,A., Stevens,C. et al.
(2007) Patterns of somatic mutation in human cancer genomes.
Nature,446, 153–158.
3. Clark,M.J., Homer,N., O’Connor,B.D., Chen,Z., Eskin,A.,
Lee,H., Merriman,B. and Nelson,S.F. (2010) U87MG decoded:
the genomic sequence of a cytogenetically aberrant human
cancer cell line. PLoS Genet.,6, e1000832.
4. Ding,L., Ellis,M.J., Li,S., Larson,D.E., Chen,K., Wallis,J.W.,
Harris,C.C., McLellan,M.D., Fulton,R.S., Fulton,L.L. et al.
(2010) Genome remodelling in a basal-like breast cancer
metastasis and xenograft. Nature,464, 999–1005.
5. Ley,T.J., Mardis,E.R., Ding,L., Fulton,B., McLellan,M.D.,
Chen,K., Dooling,D., Dunford-Shore,B.H., McGrath,S.,
Hickenbotham,M. et al. (2008) DNA sequencing of a
cytogenetically normal acute myeloid leukaemia genome.
Nature,456, 66–72.
6. Mardis,E.R., Ding,L., Dooling,D.J., Larson,D.E., McLellan,M.D.,
Chen,K., Koboldt,D.C., Fulton,R.S., Delehaunty,K.D.,
McGrath,S.D. et al. (2009) Recurring mutations found by
sequencing an acute myeloid leukemia genome. N. Engl. J. Med.,
361, 1058–1066.
7. Pleasance,E.D., Cheetham,R.K., Stephens,P.J., McBride,D.J.,
Humphray,S.J., Greenman,C.D., Varela,I., Lin,M.L.,
Ordonez,G.R., Bignell,G.R. et al. (2010) A comprehensive
catalogue of somatic mutations from a human cancer genome.
Nature,463, 191–196.
8. Pleasance,E.D., Stephens,P.J., O’Meara,S., McBride,D.J.,
Meynert,A., Jones,D., Lin,M.L., Beare,D., Lau,K.W.,
Greenman,C. et al. (2010) A small-cell lung cancer genome with
complex signatures of tobacco exposure. Nature,463, 184–190.
9. Shah,S.P., Morin,R.D., Khattra,J., Prentice,L., Pugh,T.,
Burleigh,A., Delaney,A., Gelmon,K., Guliany,R., Senz,J. et al.
(2009) Mutational evolution in a lobular breast tumour profiled
at single nucleotide resolution. Nature,461, 809–813.
10. Stephens,P.J., McBride,D.J., Lin,M.L., Varela,I., Pleasance,E.D.,
Simpson,J.T., Stebbings,L.A., Leroy,C., Edkins,S., Mudie,L.J.
et al. (2009) Complex landscapes of somatic rearrangement in
human breast cancer genomes. Nature,462, 1005–1010.
11. Lee,W., Jiang,Z., Liu,J., Haverty,P.M., Guan,Y., Stinson,J.,
Yue,P., Zhang,Y., Pant,K.P., Bhatt,D. et al. (2010) The mutation
spectrum revealed by paired genome sequences from a lung
cancer patient. Nature,465, 473–477.
12. Bignell,G.R., Santarius,T., Pole,J.C., Butler,A.P., Perry,J.,
Pleasance,E., Greenman,C., Menzies,A., Taylor,S., Edkins,S. et al.
(2007) Architectures of somatic genomic rearrangement in human
cancer amplicons at sequence-level resolution. Genome Res.,17,
1296–1303.
13. Gazdar,A.F., Kurvari,V., Virmani,A., Gollahon,L., Sakaguchi,M.,
Westerfield,M., Kodagoda,D., Stasny,V., Cunningham,H.T.,
Wistuba,I.I. et al. (1998) Characterization of paired tumor and
non-tumor cell lines established from patients with breast cancer.
Int. J. Cancer,78, 766–774.
14. Galante,P.A., Vidal,D.O., de Souza,J.E., Camargo,A.A. and
de Souza,S.J. (2007) Sense-antisense pairs in mammals: functional
and evolutionary considerations. Genome Biol.,8, R40.
15. Langmead,B., Trapnell,C., Pop,M. and Salzberg,S.L. (2009)
Ultrafast and memory-efficient alignment of short DNA
sequences to the human genome. Genome Biol.,10, R25.
16. Kent,W.J. (2002) BLAT–the BLAST-like alignment tool.
Genome Res.,12, 656–664.
17. Medvedev,P., Stanciu,M. and Brudno,M. (2009) Computational
methods for discovering structural variation with next-generation
sequencing. Nat. Meth.,6, S13–S20.
18. Campbell,P.J., Stephens,P.J., Pleasance,E.D., O’Meara,S., Li,H.,
Santarius,T., Stebbings,L.A., Leroy,C., Edkins,S., Hardy,C. et al.
(2008) Identification of somatically acquired rearrangements in
cancer using genome-wide massively parallel paired-end
sequencing. Nat. Genet.,40, 722–729.
19. Zhao,Q., Caballero,O.L., Levy,S., Stevenson,B.J., Iseli,C.,
de Souza,S.J., Galante,P.A., Busam,D., Leversha,M.A.,
Chadalavada,K. et al. (2009) Transcriptome-guided
characterization of genomic rearrangements in a breast cancer
cell line. Proc. Natl Acad. Sci. USA,106, 1886–1891.
20. Ahn,S.M., Kim,T.H., Lee,S., Kim,D., Ghang,H., Kim,D.S.,
Kim,B.C., Kim,S.Y., Kim,W.Y., Kim,C. et al. (2009) The first
Korean genome sequence and analysis: full genome sequencing
for a socio-ethnic group. Genome Res.,19, 1622–1629.
21. Bentley,D.R., Balasubramanian,S., Swerdlow,H.P., Smith,G.P.,
Milton,J., Brown,C.G., Hall,K.P., Evers,D.J., Barnes,C.L.,
Bignell,H.R. et al. (2008) Accurate whole human genome
sequencing using reversible terminator chemistry. Nature,456,
53–59.
22. Schuster,S.C., Miller,W., Ratan,A., Tomsho,L.P., Giardine,B.,
Kasson,L.R., Harris,R.S., Petersen,D.C., Zhao,F., Qi,J. et al.
(2010) Complete Khoisan and Bantu genomes from southern
Africa. Nature,463, 943–947.
23. Wang,J., Wang,W., Li,R., Li,Y., Tian,G., Goodman,L., Fan,W.,
Zhang,J., Li,J., Zhang,J. et al. (2008) The diploid genome
sequence of an Asian individual. Nature,456, 60–65.
24. Wheeler,D.A., Srinivasan,M., Egholm,M., Shen,Y., Chen,L.,
McGuire,A., He,W., Chen,Y.J., Makhijani,V., Roth,G.T. et al.
(2008) The complete genome of an individual by massively
parallel DNA sequencing. Nature,452, 872–876.
25. Ogata,H., Goto,S., Sato,K., Fujibuchi,W., Bono,H. and
Kanehisa,M. (1999) KEGG: Kyoto Encyclopedia of Genes and
Genomes. Nucleic Acids Res.,27, 29–34.
26. Jonsson,P.F. and Bates,P.A. (2006) Global topological features of
cancer proteins in the human interactome. Bioinformatics,22,
2291–2297.
27. Bredel,M., Scholtens,D.M., Harsh,G.R., Bredel,C., Chandler,J.P.,
Renfrow,J.J., Yadav,A.K., Vogel,H., Scheck,A.C., Tibshirani,R.
et al. (2009) A network model of a cooperative genetic landscape
in brain tumors. JAMA,302, 261–275.
28. Zhao,Q., Kirkness,E.F., Caballero,O.L., Galante,P.A.,
Parmigiani,R.B., Edsall,L., Kuan,S., Ye,Z., Levy,S.,
Vasconcelos,A.T. et al. (2011) Systematic detection of putative
tumor suppressor genes through the combined use of exome and
transcriptome sequencing. Genome Biol.,11, R114.
29. Michor,F., Iwasa,Y., Vogelstein,B., Lengauer,C. and Nowak,M.A.
(2005) Can chromosomal instability initiate tumorigenesis?
Sem. Cancer Biol.,15, 43–49.
30. Bielas,J.H., Loeb,K.R., Rubin,B.P., True,L.D. and Loeb,L.A.
(2006) Human cancers express a mutator phenotype.
Proc. Natl Acad. Sci. USA,103, 18238–18242.
31. Albertini,R.J., Nicklas,J.A., O’Neill,J.P. and Robison,S.H. (1990)
In vivo somatic mutations in humans: measurement and analysis.
Annu. Rev. Genet.,24, 305–326.
32. Jones,S., Chen,W.D., Parmigiani,G., Diehl,F., Beerenwinkel,N.,
Antal,T., Traulsen,A., Nowak,M.A., Siegel,C., Velculescu,V.E.
et al. (2008) Comparative lesion sequencing provides
insights into tumor evolution. Proc. Natl Acad. Sci. USA,
105, 4283–4288.
33. Sie,L., Loong,S. and Tan,E.K. (2009) Utility of lymphoblastoid
cell lines. J. Neurosci. Res.,87, 1953–1959.
34. Harismendy,O., Ng,P.C., Strausberg,R.L., Wang,X.,
Stockwell,T.B., Beeson,K.Y., Schork,N.J., Murray,S.S.,
Topol,E.J., Levy,S. et al. (2009) Evaluation of next generation
sequencing platforms for population targeted sequencing studies.
Genome Biol.,10, R32.
35. Wood,L.D., Parsons,D.W., Jones,S., Lin,J., Sjoblom,T.,
Leary,R.J., Shen,D., Boca,S.M., Barber,T., Ptak,J. et al. (2007)
The genomic landscapes of human breast and colorectal cancers.
Science,318, 1108–1113.
6068 Nucleic Acids Research, 2011, Vol. 39, No. 14
... We identify novel adjacencies in both HCC1954T and HCC1954N using 4 different methods: NAIBR, Long Ranger, GASV and LUMPY. We formed a set of 283 PCR-validated novel adjacencies by combining novel adjacencies from three studies: Bignell et al. (2007), Stephens et al. (2009) and Galante et al. (2011). ...
... BJR is supported by a Career Award at the Scientific Interface from the Burroughs Wellcome Fund, and an Alfred P. Sloan Research Fellowship. (Bignell et al., 2007;Stephens et al., 2009;Galante et al., 2011) ...
Article
Full-text available
Motivation: Structural variation, including large deletions, duplications, inversions, translocations, and other rearrangements, is common in human and cancer genomes. A number of methods have been developed to identify structural variants from Illumina short-read sequencing data. However, reliable identification of structural variants remains challenging because many variants have breakpoints in repetitive regions of the genome and thus are difficult to identify with short reads. The recently developed linked-read sequencing technology from 10X Genomics combines a novel barcoding strategy with Illumina sequencing. This technology labels all reads that originate from a small number (~5-10) DNA molecules ~50Kbp in length with the same molecular barcode. These barcoded reads contain long-range sequence information that is advantageous for identification of structural variants. Results: We present Novel Adjacency Identification with Barcoded Reads (NAIBR), an algorithm to identify structural variants in linked-read sequencing data. NAIBR predicts novel adjacencies in a individual genome resulting from structural variants using a probabilistic model that combines multiple signals in barcoded reads. We show that NAIBR outperforms several existing methods for structural variant identification - including two recent methods that also analyze linked-reads - on simulated sequencing data and 10X whole-genome sequencing data from the NA12878 human genome and the HCC1954 breast cancer cell line. Several of the novel somatic structural variants identified in HCC1954 overlap known cancer genes. Availability: Software is available at compbio.cs.brown.edu/software. Contact: braphael@princeton.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
... ;https://doi.org/10.1101/190454 doi: bioRxiv preprint from breast cancer cell line HCC1954 (Bignell et al., 2007;Stephens et al., 2009;Galante et al., 2011). ...
Preprint
Structural variation, including large deletions, duplications, inversions, translocations, and other rearrangements, is common in human and cancer genomes. A number of methods have been developed to identify structural variants from Illumina short-read sequencing data. However, reliable identification of structural variants remains challenging because many variants have breakpoints in repetitive regions of the genome and thus are difficult to identify with short reads. The recently developed linked-read sequencing technology from 10X Genomics combines a novel barcoding strategy with Illumina sequencing. This technology labels all reads that originate from a small number (~5-10) DNA molecules ~50Kbp in length with the same molecular barcode. These barcoded reads contain long-range sequence information that is advantageous for identification of structural variants. We present Novel Adjacency Identification with Barcoded Reads (NAIBR), an algorithm to identify structural variants in linked-read sequencing data. NAIBR predicts novel adjacencies in a individual genome resulting from structural variants using a probabilistic model that combines multiple signals in barcoded reads. We show that NAIBR outperforms several existing methods for structural variant identification – including two recent methods that also analyze linked-reads – on simulated sequencing data and 10X whole-genome sequencing data from the NA12878 human genome and the HCC1954 breast cancer cell line. Several of the novel somatic structural variants identified in HCC1954 overlap known cancer genes.
... Overall, these results are in agreement with previous studies that also analyzed postzygotic variants. 34,45 Additional investigations addressed the questions of body distribution of postzygotic mutations and mosaicism stability over time. Analyses of tissues derived from different embryonic layers revealed that most postzygotic mutations were detected in all analyzed tissues. ...
Article
Background: Post-zygotic de novo mutations lead to the phenomenon of gene mosaicism. The three main types are called somatic, gonadal and gonosomal mosaicism, which differ in the body distribution of post-zygotic mutations. Mosaicism has been occasionally reported in primary immunodeficiency diseases (PID) since the early 1990s, but its real involvement has not been systematically addressed. Objective: To investigate the incidence of gene mosaicism in PID. Methods: The amplicon-based deep sequencing method was employed in the three parts of the study, which establish the allele frequency of germline variants (n = 100), the incidence of parental gonosomal mosaicism in PID families with de novo mutations (n = 92) and the incidence of mosaicism in PID families with moderate-to-high suspicion (n = 36), respectively. Additional investigations evaluated body distribution of post-zygotic mutations, their stability over time and their characteristics. Results: The range of allele frequency 44.1-55.6% was established for germline variants. Those with minor allele frequency (MAF) <44.1% were assumed to be post-zygotic. Mosaicism was detected in 30/128 (23.4%) PID families, with variable MAF (0.8-40.5%). Parental gonosomal mosaicism was detected in 6/92 (6.5%) families with de novo mutations, whereas a high incidence of mosaicism (63.9%) was detected among families with moderate-to-high suspicion. In most analyzed cases, mosaicism was found both uniformly distributed and stable over time. Conclusion: This study represents the largest one performed to date to investigate mosaicism in PID, revealing that it affects ≈25% of enrolled families. Our results may have serious consequences regarding patients' treatment and genetic counseling, and reinforce the use of NGS-based methods in the routine analyses of PID.
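The allele-frequency rule stated in the Results above can be illustrated with a minimal Python sketch. The function name `classify_variant` and the label strings are hypothetical. Only the 44.1-55.6% germline range and the <44.1% post-zygotic rule come from the abstract. The abstract does not say how frequencies above 55.6% were treated, so the sketch leaves them unclassified as an explicit assumption.

```python
# Thresholds (in percent) taken from the abstract's Results.
GERMLINE_MIN = 44.1
GERMLINE_MAX = 55.6


def classify_variant(allele_frequency_percent: float) -> str:
    """Label a variant by its allele frequency, given in percent."""
    if allele_frequency_percent < GERMLINE_MIN:
        # Below the germline range: assumed post-zygotic (mosaic).
        return "post-zygotic"
    if allele_frequency_percent <= GERMLINE_MAX:
        return "germline"
    # Assumption: the abstract gives no rule above the germline range.
    return "unclassified"


# The lowest and highest mosaic frequencies reported in the abstract.
labels = [classify_variant(f) for f in (0.8, 40.5, 50.0, 60.0)]
```

Both ends of the reported mosaic range (0.8% and 40.5%) fall below the cut-off, which is consistent with the study assigning them to the post-zygotic class.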
... The amino acid right next to P9, G10, also exhibits a similar frameshift deletion (G10fs*76), insertion (G10fs*70), or missense mutation (G10D) in five patients with colon cancer, CRC, gallbladder cancer, and glioblastoma (TCGA) (105,106,109). Missense mutations at R372 (R372C, H or S) of the TRAF-C domain of TRAF2 are detected in eight patients with HNSCC, melanoma, and prostate, uterine, cervical, stomach, and liver cancers (TCGA; COSMIC) (110-113). Another amino acid of the TRAF-C domain, Q457, shows complex mutations, including a truncation (Q457*), a frameshift insertion (Q457fs*277), and missense mutations (Q457K or L) in six patients with HNSCC, oral squamous cell carcinoma (OSCC), stomach cancer, melanoma, and breast cancer (TCGA; COSMIC) (8,114). Frameshift mutations occurring at P9 and G10 are functionally equivalent to deletion of TRAF2. ...
Article
The tumor necrosis factor receptor (TNF-R)-associated factor (TRAF) family of cytoplasmic adaptor proteins regulate the signal transduction pathways of a variety of receptors, including the TNF-R superfamily, Toll-like receptors (TLRs), NOD-like receptors (NLRs), RIG-I-like receptors (RLRs), and cytokine receptors. TRAF-dependent signaling pathways participate in a diverse array of important cellular processes, including the survival, proliferation, differentiation, and activation of different cell types. Many of these TRAF-dependent signaling pathways have been implicated in cancer pathogenesis. Here we analyze the current evidence of genetic alterations of TRAF molecules available from The Cancer Genome Atlas (TCGA) and the Catalog of Somatic Mutations in Cancer (COSMIC) as well as the published literature, including copy number variations and mutation landscape of TRAFs in various human cancers. Such analyses reveal that both gain- and loss-of-function genetic alterations of different TRAF proteins are commonly present in a number of human cancers. These include pancreatic cancer, meningioma, breast cancer, prostate cancer, lung cancer, liver cancer, head and neck cancer, stomach cancer, colon cancer, bladder cancer, uterine cancer, melanoma, sarcoma, and B cell malignancies, among others. Furthermore, we summarize the key in vivo and in vitro evidence that demonstrates the causal roles of genetic alterations of TRAF proteins in tumorigenesis within different cell types and organs. Taken together, the information presented in this review provides a rationale for the development of therapeutic strategies to manipulate TRAF proteins or TRAF-dependent signaling pathways in different human cancers by precision medicine.
... Previous studies have experimentally validated the predicted SVs and fusion-gene events for these two cell lines. Specifically, we compile results from [34][35][36][37][38] for HCC1954 and results from [13,35] for HCC1395. After removing short deletions and overlapping SVs from different studies, we have 326 validated SVs for the HCC1954 cell line, of which 245 have at least one breakpoint outside a gene region, and the rest (81) have both breakpoints within a gene region. ...
Article
Transcripts are frequently modified by structural variations, which lead to fused transcripts of either multiple genes, known as a fusion gene, or a gene and a previously non-transcribed sequence. Detecting these modifications, called transcriptomic structural variations (TSVs), especially in cancer tumor sequencing, is an important and challenging computational problem. We introduce SQUID, a novel algorithm to predict both fusion-gene and non-fusion-gene TSVs accurately from RNA-seq alignments. SQUID unifies both concordant and discordant read alignments into one model and doubles the precision on simulation data compared to other approaches. Using SQUID, we identify novel non-fusion-gene TSVs on TCGA samples.
... Previous mutation screening studies including these six cell lines have focused on detecting mutations only in coding regions, which explain the relatively high numbers of novel mutations outside these regions in our data. However, the fact that we could detect novel mutations in coding regions in cell lines previously analyzed by others could possibly be explained by differences in sequencing technology, on-target efficiency and sequence coverage, as well as analysis methodologies [22,25,29,30]. ...
Article
Basal-like breast cancer is an aggressive subtype generally characterized by poor prognosis and a lack of expression of the three most important clinical biomarkers: estrogen receptor, progesterone receptor, and HER2. Cell lines serve as useful model systems to study cancer biology in vitro and in vivo. We performed mutational profiling of six basal-like breast cancer cell lines (HCC38, HCC1143, HCC1187, HCC1395, HCC1954, and HCC1937) and their matched normal lymphocyte DNA using targeted capture and next-generation sequencing of 1,237 cancer-associated genes, including all exons, UTRs and upstream flanking regions. In total, 658 somatic variants were identified, of which 378 were non-silent (average 63 per cell line, range 37-146) and 315 were novel (not present in the Catalogue of Somatic Mutations in Cancer database; COSMIC). 125 novel mutations were confirmed by Sanger sequencing (59 exonic, 48 3'UTR and 10 5'UTR, 1 splicing), with a validation rate of 94% of high confidence variants. Of 36 mutations previously reported for these cell lines but not detected in our exome data, 36% could not be detected by Sanger sequencing. The base replacements C/G>A/T, C/G>G/C, C/G>T/A and A/T>G/C were significantly more frequent in the coding regions compared to the non-coding regions (OR 3.2, 95% CI 2.0-5.3, P
Article
Copy number variations (CNVs), which include deletions, duplications, inversions, translocations, and other forms of chromosomal rearrangement, are common in human cancers. In this report we investigated the pattern of these variations with the goal of understanding whether specific cancer signatures exist. We used rearrangement endpoint data deposited in the Catalogue of Somatic Mutations in Cancer (COSMIC) for our analysis. Indeed, we find that human cancers are characterized by specific patterns of chromosome rearrangement endpoints, which in turn result in cancer-specific CNVs. A review of the literature reveals tissue-specific mutations which either drive these CNVs or appear as a consequence of CNVs because they confer an advantage to the cancer cell. We also identify several rearrangement endpoint hotspots that were not previously reported. Our analysis suggests that in addition to local chromosomal architecture, CNVs are driven by the internal cellular or nuclear physiology of each cancer tissue.
Article
About half of the known miRNA genes are located within protein-coding host genes, and are thus subject to co-transcription. Accumulating data indicate that this coupling may be an intrinsic mechanism to directly regulate the host gene's expression, constituting a negative feedback loop. Inevitably, the cell requires a yet largely unknown repertoire of methods to regulate this control mechanism. We propose APA as one possible mechanism by which negative feedback of intronic miRNA on their host genes might be regulated. Using in-silico analyses, we found that host genes that contain seed matching sites for their intronic miRNAs yield longer 3'UTRs with more polyadenylation sites. Additionally, the distribution of polyadenylation signals differed significantly between these host genes and host genes of miRNAs that do not contain potential miRNA binding sites. We then transferred these in-silico results to a biological example and investigated the relationship between ZFR and its intronic miRNA miR-579 in a U87 cell line model. We found that ZFR is targeted by its intronic miRNA miR-579 and that alternative polyadenylation allows differential targeting. We additionally used bioinformatics analyses and RNA-Seq to evaluate a potential cross-talk between intronic miRNAs and alternative polyadenylation. CPSF2, a gene previously associated with alternative polyadenylation signal recognition, might be linked to intronic miRNA negative feedback by altering polyadenylation signal utilization.
Article
Hardly a month goes by without a new published report of a patient’s genome being used diagnostically for clinical management in a diverse spectrum of disease areas, including gastroenterology, nephrology, neurology and oncology. The impression is that clinical genomics is already becoming semi-routine. However, a large and complex set of non-technical barriers needs to be overcome before genomics can truly be integrated into the practice of medicine and made widely available for patient care. Through the use of case studies, my presentation will elucidate issues relating to the needs and requirements of the workforce, the legal and regulatory aspects of ‘laboratory-developed tests’ and insurance reimbursement for ‘multi-analyte diagnostics’. The roles of the Food and Drug Administration, the Centers for Medicare & Medicaid Services and the College of American Pathologists will be highlighted.
Article
Author Summary: Extensive tumor genome sequencing has provided raw material to understand mutational processes and identify cancer-associated somatic variants. However, fundamental problems remain to: i) separate ‘driver’ from ‘passenger’ mutations, ii) further understand the functional mechanisms and consequences of driver mutations, and iii) identify the cancer types in which each driver mutation is relevant. Here we analyze whole-genome and exome tumor sequencing data from the perspective of protein domains, the basic structural and functional units of proteins. Exploring the cancer-type-specific landscape of domain mutations across 21 cancer types, we identify both cancer-type-specific mutated domains and mutational hotspots. Frequently-mutated domains were identified for oncoproteins for which the ‘mutational hotspot’ phenomenon owing to the relative rarity of gain-of-function mutations is well known, and also for tumor suppressor proteins, for which more uniformly distributed loss-of-function driver mutations are expected. A given gene product may be perturbed differently in different cancers. Indeed, we observed systematic shifts between cancer types of the positions at which mutations occur within a given protein. Both known and novel candidate driver mutations were retrieved. Novel cancer gene candidates significantly overlapped with orthogonal systematic cancer screen hits, supporting the power of this approach to identify cancer genes.
Article
DNA sequence information underpins genetic research, enabling discoveries of important biological or medical benefit. Sequencing projects have traditionally used long (400-800 base pair) reads, but the existence of reference sequences for the human and many other genomes makes it possible to develop new, fast approaches to re-sequencing, whereby shorter reads are compared to a reference to identify intraspecies genetic variation. Here we report an approach that generates several billion bases of accurate nucleotide sequence per experiment at low cost. Single molecules of DNA are attached to a flat surface, amplified in situ and used as templates for synthetic sequencing with fluorescent reversible terminator deoxyribonucleotides. Images of the surface are analysed to generate high-quality sequence. We demonstrate application of this approach to human genome sequencing on flow-sorted X chromosomes and then scale the approach to determine the genome sequence of a male Yoruba from Ibadan, Nigeria. We build an accurate consensus sequence from >30x average depth of paired 35-base reads. We characterize four million single-nucleotide polymorphisms and four hundred thousand structural variants, many of which were previously unknown. Our approach is effective for accurate, rapid and economical whole-genome re-sequencing and many other biomedical applications.
Article
The genetic structure of the indigenous hunter-gatherer peoples of southern Africa, the oldest known lineage of modern human, is important for understanding human diversity. Studies based on mitochondrial [1] and small sets of nuclear markers [2] have shown that these hunter-gatherers, known as Khoisan, San, or Bushmen, are genetically divergent from other humans [1,3]. However, until now, fully sequenced human genomes have been limited to recently diverged populations [4-8]. Here we present the complete genome sequences of an indigenous hunter-gatherer from the Kalahari Desert and a Bantu from southern Africa, as well as protein-coding regions from an additional three hunter-gatherers from disparate regions of the Kalahari. We characterize the extent of whole-genome and exome diversity among the five men, reporting 1.3 million novel DNA differences genome-wide, including 13,146 novel amino acid variants. In terms of nucleotide substitutions, the Bushmen seem to be, on average, more different from each other than, for example, a European and an Asian. Observed genomic differences between the hunter-gatherers and others may help to pinpoint genetic adaptations to an agricultural lifestyle. Adding the described variants to current databases will facilitate inclusion of southern Africans in medical research efforts, particularly when family and medical histories can be correlated with genome-wide data.
Article
Massively parallel DNA sequencing technologies provide an unprecedented ability to screen entire genomes for genetic changes associated with tumour progression. Here we describe the genomic analyses of four DNA samples from an African-American patient with basal-like breast cancer: peripheral blood, the primary tumour, a brain metastasis and a xenograft derived from the primary tumour. The metastasis contained two de novo mutations and a large deletion not present in the primary tumour, and was significantly enriched for 20 shared mutations. The xenograft retained all primary tumour mutations and displayed a mutation enrichment pattern that resembled the metastasis. Two overlapping large deletions, encompassing CTNNA1, were present in all three tumour samples. The differential mutation frequencies and structural variation patterns in metastasis and xenograft compared with the primary tumour indicate that secondary tumours may arise from a minority of cells within the primary tumour.
Article
Motivation: The study of interactomes, or networks of protein-protein interactions, is increasingly providing valuable information on biological systems. Here we report a study of cancer proteins in an extensive human protein-protein interaction network constructed by computational methods. Results: We show that human proteins translated from known cancer genes exhibit a network topology that is different from that of proteins not documented as being mutated in cancer. In particular, cancer proteins show an increase in the number of proteins they interact with. They also appear to participate in central hubs rather than peripheral ones, mirroring their greater centrality and participation in networks that form the backbone of the proteome. Moreover, we show that cancer proteins contain a high ratio of highly promiscuous structural domains, i.e., domains with a high propensity for mediating protein interactions. These observations indicate an underlying evolutionary distinction between the two groups of proteins, reflecting the central roles of proteins, whose mutations lead to cancer.
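The observation that cancer proteins have more interaction partners amounts to comparing node degrees in the interaction network. The following Python sketch shows that comparison on a toy edge list. The protein identifiers `P1` to `P5`, the choice of `P1` as the only cancer protein, and the function names are all hypothetical and are not data from the study.

```python
from collections import defaultdict


def degrees(edges):
    """Count distinct interaction partners for each protein in an undirected edge list."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {protein: len(partners) for protein, partners in neighbors.items()}


def mean_degree(degree_by_protein, proteins):
    """Average degree over a non-empty collection of proteins."""
    proteins = list(proteins)
    return sum(degree_by_protein.get(p, 0) for p in proteins) / len(proteins)


# Hypothetical toy network, for illustration only.
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P4", "P5")]
deg = degrees(edges)

cancer_proteins = {"P1"}  # assumed label, not from the study
other_proteins = set(deg) - cancer_proteins

cancer_mean = mean_degree(deg, cancer_proteins)
other_mean = mean_degree(deg, other_proteins)
```

A real analysis would also need a statistical test and a correction for study bias toward well-characterized proteins, neither of which is attempted here.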
Article
The goal of our study was to develop a panel of tumor cell lines along with paired non‐malignant cell lines or strains collected from breast cancers, predominantly primary tumors. From a total of 189 breast tumor samples consisting of 177 primary tumors and 12 metastatic tissues, we established 21 human breast tumor cell lines that included 18 cell lines derived from primary tumors and 3 derived from metastatic lesions. Cell lines included those from patients with germline BRCA1 and FHIT gene mutations and others with possible genetic predisposition. For 19 tumor cell lines, we also established one or more corresponding non‐malignant cell strains or B lymphoblastoid (BL) lines, which included 16 BL lines and 7 breast epithelial (2) or stromal (5) cell strains. The present report describes clinical, pathological and molecular information regarding the normal and tumor tissue sources along with relevant personal information and familial medical history. Analysis of the breast tumor cell lines indicated that most of the cell lines had the following features: they were derived from large tumors with or without axillary node metastases; were aneuploid and exhibited a moderate to poorly differentiated phenotype; were estrogen receptor (ER)‐ and progesterone receptor (PR)‐negative; and overexpressed p53 and HER2/neu proteins. Of 13 patients with primary breast cancers receiving curative intent mastectomies, 7 were dead after a mean period of 10 months. Our panel of paired tumor and non‐malignant cell lines should provide important new reagents for breast cancer research. Int. J. Cancer 78:766–774, 1998. © 1998 Wiley‐Liss, Inc.
Article
... method on the ABI 3730xL platform (hereafter referred to as ABI Sanger) [2-4]. To date these new technologies have ... genome sequencing [9-11]. Currently there is much interest in applying NGS platforms for targeted sequencing of specific candidate genes, intervals identified ...
Article
Analyzing vertebrate genomes requires rapid mRNA/DNA and cross-species protein alignments. A new tool, BLAT, is more accurate and 500 times faster than popular existing tools for mRNA/DNA alignments and 50 times faster for protein alignments at sensitivity settings typically used when comparing vertebrate sequences. BLAT's speed stems from an index of all nonoverlapping K-mers in the genome. This index fits inside the RAM of inexpensive computers, and need only be computed once for each genome assembly. BLAT has several major stages. It uses the index to find regions in the genome likely to be homologous to the query sequence. It performs an alignment between homologous regions. It stitches together these aligned regions (often exons) into larger alignments (typically genes). Finally, BLAT revisits small internal exons possibly missed at the first stage and adjusts large gap boundaries that have canonical splice sites where feasible. This paper describes how BLAT was optimized. Effects on speed and sensitivity are explored for various K-mer sizes, mismatch schemes, and number of required index matches. BLAT is compared with other alignment programs on various test sets and then used in several genome-wide applications. http://genome.ucsc.edu hosts a web-based BLAT server for the human genome.
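The indexing idea in the BLAT abstract can be illustrated with a minimal Python sketch. It is not BLAT's implementation: the function names `build_index` and `find_hits`, the toy sequences, and the choice of K = 4 are assumptions made for illustration. The index stores only the nonoverlapping K-mers of the genome, while every overlapping K-mer of the query is looked up, and only exact matches are considered.

```python
from collections import defaultdict


def build_index(genome, k):
    """Map each nonoverlapping k-mer of the genome to its start positions."""
    index = defaultdict(list)
    # Step by k so that the indexed k-mers do not overlap.
    for start in range(0, len(genome) - k + 1, k):
        index[genome[start:start + k]].append(start)
    return dict(index)


def find_hits(index, query, k):
    """Return (query_pos, genome_pos) pairs for each overlapping query k-mer found in the index."""
    hits = []
    for q in range(len(query) - k + 1):
        for g in index.get(query[q:q + k], []):
            hits.append((q, g))
    return hits


# Toy sequences, for illustration only.
genome = "ACGTTGCAGGATCCAA"
index = build_index(genome, 4)
hits = find_hits(index, "TTGCAGGA", 4)
```

Because the genome is sampled every K bases, the index holds roughly 1/K as many entries as an index of all overlapping K-mers, which is what lets it fit in memory. Each hit only marks a candidate region; the alignment and stitching stages described in the abstract are not sketched here.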