Next generation sequencing for TCR repertoire profiling: platform-
specific features and correction algorithms
Dmitry A. Bolotin1,#, Ilgar Z. Mamedov1,#, Olga V. Britanova1, Ivan V. Zvyagin1, Dmitriy
Shagin1,2, Svetlana V. Ustyugova1, Maria A. Turchaninova1, Sergey Lukyanov1, Yury B.
Lebedev1, and Dmitriy M. Chudakov1,*
1 Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry RAS, Miklukho-Maklaya
16/10, Moscow, Russia, 117997
2 Evrogen JSC, Miklukho-Maklaya 16/10, Moscow, Russia, 117997
* Corresponding author. Tel: +74997248122; E-mail: ChudakovDM@mail.ru
# These authors contributed equally to this work.
This article has been accepted for publication and undergone full peer review but has not
been through the copyediting, typesetting, pagination and proofreading process which may
lead to differences between this version and the Version of Record. Please cite this article
as an ‘Accepted Article’, doi: 10.1002/eji.201242517
© 2012 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim
Received: March 6, 2012; Revised: June 14, 2012; Accepted: July 10, 2012
The T-cell receptor (TCR) repertoire is a mirror of the human immune system that reflects
processes caused by infections, cancer, autoimmunity, and aging. Next generation
sequencing (NGS) is becoming a powerful tool for deep TCR profiling; yet, questions
abound regarding the methodological approaches for sample preparation and correct data
interpretation. Accumulated PCR and sequencing errors along with library preparation
bottlenecks and uneven PCR efficiencies lead to information loss, biased quantification,
and generation of huge artificial TCR diversity. Here, we compare Illumina, 454, and Ion
Torrent platforms for individual TCR profiling, evaluate the rate and character of errors,
and propose advanced platform-specific algorithms to correct massive sequencing data.
These developments are applicable to a wide variety of NGS applications. We demonstrate
that advanced correction allows the removal of the majority of artificial TCR diversity with
concomitant rescue of most of the sequencing information. Thus, this correction enhances
the accuracy of clonotype identification and quantification as well as overall TCR diversity
The adaptive immune system drives the immune response via specific hypervariable
molecules: B cell-generated immunoglobulins and immunoglobulin-like T-cell receptors
(TCRs) on T lymphocytes. These molecules are formed by genomic recombination in order
to recognize foreign structures. The antigen specificity of each TCR is largely determined
by the hypervariable complementarity-determining region 3 (CDR3) of the receptor beta
chain, which is formed by recombination of V, D, and J gene segments with template-
independent addition of nucleotides at the gene segment junctions.
Realistic analysis of the TCR CDR3 diversity is crucial for the understanding of the
basic molecular mechanisms of adaptive immunity function in health and disease, including
consequences of aging, immunosuppressive and blood transplantation therapies, irradiation,
autoimmunity, infections, and other conditions. Moreover, high convergence of TCR
beta CDR3 amino acid sequences, along with the expanding knowledge on their
specificities, suggests the potential for diagnostics of multiple infections and
pathological states via deep individual TCR profiling. These perspectives demand the
development of powerful technological and software approaches for deep error-free TCR
Next generation sequencing (NGS) techniques have revolutionized this field by
allowing massive analysis of TCR and antibody repertoires from individual blood samples
[4-6] both before and after therapy [7, 8] and for different lymphocyte subsets [2, 9, 10];
however, performing deep, unbiased, and quantitative analysis of millions of CDR3
sequences that include highly homologous variants is quite challenging, due to numerous
bottlenecks, accumulation of PCR and sequencing errors, and ratio bias. Altogether, these
technical challenges lead to the loss of the original TCR repertoire of an analyzed T-cell
sample, generation of false TCR diversity, and inability to interpret sequence information
in a quantitative way, thus challenging intelligent analysis and cross-comparison of
acquired datasets and generally complicating adaptive immunity studies.
The specific problem that distinguishes TCR repertoire analysis from sequencing
analysis of a known genome or transcriptome resides in the fact that generating more
sequencing reads results in the expansion of erroneous sequence variants with one, two, or
more mismatches, and these changes raise the question of whether this sequence represents
the same CDR3 with introduced errors or a different independently formed TCR. Here, we
compare different NGS platforms for TCR profiling and propose advanced algorithms for
efficient and valid correction of platform-specific errors with minimal loss of quantitative
information originally contained in a lymphocyte sample.
TCR library preparation and NGS
To provide reproducibility controls, we isolated RNA from three equal aliquots of
peripheral blood mononuclear cells (PBMCs; hereafter called A, B, and C) that were
obtained from the three parallel blood samples from the same donor. The first stage of
library preparation was performed as described. Briefly, cDNA synthesis was
performed using a specific primer targeted to the constant region of TCR beta. A template
switching effect was employed to generate a universal primer site at the 5’-end (Supporting
Information Fig. 1). Next, a universal nested primer corresponding to the constant TCR
beta region and a universal 5’-primer were used in the first amplification step. For the final
pre-sequencing sample preparation, we employed different strategies for the 454, Ion
Torrent, and Illumina platforms (see Materials and Methods for details).
Importantly, the total number of PCR amplification cycles for pre-sequencing
library preparation for each of the platforms did not exceed 22 cycles (taking into account
the dilution of the first PCR reaction). Considering the average efficiency of PCR
amplification as 1.8 (measured by real-time PCR, data not shown), the starting number of
molecules that entered PCR amplification after cDNA synthesis was estimated to be
approximately 1.2 × 10^6 for each sample. This bottleneck determined the possible depth of
TCR profiling in our experimental system (see Discussion).
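The bottleneck estimate above rests on the usual exponential model of PCR, in which each cycle multiplies the template count by the amplification efficiency. The following is a minimal sketch under that idealized model (constant efficiency, no plateau). The function names are illustrative, and the final yield in the example is derived from the numbers in the text rather than measured:

```python
def amplification_factor(efficiency: float, cycles: int) -> float:
    """Fold amplification after `cycles` rounds at a constant per-cycle efficiency."""
    return efficiency ** cycles


def starting_molecules(final_molecules: float, efficiency: float, cycles: int) -> float:
    """Back-calculate the number of template molecules that entered PCR."""
    return final_molecules / amplification_factor(efficiency, cycles)


# Efficiency (1.8) and cycle limit (22) are taken from the text.
factor = amplification_factor(1.8, 22)
# Yield implied by the stated starting number; not a measured value.
implied_yield = 1.2e6 * factor
print(f"fold amplification: {factor:.3g}")
print(f"implied final yield: {implied_yield:.3g} molecules")
```

In practice the calculation runs in the other direction: a measured product yield is divided by the fold amplification to obtain the number of starting molecules.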
Data obtained with the three NGS platforms exhibited significant differences
stemming from the sequencing technologies. Data from the 454 platform were less
abundant, but the sequencing reads were the longest of the three platforms. The Illumina
produced 100-fold more sequencing reads of shorter length, and the performance of the Ion
Torrent was between the other two with respect to both parameters (Table 2 and Supporting
Information Fig. 2).
Because reverse transcription, amplification, and sequencing each generate multiple
errors, further analysis of NGS data required deep platform-specific processing to
provide correct extraction and grouping of sequences with identical CDR3 regions that
corresponded to distinct TCR clonotypes. Below, we provide optimized algorithms for such
processing, which resulted in efficient correction of PCR and sequencing errors with
minimal information loss.
Algorithms to eliminate PCR and sequencing errors for the Illumina platform
As was shown earlier, a significant number of PCR errors accumulate during the
course of pre-sequencing PCR and bridge amplification on the Illumina, and these errors
can add to or even multiply the actual TCR repertoire diversity [6, 11]. To eliminate this
artificial diversity, two approaches have been suggested: withdrawal of the low-abundance
CDR3 variants that differ from the high-abundance variants by a single nucleotide
mismatch, or removal of low-abundance sequence variants that comprise a total of 4% of
all sequencing reads. While these approaches appear reasonable, blind elimination of
low-abundance TCR clonotypes can be disastrous for studies of converging clonotype variants.
At the same time, complete elimination of artificial diversity is far from achieved.
Another important issue is low-quality sequencing reads. Retaining low-quality
sequences increases the number of erroneous clonotypes and thus, generates artificially
increased diversity of the TCR repertoire. As sequence data are provided with a quality
measure for each nucleotide, the low-quality sequences could potentially just be eliminated
[6, 11]; however, we believe that this approach is erroneous for quantitative analysis for
two reasons. First, strong filtering conditions result in loss of a significant portion of the
data (up to 50% of sequencing reads for the Illumina platform and even more for other
platforms, see below). Second, NGS errors may occur in a sequence-dependent manner,
leading to underestimation of certain TCR clonotypes.
Based on these considerations, we propose the following advanced error correction
algorithm for processing Illumina TCR profiling data. This algorithm provides efficient and
safe elimination of PCR and sequencing errors with minimal information loss.
i) CDR3 extraction. From each sequencing read, the CDR3 is extracted, if possible.
This step is based on the alignment of each sequence to the set of genomic V, D, and J
segments from the IMGT/GENE-DB database and identification of conservative Cys and
Phe at the CDR3 boundaries. For the V, D, and J gene segment identification, low-quality
nucleotides are treated as allowable mismatches (detailed in the Materials and Methods).
ii) Formation of the “core clonotypes.” Reads with identical CDR3 nucleotide
sequences with high-quality sequence at each nucleotide position within CDR3 are grouped
to form “core clonotypes.”
iii) Mapping of low-quality reads. Low-quality sequencing reads with up to 3 low-
quality nucleotides within the CDR3 are merged with the latter “core clonotypes”
(mismatches allowed in low-quality positions). Sequencing reads that contain more than 3
low-quality nucleotides within the CDR3 or fail to map to a “core clonotype” are removed.
iv) Correction of PCR errors. Mismatches within V, D, or J segments of CDR3 do
not arise naturally, because TCRs do not undergo somatic hypermutation. Therefore,
nucleotide mismatches within the segments can only arise from PCR and sequencing errors
and may be safely corrected. At this step, low-abundance “core clonotypes” are merged with
the more abundant (at least 5-fold more abundant) “core clonotype” that differs by no more
than 3 nucleotides within the V, D, or J segments of CDR3 (since the percentage of sequencing
reads carrying more than 3 errors is negligible, see below). Of these mismatches, no more
than 2 are allowed within the V gene segment (excluding the last 2 identified nucleotides),
no more than 2 are allowed within the J gene segment (excluding the first 2 identified
nucleotides), and no more than 1 is allowed within the D gene segment (excluding the first
2 and the last 2 identified nucleotides). This additional constraint was introduced to avoid
clustering of clones with homologous but actually different segments. Such correction
is safe with respect to the potential risk of losing natural TCR diversity, since no
mismatches within the nucleotides added or trimmed during TCR rearrangement are used
for correction (see Table 1). In contrast to the blind cutoff of 4% of sequencing reads from
low-abundance CDR3 variants proposed by Warren et al., this approach allows rational
elimination of PCR errors while maintaining the natural TCR diversity.
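Steps ii and iii can be illustrated with a short sketch. It assumes that CDR3 sequences and per-base quality scores have already been extracted. The Phred cutoff `Q_MIN`, the function names, and the tie-break rule (assign a rescued read to the most abundant matching core clonotype) are assumptions made for illustration and are not specified in the text:

```python
from collections import Counter

Q_MIN = 20      # assumed Phred cutoff for a high-quality base (not given in the text)
MAX_LOW_Q = 3   # at most 3 low-quality bases within the CDR3 (step iii)


def low_quality_positions(quals, q_min=Q_MIN):
    """Indices of bases whose quality falls below the cutoff."""
    return [i for i, q in enumerate(quals) if q < q_min]


def build_clonotypes(reads):
    """reads: iterable of (cdr3_sequence, per-base quality list).

    Returns (counts, dropped): read counts per core clonotype after rescue
    of low-quality reads, and the number of reads that were discarded.
    """
    core = Counter()
    low_quality = []
    for seq, quals in reads:
        bad = low_quality_positions(quals)
        if not bad:
            core[seq] += 1                    # step ii: every base is high quality
        else:
            low_quality.append((seq, set(bad)))

    counts = Counter(core)
    dropped = 0
    for seq, bad in low_quality:
        if len(bad) > MAX_LOW_Q:
            dropped += 1
            continue
        # Step iii: mismatches are tolerated only at low-quality positions.
        candidates = [
            c for c in core
            if len(c) == len(seq)
            and all(a == b for i, (a, b) in enumerate(zip(c, seq)) if i not in bad)
        ]
        if candidates:
            # Assumed tie-break: the most abundant matching core clonotype wins.
            counts[max(candidates, key=lambda c: core[c])] += 1
        else:
            dropped += 1
    return counts, dropped


reads = [
    ("TGTGCC", [30] * 6),
    ("TGTGCC", [30] * 6),
    ("TGTGAC", [30, 30, 30, 30, 5, 30]),   # rescued: differs only at a low-quality base
    ("AAAAAA", [5] * 6),                   # removed: more than 3 low-quality bases
    ("TGTTTT", [30, 30, 30, 5, 30, 30]),   # removed: high-quality mismatch, no core match
]
counts, dropped = build_clonotypes(reads)
print(dict(counts), dropped)
```

The linear scan over core clonotypes keeps the sketch readable; a real dataset with millions of reads would need an indexed lookup instead.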
Each of the final clonotypes is characterized by its unique CDR3 nucleotide
sequence, identified V, D, and J gene segments (all determined by the dominant “core
clonotype”), and the total merged reads number that corresponds to the relative abundance
of a T-cell clone within the initial sample. According to the proposed algorithm, these reads
include identical high-quality CDR3 sequencing reads, homologous low-quality reads, and
homologous high-quality reads of lower abundance (transcription, reverse transcription,
PCR, and sequencing errors). Importantly, none of the low-quality reads forms a final
clonotype. Instead, low-quality reads are rescued by mapping to the final clonotypes
formed by high-quality sequencing reads.
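The merging rule of step iv can be sketched as follows. The representation is hypothetical: each core clonotype carries a per-base label string in which 'V', 'D', and 'J' mark germline positions eligible for correction, and 'N' marks added nucleotides and the excluded segment edges. The greedy processing order (most abundant first) is likewise an assumption, not a detail taken from the text:

```python
from collections import Counter
from dataclasses import dataclass

MAX_TOTAL = 3                                # at most 3 mismatches overall
MAX_PER_SEGMENT = {"V": 2, "D": 1, "J": 2}   # per-segment limits from step iv
MIN_FOLD = 5                                 # major must be at least 5-fold more abundant


@dataclass
class Clonotype:
    cdr3: str
    count: int
    segments: str   # per-base label: 'V', 'D', 'J' (correctable) or 'N' (not correctable)


def can_merge(minor: Clonotype, major: Clonotype) -> bool:
    """True if `minor` looks like a PCR-error variant of `major`."""
    if len(minor.cdr3) != len(major.cdr3):
        return False
    if major.count < MIN_FOLD * minor.count:
        return False
    per_segment = Counter()
    for a, b, label in zip(minor.cdr3, major.cdr3, major.segments):
        if a != b:
            if label == "N":    # mismatch in added bases or at a segment edge
                return False
            per_segment[label] += 1
    total = sum(per_segment.values())
    if total == 0 or total > MAX_TOTAL:
        return False
    return all(per_segment[s] <= limit for s, limit in MAX_PER_SEGMENT.items())


def correct_pcr_errors(clonotypes):
    """Greedily merge low-abundance variants into more abundant clonotypes."""
    final = []
    for clone in sorted(clonotypes, key=lambda c: c.count, reverse=True):
        target = next((m for m in final if can_merge(clone, m)), None)
        if target is not None:
            target.count += clone.count
        else:
            final.append(Clonotype(clone.cdr3, clone.count, clone.segments))
    return final


labels = "VVVVNNJJJ"
clones = [
    Clonotype("TGAGCCAGC", 10, labels),    # V-segment mismatch: merged into the major clone
    Clonotype("TGTGCTAGC", 10, labels),    # mismatch in an added base: kept
    Clonotype("TGTGCCAGC", 100, labels),   # major clone
    Clonotype("AGTGCCAGC", 50, labels),    # V mismatch but less than 5-fold rarer: kept
]
result = correct_pcr_errors(clones)
print([(c.cdr3, c.count) for c in result])
```

Because mismatches at 'N' positions always block a merge, two clonotypes that differ in their added or trimmed nucleotides are never collapsed, which is the safety property the text emphasizes.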
Advanced correction resulted in a dramatic reduction of the estimated TCR
diversity (2-fold reduction in the number of extracted clonotypes) in the Illumina data
obtained for the complex blood-derived T-cell sample (Table 2, advanced correction). At the
same time, 90% of the quantitative sequencing information (i.e., total number of evaluable
sequencing reads) was preserved, resulting in a nearly 2-fold increase in reads per
clonotype, a parameter that generally determines the quantification power of the whole method
(see below). In contrast, removal of the low-quality reads resulted in a substantial loss of
quantitative information, although the artificial diversity was not eliminated efficiently
The percentages of the sequencing reads that mapped to the 1000 major CDR3
clonotypes with single, double, and triple nucleotide mismatches within the CDR3 were
3.59%, 0.12%, and 0.01%, respectively, confirming that the percentage of sequencing reads
with > 3 errors is negligible. Notably, 100% of the mapped sequences with double and
triple nucleotide mismatches represented minor descendants of the mapped sequences with
single nucleotide mismatches (such as shown in gray in Table 1), indicating these
mismatches originated from secondary errors during later PCR cycles.
Advanced correction eliminated the major portion of PCR errors. For example, the
number of CDR3 sequences found only once in a dataset (sequences that mostly
represent reads with sequencing or PCR errors) was greatly reduced after advanced
correction (compare Fig. 1A and C), whereas straightforward removal of the low-quality
reads was much less efficient in this respect (Fig. 1B). The number of erroneous but high-
quality CDR3 sequence variants that were successfully mapped to a single large clonotype
often exceeded 100 (see Supporting Information Dataset 1 for example), providing a vivid
example of how accumulated PCR errors can multiply real TCR diversity.
By design, the proposed algorithm cannot eliminate errors within added nucleotides
or within the edges of the V, D, and J gene segments. Assuming uniform distribution of
error positions within CDR3, this limitation leads to undercorrection of approximately
15-35% of PCR errors, depending on the nature of the analyzed TCR library. Lifting
these restrictions allows the remaining errors to be corrected efficiently as well (data not
shown), but at a significant risk of losing natural TCR diversity, a risk that remains
minimal with the cautious correction algorithm proposed above.
Analysis algorithms for 454 and Ion Torrent TCR data
One problem that is specific to the 454 and Ion Torrent platforms is imperfect
interpretation of homopolymeric stretches, as this difficulty leads to erroneous nucleotide
deletions and insertions. To take the uncertainty in nucleotide lengths into account for
further analysis, we initially rebuild sequences from the raw flowgrams, where we interpret
ambiguous signals by rounding up with the last position in a stretch designated as
“uncertain”. For example, raw flowgram values of 3.8-4.2 are interpreted as a 4-bp
nucleotide stretch, whereas values of 4.21-4.79 are interpreted as a 5-bp stretch with the 5th
position designated “uncertain”. Values of 4.8-5.2 are interpreted as a certain 5-bp stretch.
The “uncertain” positions are treated in further analysis as possible insertions, thus
allowing us to correct for this type of sequencing error as described below.
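The rounding rule described above can be written down as a small sketch. The thresholds (fractional part up to 0.2 rounds down with certainty, 0.8 and above rounds up with certainty, anything in between rounds up with an "uncertain" last base) follow the examples in the text. The function names and the convention of writing uncertain bases in lower case are assumptions made for illustration:

```python
def interpret_flow(value: float):
    """Return (stretch_length, last_base_uncertain) for one raw flowgram value."""
    hundredths = round(value * 100)       # integer arithmetic avoids float edge cases
    whole, frac = divmod(hundredths, 100)
    if frac <= 20:
        return whole, False               # e.g. 4.00-4.20 -> certain 4-bp stretch
    if frac >= 80:
        return whole + 1, False           # e.g. 3.80-3.99 -> certain 4-bp stretch
    return whole + 1, True                # e.g. 4.21-4.79 -> 5 bp, 5th base uncertain


def rebuild_sequence(flow_order: str, flow_values):
    """Rebuild a read from flow values; uncertain bases are written in lower case."""
    parts = []
    for base, value in zip(flow_order, flow_values):
        length, uncertain = interpret_flow(value)
        if length == 0:
            continue
        stretch = base.upper() * length
        if uncertain:
            stretch = stretch[:-1] + base.lower()
        parts.append(stretch)
    return "".join(parts)


print(rebuild_sequence("TACG", [1.02, 0.05, 2.5, 4.1]))
```

Downstream steps can then treat each lower-case base as a possible insertion when aligning reads or comparing them with core clonotypes.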
The whole analysis algorithm is generally similar to the algorithm we proposed for the Illumina:
i) Rebuilding sequences from raw data. Ambiguous flowgram values are rounded
up, with the last position in the resulting nucleotide stretch designated as “uncertain.”
ii) CDR3 extraction. Similar to the algorithm for the Illumina, for identification of
the V, D, and J gene segments, “uncertain” nucleotides are treated as possible insertions.
iii) Formation of “core clonotypes.” As for the Illumina, high-quality reads without
ambiguous flowgram values within the CDR3 are merged to form “core clonotypes.”
iv) Mapping of low-quality reads. Similar to the algorithm for the Illumina, up to 3
“uncertain” nucleotides within the CDR3 are treated as possible insertions and
corresponding sequencing reads are merged with the “core clonotypes.”
v) Correction of PCR errors. Correction is identical to the algorithm for the Illumina.
Additional modification of the algorithm is possible by employing asymmetric
interpretation of raw flowgram data that have an initial systematic shift in signal readout. For
example, flowgram values of 3.9-4.3 can be interpreted with certainty as a 4-bp
stretch, leading to a -0.1-bp shift in the average lengths of CDR3 sequences.
As shown in Table 2, advanced error correction works efficiently for the 454 and
Ion Torrent data and results in a dramatic reduction of interpreted TCR diversity, minimal
loss of quantitative information, and up to a 4-fold increase in the read-per-clonotype
To evaluate the error rate of deletions and insertions, we performed in silico
spectratyping analysis of sequence datasets, in which functional “in-frame” clonotypes
have CDR3 nucleotide sequence lengths in multiples of 3. Stretch errors result in the
generation of artificial shorter or longer out-of-frame sequences (Fig. 2). Without
correction, as many as 29.3% of the 454 sequences and 25.0% of the Ion Torrent sequences
fell into the out-of-frame columns, compared to only 1% of the Illumina sequences. In the
Illumina data, these sequences may result from naturally present non-functional
out-of-frame TCR beta RNA molecules that have escaped nonsense-mediated decay. The
percentage of out-of-frame CDR3 sequencing reads decreased after advanced
correction: to 22.2% for the 454 data and to 7.2% for the Ion Torrent data. Overall, advanced
correction allowed the conversion of Ion Torrent data to Illumina-like results, which are
intrinsically free of deletion/insertion errors (Table 2 and Fig. 2D). With other more
abundant 454 datasets of higher quality, correction for the out-of-frame reads was also
quite efficient (data not shown).
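The in silico spectratyping described above amounts to tabulating read counts by CDR3 nucleotide length and summing the share of lengths that are not multiples of 3. A minimal sketch, assuming clonotypes are available as a mapping from CDR3 nucleotide sequence to read count (the function names and the toy data are illustrative):

```python
from collections import Counter


def length_spectrum(cdr3_counts):
    """Sum read counts per CDR3 nucleotide length (an in silico spectratype)."""
    spectrum = Counter()
    for cdr3, reads in cdr3_counts.items():
        spectrum[len(cdr3)] += reads
    return spectrum


def out_of_frame_fraction(cdr3_counts):
    """Fraction of reads whose CDR3 length is not a multiple of 3."""
    spectrum = length_spectrum(cdr3_counts)
    total = sum(spectrum.values())
    if total == 0:
        return 0.0
    out = sum(n for length, n in spectrum.items() if length % 3 != 0)
    return out / total


# Toy data: one in-frame clonotype plus a deletion and an insertion variant.
toy = {"TGTGCCAGC": 70, "TGTGCCAG": 20, "TGTGCCAGCA": 10}
print(out_of_frame_fraction(toy))
```

Comparing this fraction before and after correction gives the kind of summary reported for the three platforms.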
Still, the relatively low number of high-quality CDR3-containing sequencing reads
obtained from Ion Torrent and 454 compared to Illumina creates an additional bottleneck at
the sequencing level, which leads to substantial loss of low-abundance clonotypes and
decreases the efficiency of the error correction algorithms.
The overall rate of corrected PCR errors for each of the three NGS platform datasets
was analyzed by counting the number of mismatch-containing high-quality sequencing
reads mapped to the high-quality “core clonotypes.” The percentage of such mismatch-
containing sequencing reads was notably higher for the Illumina dataset than for the 454
and Ion Torrent datasets (3.2%, 1.4%, and 1.2% of sequencing reads, respectively),
corresponding to recently reported data that demonstrated a high rate of PCR errors during
Illumina bridge amplification, leading to up to 3.19% of high-quality Illumina reads with
PCR errors.
As expected, the rate of mismatched nucleotide incorporation was not evenly
distributed, and transitions were favored over transversions (see Supporting Information
Table 2). Interestingly, we obtained a significantly higher rate for A-to-G substitutions as
compared to T-to-C substitutions for all platforms (38-41% vs. 13-20%). As PCR
substitutions should be symmetrical (i.e., an A-to-G substitution on the antisense strand will
result in a T-to-C substitution on the sense DNA strand, yielding equivalent rates for A-to-G
and T-to-C substitutions), we suggest that these errors were generated during the course
of one of the asymmetrical processes, which include natural transcription errors, cDNA
synthesis, or sequencing. As this type of bias was observed in all datasets generated by
each of the different sequencing technologies and the natural transcription error rate is
generally low, these errors were most likely generated by reverse transcription.
Importantly, due to their early origin and high abundance within the final NGS data, these
frequent errors cannot be corrected by previously proposed algorithms [6, 11]; however,
our described algorithms efficiently capture such errors.
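The symmetry argument relies on strand complementarity: a substitution introduced on one strand reads as the complementary substitution on the other, so a strand-blind process such as PCR should produce both at similar rates. A small sketch of that bookkeeping follows. The function names are illustrative, and the two example rates are placeholder values chosen within the ranges quoted above, not measured data:

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def opposite_strand(substitution: str) -> str:
    """Map a substitution such as 'A>G' to how it reads on the complementary strand."""
    ref, alt = substitution.split(">")
    return f"{COMPLEMENT[ref]}>{COMPLEMENT[alt]}"


def strand_asymmetry(rates: dict, substitution: str) -> float:
    """Ratio of a substitution's rate to that of its complementary counterpart."""
    return rates[substitution] / rates[opposite_strand(substitution)]


# Placeholder shares of all substitutions (assumed values within the quoted ranges).
rates = {"A>G": 0.40, "T>C": 0.15}
print(opposite_strand("A>G"), strand_asymmetry(rates, "A>G"))
```

A ratio close to 1 is what a symmetrical process would give; a ratio well above 1 points to a strand-specific step.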
Quantification power of NGS-based TCR profiling
Quantitative NGS-based analysis of relative T-cell clone abundances within an
initial lymphocyte sample is highly desirable for the individual and for cohort comparative
studies of adaptive immunity . In order to compare the relative quantification accuracy
for the three platforms, we plotted the mean percentages of each J beta gene, V beta gene,
and the 50 most abundant individual CDR3 clonotypes for the Illumina dataset and
compared these to the 454 and Ion Torrent datasets (Fig. 3). We did not observe significant
bias between the 454 and Illumina datasets with respect to the abundance of J beta and V
beta gene segments as well as with respect to the major CDR3 clonotypes (Fig. 3A, B, and
C). We did, however, observe a significantly weaker correlation between the Illumina and
Ion Torrent data (Fig. 3D, E, and F) with a more than 100-fold difference for particular
CDR3 clonotypes. Comparison of NGS data with flow cytometry data obtained from the
blood cells of the same patient using TCR V beta family-specific antibodies demonstrated
that, while the 454 and Illumina data correlated well with the real abundance of
corresponding T-cell subsets, the Ion Torrent results were skewed with respect to the
relative TCR V beta gene abundances (Fig. 4).
The main bias in TCR quantification may arise from multiplex PCR, which is used
in the second amplification step during Illumina and Ion Torrent sample preparation, since
multiplex PCR is known to skew the abundance of template molecules [14, 15]. Therefore,
we attribute the dramatic skew in Ion Torrent data to multiplex amplification with a
complex mix of TCR V beta-specific primers that had to be employed for the Ion Torrent
library preparation. These data indicate that multiplex PCR for pre-sequencing library
preparation should be avoided or very thoroughly corrected to provide quantitative
measurement of TCR clonotype concentrations by NGS profiling. Apparently, multiplex
amplification with a simple mix of 13 highly homologous J gene-specific primers at the
second step of Illumina sample preparation did not significantly skew the natural
abundance of TCR molecules. However, this multiplexing can be avoided as well by using
primers specific to the constant region of TCR genes and 150 bp Illumina read length.
The overall accuracy of quantification of individual clonotypes using Illumina
sequencing was estimated via comparison of the datasets obtained for the three independent
blood samples (A, B, and C). These data were corrected using our advanced algorithm (see
above) or by removal of low-quality reads. As expected from the increased reads-per-run
parameter and the number of rescued low-quality reads, advanced correction significantly
(p=3.45E-7 by the Kruskal-Wallis test) increased the accuracy of quantification for T-cell
clonotypes with concentrations greater than 0.01% (Fig. 5). As a result, a quantification
accuracy of 50% (i.e., CI95% for concentration value is (C/1.5 : C*1.5), where C is the
measured relative concentration of a clonotype) was achieved for the clonotypes that
constitute at least 0.1% of all T cells.
Application of NGS for deep TCR repertoire characterization raised several
problems concerning the accuracy of sequencing and the quantification of individual TCR
clonotypes. Several methods of probe preparation and data processing for NGS TCR
repertoire analysis have been proposed [4-6, 8], but the issue of artificial TCR diversity
generation has not been fully resolved. Here, we propose correction algorithms that allow
the elimination of most accumulated errors without the risk of losing natural independent
clonotypes and with minimal loss of quantitative data, which is crucial for the proper
identification of TCR clonotypes and measurement of their actual diversity and relative
concentrations. Insertion/deletion errors characteristic of the 454 and Ion Torrent
technologies can also be eliminated quite efficiently after re-interpretation of raw flowgram
data. These algorithms, thus, allow use of these NGS platforms in applications where
longer reads (precise TCR V beta/alpha gene identification or antibody sequencing) or
faster or cheaper (per run but not per read) results are required.
Accurate quantification of certain T-cell clones in the analyzed sample is an
important goal in TCR repertoire studies that aim to track changes made by therapeutic
procedures or immune responses. We demonstrated that complex multiplex PCR during
pre-sequencing library preparation is inapplicable for quantitative TCR profiling by NGS.
Again, rational mapping of low-quality reads instead of blind filtering is crucial to preserve
quantitative information and to significantly increase the resulting accuracy of clonotype
The depth of TCR profiling is limited by bottlenecks that may appear at any stage,
including the initial T-cell count, number of molecules that successfully entered PCR
amplification, and number of output sequencing reads. Thus, even amplification starting
from a sufficient number of molecules is one of the crucial stages that largely determines
the scope of quantitative and unbiased analysis of complex TCR gene libraries. Either
genomic DNA or mRNA can be used as a starting material for TCR profiling, each with its
own pros and cons. We believe the latter is preferable in most cases for the following reasons:
1) Each T-cell contains multiple copies of RNA molecules that encode the TCR beta and
alpha chains, and these copies expand the potential bottleneck between sampled T cells
and the final TCR amplicon.
2) When starting from genomic DNA, the entire sample isolated from certain PBMC
aliquots must be amplified to gain a comprehensive representation of the TCR
repertoire. This inclusion may be technically challenging when large T-cell populations
are studied. For example, a starting sample of 10 million T cells requires amplification
of approximately 100 µg of PBMC-derived genomic DNA. In contrast, a reasonable
aliquot (5-10 µg) of an RNA sample obtained from the same cell population is
sufficient to sample the diversity.
3) At the DNA level, each T-cell carries two rearranged TCR beta genes, and one of them
is nonfunctional. In contrast, out-of-frame mRNA molecules are efficiently degraded by
the nonsense-mediated decay mechanism [16, 17], and thus, these nonfunctional
molecules are not sampled.
4) TCR beta cDNA can be synthesized and amplified using universal primers specific to
the constant region of TCR beta genes [5, 8]. In addition, the template switching effect
can be used in reverse transcription, providing universal priming at the 5’ end of
mRNA. This technique allows us to avoid use of the complex multiplex primer sets
required to amplify all TCR V beta and J beta gene variants and thus, decreases PCR
Depending on the questions posed, the required depth of TCR repertoire analysis
can vary across a wide range. Cell count intervals that are calculated according to the
Poisson model suggest that a comprehensive analysis of a population containing a diversity
of 10^7 (estimated individual TCR beta diversity) requires analysis of a sample of at least
10^8 cells. The corresponding number of T cells is contained in approximately 50-70 ml of
blood from a healthy individual. At any further stages of library preparation, starting from
DNA or RNA purification and up to the NGS output, the number of molecules/sequencing
reads should be well above 100 million. These conditions are potentially feasible (albeit
laborious and expensive) prerequisites for the nearly comprehensive analysis of an
individual’s TCR beta diversity, which includes not only effector and memory T cells but
also a considerable portion of the naïve T-cell repertoire. Much smaller bottleneck limits
can be estimated for the analysis of a narrower T-cell population, such as a fraction of
effector or memory T cells or a pre-sorted subpopulation of specific TCR V beta family T
cells, Treg cells, MHC-tetramer sorted cells, etc. [2, 19]. Still, a clear understanding of
parameters that determine the sample preparation bottlenecks and their minimal required
size is important for adequate experimental design.
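The sampling requirement discussed above can be illustrated with a simple Poisson calculation. The sketch below assumes, for simplicity, that all clones are equally frequent. Real repertoires are strongly skewed, so this is only an order-of-magnitude guide, and the function names are illustrative:

```python
import math


def p_clone_sampled(clone_frequency: float, cells_sampled: float) -> float:
    """Poisson probability that at least one cell of a clone is in the sample."""
    return 1.0 - math.exp(-clone_frequency * cells_sampled)


def expected_clones_observed(diversity: float, cells_sampled: float) -> float:
    """Expected number of distinct clones seen, assuming equally frequent clones."""
    return diversity * p_clone_sampled(1.0 / diversity, cells_sampled)


for cells in (1e7, 1e8):
    seen = expected_clones_observed(1e7, cells)
    print(f"{cells:.0e} cells -> {seen / 1e7:.4%} of 1e7 clones expected")
```

Under this uniform assumption, sampling as many cells as there are clones is expected to capture only about 63% of them, whereas a 10-fold excess captures more than 99.99%, consistent with the 10^8-cell figure given in the text.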
In conclusion, rational analysis of the TCR repertoire by NGS in order to make firm
and clear statements based on the data obtained demands intelligent design of the whole
experimental pipeline, including blood sampling, DNA/RNA purification, cDNA synthesis,
PCR amplification, pre-sequencing preparation, sequencing depth, platform choice, and
finally intelligent interpretation of the NGS output with regard to all potential errors
accumulated above. We believe that the algorithms proposed here for the efficient
correction of PCR and sequencing errors and for rescue of the quantitative information
contained in low-quality reads should contribute to rational analysis of the TCR repertoire
and fuel crucial adaptive immunity studies.
Materials and Methods
Accession number for all raw sequencing data of this study:
NCBI Sequence Read Archive (SRA): SRP012846.
Sample preparation and sequencing
Three aliquots of peripheral blood from a 43-year-old man were obtained. Informed consent was
obtained from the patient for these procedures. PBMCs (3-4 × 10^6) for each of three replicates (A,
B, and C) were isolated by Ficoll-Paque (Paneco, Russia) density gradient centrifugation. RNA was
isolated using the Trizol reagent (Invitrogen, USA) according to the manufacturer’s protocol. First-
strand cDNA was synthesized for 2 h using the Mint cDNA synthesis kit (Evrogen, Russia) and the
primer BC1R (5’-CAGTATCTGGAGTCATTGA-3’), which is specific to both variants of the TCR
beta constant regions. PlugOligo (AAGCAGTGGTATCAACGCAGAGTACGGGGG-P, Evrogen)
was added after 30 min of synthesis. The PCR amplification protocol was as follows for 18 cycles:
94°C for 20 s, 65°C for 20 s, and 72°C for 50 s. The PCR mixture (15 µl) contained 1x Encyclo
polymerase buffer (Evrogen), 0.125 mM of each dNTP, 10 pmol of universal primers: M1
(AAGCAGTGGTATCAACGCAGAGT) and BC2R (5’-TGCTTCTGATGGCTCAAACAC-3’), 0.3
µl of Encyclo polymerase mix, and 1 µl of undiluted first-strand cDNA. After the first-strand cDNA
synthesis, the first PCR was performed in 15 tubes in order to amplify all TCR molecules for each
of the independent sample replicates (A, B, and C). Then, the PCR products from all 15 tubes were
mixed and diluted 1:10. Then, 1 µl of the diluted PCR product was used in each of the 10
second-step PCR reactions (25 µl each), which were performed with conditions specific for each sample
For the 454 sample: We used nested 3’ primers with barcodes that distinguished the A, B, and C
replicates in combination with a universal 5’ primer to generate a 500-550-bp amplicon readily
flanked by 454 sequencing primers. Thus, 454 reads directionally started from the constant TCR
beta region and covered the J gene segment, CDR3, and a part of the V gene segment that was
sufficient for its unambiguous identification (Supporting Information Fig. 1). This technique
permits amplification of the TCR beta genes with the same pair of primers for all TCR V beta gene
variants, and this property decreases potential PCR bias. Each PCR reaction contained 1 µl of the
first PCR (diluted), 1x PCR buffer, 0.125 mM of each dNTP, 10 pmol of the primer B-M1
(5’-GCCTTGCCAGCCCGCTCAGAAGCAGTGGTATCAACGCAGAGT-3’) and 10 pmol of one of
the reverse primers, which contained specific barcodes for each of the three replicates (see
Supporting Information Table 1). The amplification protocol was as follows for 12 cycles: 94°C
for 20 s, 68°C for 20 s, and 72°C for 50 s. After the second PCR the amplicons were purified by
agarose gel electrophoresis using a DNA gel extraction kit (Cytokin, Russia). The purified
amplicons were re-amplified for two additional cycles and finally purified with the QIAquick PCR
purification kit (Qiagen, USA). Sequencing was performed using Genome Sequencer FLX, GS emPCR
Kit II (Roche Applied Science).
For the Illumina sample: Due to the relatively short sequencing reads produced by the Illumina
platform, pre-sequencing preparation requires nested PCR amplification of the TCR beta library to
achieve sequencing of the CDR3 region of interest. To this end, we used a set of multiplex J beta-
specific primers containing an MmeI restriction site. MmeI treatment resulted in PCR fragments
that ended with the triplet coding for the last conserved phenylalanine of CDR3 and were, thus,
suitable for the 72-bp Illumina sequencing (Supporting Information Fig. 1). At the 5’ end, the
universal primer that annealed to the template formed by the template-switching effect was used,
resulting in a PCR library of fragments of approximately 450-500 bp. Each PCR reaction contained
1 µl of the diluted first PCR product, 1x PCR buffer, 0.125 mM of each dNTP, 5 pmol of primer
MmeStep1 (5’-CACTCTATCCGACAAGCAGT-3’), 0.5 pmol of MmeSmart20 (5’-
CACTCTATCCGACAAGCAGTGGTATCAACGCAG-3’), 3 pmol of each of the 13 MmeJ
primers (Supporting Information Table 1), and 0.5 µl of the Encyclo polymerase mix (Evrogen).
The amplification protocol was as follows for 12 cycles: 94°C for 20 s, 68°C for 20 s, and 72°C for
50 s. After the second PCR step, the amplicons were purified with the QIAquick PCR purification
kit (Qiagen, USA) and digested by MmeI (New England Biolabs, USA) according to the
manufacturer’s protocol. The efficiency of MmeI digestion was estimated to be about 80%
according to a control PCR with primers that annealed to the fragment to be removed by the
endonuclease (data not shown). After MmeI digestion, the DNA samples were purified by the
QIAquick PCR purification kit. After end polishing, the digested amplicons were ligated to bar-
coded adaptors to distinguish the A, B, and C replicates, processed using the standard Illumina
sequencing protocol, and sequenced on the Genome Analyzer IIx (Illumina).
For the Ion Torrent sample: The Ion Torrent generates more than 10 Mbp of data with sequencing
reads of about 100 bp (314 chip). One feature that distinguishes this platform from 454
and Illumina is the length restriction of the DNA fragments (150-200 bp) that can bind to the
sequencing chip. Therefore, to prepare a library suitable for Ion Torrent sequencing, multiplex PCR
was implemented on both sides of the TCR molecules at the second PCR stage (Supporting
Information Fig. 1). The diluted first PCR reactions for the replicates A, B, and C were mixed
together prior to the second PCR. Each of the second PCR reactions contained 1 µl of the first PCR
(diluted), 1x PCR buffer, 0.125 mM of each dNTP, 3 pmol of each of the 13 BJ primers, 40 pmol
of mixed bvs primers (Supporting Information Table 1), and 0.5 µl of the Encyclo polymerase
mix (Evrogen). The amplification protocol was as follows for 17 cycles: 94°C for 20 s, 62°C for 20
s, and 72°C for 50 s. The PCR product was precipitated with ethanol and purified by agarose gel
electrophoresis using a DNA gel extraction kit (Promega, USA). The obtained amplicon was ligated
to adaptors and sequenced on the Ion Torrent PGM (Life Technologies) machine.
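Because PCR errors accumulate with every cycle, it can be useful to tabulate the amplification load that each library received. The following is a minimal bookkeeping sketch in Python that uses only the cycle numbers and annealing temperatures stated above. The names `PcrStep`, `SECOND_STAGE`, and `total_cycles` are illustrative and are not part of the original analysis software, and any amplification inside the platform-specific sequencing protocols is not counted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PcrStep:
    label: str
    cycles: int
    anneal_c: Optional[int]  # annealing temperature (deg C); None where the text gives none


# First PCR, shared by all three libraries (primers M1 + BC2R).
FIRST_PCR = PcrStep("first PCR", 18, 65)

# Platform-specific second-stage amplification, as described in the text.
SECOND_STAGE = {
    "454": [PcrStep("second PCR", 12, 68), PcrStep("re-amplification", 2, None)],
    "Illumina": [PcrStep("second PCR", 12, 68)],
    "Ion Torrent": [PcrStep("second PCR", 17, 62)],
}


def total_cycles(platform: str) -> int:
    """Sum of the PCR cycles stated in the text for one platform."""
    return FIRST_PCR.cycles + sum(step.cycles for step in SECOND_STAGE[platform])


if __name__ == "__main__":
    for platform in SECOND_STAGE:
        print(platform, total_cycles(platform))
```

Under these numbers the 454, Illumina, and Ion Torrent libraries received 32, 30, and 35 stated cycles, respectively.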
CDR3 extraction algorithm
We implemented the following algorithm for the initial CDR3 extraction. The J gene segment is
localized based on the identity of a short 6-nt motif to one of the 13 J genes from the IMGT
database. The region of identity is then expanded into the CDR3 until the last matched nucleotide
is encountered. The list of possible J segments for each read is generated to include the longest
identical one(s) and the one(s) that are 2 nt shorter (if any). The V gene segment is then localized
based on the identity of the short 7-nt motif that precedes the conserved Cys to one of the
functional V genes from the IMGT database. The region of identity is then expanded in both
directions until the last matched nucleotide is encountered. The list of possible V segments for each
read is generated to include the longest identical one(s) and the one(s) not more than 2 nt shorter (if
any). The D gene segment is localized based on the identity of at least 6 nt between the identified V
and J gene segments. The CDR3 is extracted for each read as the nucleotide sequence between and
including the triplets coding for the conserved Cys located in the V gene segment and for the
conserved Phe located in the J gene segment. The extracted CDR3 sequences are then further
processed as described in the main text.
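The seed-and-extend logic described above can be illustrated with a short Python sketch. This is a simplified illustration under stated assumptions, not the software used in this work: the reference sequences are toy strings rather than IMGT sequences, the J seed is assumed to start at the Phe codon, reads are assumed to be in the V-to-J sense orientation, candidates up to 2 nt shorter than the best match are retained for both V and J, and D segment assignment is omitted. All names (`Segment`, `locate_v`, `locate_j`, `extract_cdr3`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Segment:
    name: str
    seq: str     # germline sequence, sense strand (toy data here, not IMGT)
    anchor: int  # index of the first base of the conserved Cys (V) or Phe (J) codon


Hit = Tuple[int, str, int]  # (matched length, segment name, anchor codon position in read)


def _extend(read: str, ref: str, r_pos: int, g_pos: int, step: int) -> int:
    """Count identical bases walking from (r_pos, g_pos) in direction step."""
    n = 0
    while 0 <= r_pos < len(read) and 0 <= g_pos < len(ref) and read[r_pos] == ref[g_pos]:
        n += 1
        r_pos += step
        g_pos += step
    return n


def _keep_best(hits: List[Hit], tolerance: int = 2) -> List[Hit]:
    """Keep the longest match(es) and those at most `tolerance` nt shorter."""
    if not hits:
        return []
    best = max(h[0] for h in hits)
    return [h for h in hits if best - h[0] <= tolerance]


def locate_v(read: str, v_refs: List[Segment], seed_len: int = 7) -> List[Hit]:
    hits = []
    for v in v_refs:
        g0 = v.anchor - seed_len            # seed = 7 nt preceding the Cys codon
        pos = read.find(v.seq[g0:v.anchor]) if g0 >= 0 else -1
        if pos < 0:
            continue
        left = _extend(read, v.seq, pos - 1, g0 - 1, -1)
        right = _extend(read, v.seq, pos + seed_len, v.anchor, +1)
        hits.append((left + seed_len + right, v.name, pos + seed_len))
    return _keep_best(hits)


def locate_j(read: str, j_refs: List[Segment], start: int = 0, seed_len: int = 6) -> List[Hit]:
    hits = []
    for j in j_refs:
        # Assumption: the 6-nt seed begins at the Phe codon.
        pos = read.find(j.seq[j.anchor:j.anchor + seed_len], start)
        if pos < 0:
            continue
        left = _extend(read, j.seq, pos - 1, j.anchor - 1, -1)  # expand into the CDR3
        hits.append((left + seed_len, j.name, pos))
    return _keep_best(hits)


def extract_cdr3(read: str, v_refs: List[Segment], j_refs: List[Segment]) -> Optional[str]:
    """Return the sequence from the Cys codon through the Phe codon, or None."""
    v_hits = locate_v(read, v_refs)
    if not v_hits:
        return None
    cys = max(v_hits)[2]
    j_hits = locate_j(read, j_refs, start=cys + 3)
    if not j_hits:
        return None
    phe = max(j_hits)[2]
    return read[cys:phe + 3]


# Toy references and read, for illustration only.
V_TOY = Segment("V-toy", "ACGTTGCAGGTCTACTTCTGTGCCAGCAGC", anchor=18)
J_TOY = Segment("J-toy", "CTACGAGCAGTACTTCGGGCCGGGC", anchor=13)
READ = "GGTCTACTTC" + "TGTGCCAGC" + "AGGGGA" + "GAGCAGTAC" + "TTCGGGCCG"

if __name__ == "__main__":
    print(extract_cdr3(READ, [V_TOY], [J_TOY]))
```

In this toy read the extracted CDR3 spans the Cys codon (TGT) through the Phe codon (TTC), including the six inserted nucleotides between the trimmed V and J ends.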
Conflict of interest
The authors declare no financial or commercial conflict of interest.
MIZ, ZIV, USV, BOV, and TMA optimized the DNA technique and prepared the samples
for mass sequencing. BDA and CDM developed the algorithm for TCR analysis and
performed the statistical calculations. CDM, LYB, and LSA contributed to the overall idea,
organization, and manuscript preparation.
This work was supported by the Molecular and Cell Biology program of the Russian
Academy of Sciences; Russian Science Support Foundation, Russian Foundation for Basic
Research (10-04-01771-?, 11-04-12042-ofi-m) and a grant from the President of the
Russian Federation (MK-575.2011.4).
We thank Alexey Kurnosov for help with manuscript preparation.
Figure 1. Abundance distribution of clonotypes sequenced 1 to 30 times (Illumina data)
(A) before any correction, (B) after removal of the low-quality sequencing reads and (C)
after correction of PCR errors and mapping of low-quality sequencing reads. Data shown
are representative of three independent experiments.
Figure 2. In silico spectratyping analysis for sequence datasets of (A) Illumina without
correction, (B) 454 without correction, (C) Ion Torrent without correction and (D) Ion
Torrent after asymmetric correction based on raw flowgram data. “Out-of-frame” CDR3
sequencing reads are shown in light gray. Data shown are representative of three
independent experiments for Illumina and 454 and one experiment for Ion Torrent.
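As an illustration of how such an in silico spectratype can be tabulated, the following Python sketch bins reads by CDR3 nucleotide length and separates in-frame from out-of-frame sequences. It assumes that a CDR3 is out-of-frame when its length, counted from the Cys codon through the Phe codon, is not a multiple of three. The function name `spectratype` and the input layout (CDR3 sequence mapped to read count) are hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple


def spectratype(cdr3_counts: Dict[str, int]) -> Tuple[Counter, Counter]:
    """Bin read counts by CDR3 length (nt).

    Returns (in_frame, out_of_frame) Counters keyed by CDR3 length.
    Assumption: out-of-frame means the length is not divisible by 3.
    """
    in_frame, out_of_frame = Counter(), Counter()
    for seq, n_reads in cdr3_counts.items():
        target = in_frame if len(seq) % 3 == 0 else out_of_frame
        target[len(seq)] += n_reads
    return in_frame, out_of_frame


if __name__ == "__main__":
    # Toy clonotype table, for illustration only.
    toy = {"TGTGCCAGCTTC": 5, "TGTGCCAGCATTC": 2, "TGTGCCAGCAGCTTC": 3}
    in_frame, out_of_frame = spectratype(toy)
    for length in sorted(set(in_frame) | set(out_of_frame)):
        print(length, in_frame[length], out_of_frame[length])
```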
Figure 3. Relative frequency of TCR beta J gene segments, V gene segments, and the 50
most abundant CDR3 clonotypes for the three NGS platforms. Data shown are
representative of three independent experiments for Illumina and 454 and one experiment
for Ion Torrent. Graphs show (A-C) Illumina versus 454 and (D-F) Illumina versus Ion
Torrent. Data are shown as the mean percentage of each TCR J beta or V beta gene of three
independent experiments with the 67% confidence interval.
Figure 4. Relative abundance of selected TCR V beta gene segment variants normalized to
TRBV9 (IMGT classification). Data shown are representative of two independent
experiments for flow cytometry, three independent experiments for Illumina, three
independent experiments for 454 and one experiment for Ion Torrent.
Figure 5. Accuracy of relative concentration measurements for the TCR clonotypes on
Illumina. The geometric standard deviation (GSD) between the three independent experiments
was plotted against the average calculated clonotype frequencies (clonotypes were grouped
into subsets of different concentration ranges; frequency borders for the subsets are
indicated by vertical dashed lines). The median (line), 25th-75th percentile range (box), and
full range (whiskers) of the GSD are shown for the three datasets analyzed after removal
of the low-quality sequencing reads (light gray) and after application of the advanced error
correction algorithm (dark gray). We used the geometric standard deviation here because error in
relative concentration is better described by a log-normal distribution than by a normal distribution.
Data shown are representative of a single set of three independent experiments.
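For readers unfamiliar with the measure, the GSD of a clonotype's frequency across replicates is the exponential of the standard deviation of the log-transformed frequencies. The sketch below shows one way to compute it in Python. It assumes the sample standard deviation (n - 1 in the denominator), which the caption does not specify, and the function name `geometric_sd` is illustrative.

```python
import math
from statistics import stdev
from typing import Sequence


def geometric_sd(freqs: Sequence[float]) -> float:
    """Geometric standard deviation: exp of the SD of the natural-log values.

    Assumption: sample SD (n - 1) is used. All frequencies must be positive.
    """
    logs = [math.log(f) for f in freqs]
    return math.exp(stdev(logs))


if __name__ == "__main__":
    # Toy frequencies of one clonotype in three replicates.
    print(geometric_sd([0.0010, 0.0012, 0.0009]))
```

A GSD of 1 means identical frequencies in all replicates. A GSD of 2 means that the replicate values typically differ from their geometric mean by a factor of about two.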
Table 1. Typical examples of sequencing reads with PCR errors merged with the dominant “core
clonotype.”
a The dominant “core clonotype,” which determines the nucleotide sequence of the final
clonotype, is shown in bold. V segments (orange), D segments (blue), J segments (green),
and corrected errors (bold red) within the CDR3 nucleotide sequence are shown. Secondary
PCR errors are highlighted in gray. Because of space limitations, a limited number of low-
abundance subvariants is shown.
[Table 1 body: the CDR3 nucleotide sequences and read counts were not recoverable; the surviving entries assign the listed reads to V10-3, J2-1, and D2.]
Table 2. Key parameters of TCR NGS data correction.
Illumina GAII, 1 lane    454 FLX, 1/16 plate    Ion Torrent, 314 chip
a Only half of the Illumina reads began from the required 3’-end of the TCR beta fragment, as a single-end
run was used.
b Total number of sequencing reads successfully mapped to clonotypes. Percentage of raw sequencing
information used is shown in parentheses.
c Low quality was assigned for the sequencing reads that contained low-quality nucleotides within the
identified CDR3, using phred 30 (q) for Illumina, phred 16 (q) for 454, and flowgram values beyond
-0.2/+0.35 (asymmetric) for Ion Torrent.
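The thresholds in footnote c can be expressed as a simple read filter. The Python sketch below is an interpretation, not the authors' implementation. It assumes that a nucleotide is low quality when its phred score is below the stated threshold. It also assumes that a flowgram value is acceptable when it lies within -0.2 to +0.35 of an integer homopolymer length. The names `low_quality_phred`, `flow_ok`, and `low_quality_flowgram` are hypothetical.

```python
import math
from typing import Sequence

# Phred thresholds from footnote c (assumed: scores below these are low quality).
PHRED_MIN = {"illumina": 30, "454": 16}

# Asymmetric window around an integer flow value (Ion Torrent).
FLOW_LOW, FLOW_HIGH = -0.2, 0.35


def low_quality_phred(cdr3_quals: Sequence[int], platform: str) -> bool:
    """True if any CDR3 base has a phred score below the platform threshold."""
    return any(q < PHRED_MIN[platform] for q in cdr3_quals)


def flow_ok(signal: float, eps: float = 1e-9) -> bool:
    """True if the signal lies within [-0.2, +0.35] of some integer.

    eps absorbs floating-point rounding at the window edges.
    """
    for n in (math.floor(signal), math.ceil(signal)):
        if FLOW_LOW - eps <= signal - n <= FLOW_HIGH + eps:
            return True
    return False


def low_quality_flowgram(cdr3_flows: Sequence[float]) -> bool:
    """True if any flow value within the CDR3 falls outside the window."""
    return any(not flow_ok(s) for s in cdr3_flows)


if __name__ == "__main__":
    print(low_quality_phred([35, 29, 40], "illumina"))
    print(low_quality_flowgram([1.05, 0.02, 2.9]))
```

Under this reading, the window is wider above an integer than below it, so a flow value of 1.3 is accepted as a 1-mer, whereas a value of 1.7 is rejected rather than being called as a 2-mer.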