BEYOND ATOPY: MULTIPLE PATTERNS OF SENSITIZATION IN RELATION TO
ASTHMA IN A BIRTH COHORT STUDY
Angela Simpson*1, Vincent Y. F. Tan*2, John Winn3, Markus Svensén3, Christopher M. Bishop3,
David E. Heckerman4, Iain Buchan5, Adnan Custovic1
*Joint first authorship
1The University of Manchester, Manchester Academic Health Science Centre, NIHR
Translational Research Facility in Respiratory Medicine, University Hospital of South
Manchester NHS Foundation Trust, Manchester, UK
2Stochastic Systems Group, Laboratory for Information and Decision Systems, Massachusetts
Institute of Technology, Cambridge, MA 02139, USA
3Microsoft Research Cambridge, Cambridge, UK
4eScience Research Group, Microsoft Research, Redmond, Washington 98052, USA
5The University of Manchester, Northwest Institute for Bio-Health Informatics (NIBHI),
Correspondence and requests for reprints:
Dr Angela Simpson, University of Manchester, ERC Building, Second floor, Wythenshawe
Hospital, Manchester M23 9LT, UK
Phone: +44 161 291 5871, Fax: +44 161 291 5730, Email: firstname.lastname@example.org
Funding: Asthma UK Grant No 04/014, Moulton Charitable Trust, The James Trust and
Running title: Atopic latent vulnerability and asthma
Descriptor Number: 101
Page 1 of 51
Media embargo until 2 weeks after above posting date; see thoracic.org/go/embargo
AJRCCM Articles in Press. Published on February 18, 2010 as doi:10.1164/rccm.200907-1101OC
Copyright (C) 2010 by the American Thoracic Society.
In epidemiological studies and clinical practice, children are classified as atopic if they have a
positive IgE or skin prick test. By adopting a machine learning approach, we have identified that
IgE antibody responses do not reflect a single phenotype of atopy, but multiple different atopic
vulnerabilities. We have demonstrated that only one of these atopic classes (Multiple Early
Atopic Vulnerability) predicts asthma.
Abstract Word Count: 249
This article has an online data supplement, which is accessible from this issue's table of contents
online at www.atsjournals.org
Background: The pattern of IgE response (over time or to specific allergens) may reflect
different atopic vulnerabilities which are related to the presence of asthma in a
fundamentally different way from the current definition of atopy.
Methods: In a population-based birth cohort in which multiple skin and IgE tests have
been taken throughout childhood, we used a machine learning approach to cluster
children into multiple atopic classes in an unsupervised way. We then investigated the
relation between these classes and asthma (symptoms, hospitalizations, lung function
and airway reactivity).
Results: A five-class model indicated a complex latent structure, in which children with
atopic vulnerability were clustered into four distinct classes (Multiple Early [112/1053,
10.6%]; Multiple Late [171/1053, 16.2%]; Dust Mite [47/1053, 4.5%]; and Non-dust Mite
[100/1053, 9.5%]), with a fifth class describing children with No Latent Vulnerability
[623/1053, 59.2%]. The association with asthma was considerably stronger for Multiple
Early compared to other classes and conventionally defined atopy (odds ratio [95% CI]:
29.3 [11.1-77.2] vs. 12.4 [4.8-32.2] vs. 11.6 [4.8-27.9] for Multiple Early class vs. Ever
Atopic vs. Atopic age 8). Lung function and airway reactivity were significantly poorer
amongst children in Multiple Early class. Cox regression demonstrated a highly
significant increase in risk of hospital admissions for wheeze/asthma after age 3 years
only amongst children in the Multiple Early class (HR 9.2 [3.5-24.0], p<0.001).
Conclusions: IgE antibody responses do not reflect a single phenotype of atopy, but
several different atopic vulnerabilities which differ in their relation with asthma presence
Key words: asthma, atopy, unsupervised clustering, Bayesian inference, machine
learning in epidemiology
Atopy is a term describing the tendency to become IgE-sensitized to common allergens
to which most people are exposed but do not mount a prolonged IgE antibody response(1,
2). In most literature, atopic sensitization is defined as a positive allergen-specific serum
IgE (sIgE) test or skin prick test (SPT) to any common food or inhalant allergen(s), and
atopic sensitization thus defined remains the single strongest risk-factor for asthma in the
western world(3-5). Although evidence from twin and family studies suggests a strong
genetic component of atopy(6), more than a decade of intensive work has failed to
identify causal associations with genetic variants that are consistently replicated(7).
Similarly, the increase in prevalence of atopy since the 1960s suggests an important
environmental component, but no environmental exposure has consistently been
associated with the development of atopy(8). We propose that one reason for this is
phenotypic heterogeneity, as the diagnostic label of “atopy” may encompass many
different phenotypes with different aetiologies, not all of which are associated with
symptomatic disease. The conventional epidemiological approach does not reflect the
complexities of disease; consequently, reproducible genetic and environmental studies
We speculate that the presence of a positive ‘allergy test’ (either sIgE or SPT) does not
equate to the atopic phenotype associated with symptomatic allergic disease. We
hypothesize that more useful information may be obtained by identifying common
underlying statistical clusters that are characterized by IgE responses. Several recent
publications have demonstrated the utility of using a clustering approach in
multidimensional data to identify different asthma phenotypes(9-12). Results of latent
class analysis on a large dataset collected annually over a 7 year period identified six
childhood wheezing phenotypes, two of which had not been described previously(10).
Unsupervised hierarchical cluster analysis identified five distinct clinical phenotypes of
adult asthma, emphasising the need for new approaches for classification of disease
phenotypes(11). We conducted a Principal Component Analysis using answers to
multiple questions relating to wheeze to identify five syndromes of coexisting symptoms
which are likely to reflect different underlying pathophysiologic processes(12).
Ideally, one should aim to model all available data (i.e. multiple measurements at
multiple time points) to identify latent variables which best describe the structure of the
data. Such models would need to be tailored to individual datasets, to precisely encode
prior knowledge and to scale up to large volumes of data. A machine learning approach
using Bayesian inference for unsupervised learning of latent variables to identify
structure within the data is used commonly by computer scientists for problems in many
other fields, and is ideally suited to this task. We applied this approach to a large
complex data-set from a population-based birth cohort in which measures of allergic
sensitization (both sIgE and SPT) to multiple inhalant and food allergens have been
taken throughout childhood, to assign children to atopic latent classes in an
unsupervised way, thus avoiding constraints placed by pre-specified ideas of the nature
and number of such classes. We sought to investigate whether these different latent
atopic classes were related to the presence or absence of asthma, in ways that are
fundamentally different from current diagnostic categories.
Study design, setting and participants
Manchester Asthma and Allergy Study is a population-based birth cohort(13-16) (detailed
description in Online supplement). Participants were recruited prenatally and followed
prospectively, attending review clinics at ages 1, 3, 5 and 8 years. The study is
registered as ISRCTN72673620 and approved by the Local Research Ethics Committee
(04/Q1403/45). Written informed consent was obtained from all parents, and children
gave their assent.
Definition of variables
Atopic sensitization: We ascertained atopic sensitization by skin-prick tests (Hollister-
Stier, VA, USA) and measurement of sIgE (ImmunoCAP, Phadia, Uppsala, Sweden) at
each time point to a panel of inhalant and food allergens (summarized in Table E2). We
defined allergen-specific sensitization as mean wheal diameter at least 3mm greater than
the negative control and/or specific IgE≥0.35 kU/l. The conventional definition
considered a child to be atopic if (s)he had allergen-specific sensitization to at least one
allergen. Children with any positive test (SPT or sIgE) at any time point were considered
to be “Atopic ever”.
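The conventional definitions above can be written as a short rule. The following is a minimal sketch, not the study's code: the function names and the input layout (one pair of results per allergen, `None` for a test that was not done) are assumptions made for illustration, as is the choice to treat a single negative test as "not sensitized" when the other test is missing.

```python
from typing import List, Optional, Tuple

SPT_THRESHOLD_MM = 3.0       # wheal at least 3 mm greater than the negative control
SIGE_THRESHOLD_KU_L = 0.35   # specific IgE >= 0.35 kU/l

# One allergen result: (wheal minus negative control in mm, sIgE in kU/l)
Result = Tuple[Optional[float], Optional[float]]


def is_sensitized(wheal_minus_control_mm: Optional[float],
                  sige_ku_l: Optional[float]) -> Optional[bool]:
    """Allergen-specific sensitization: positive SPT and/or positive sIgE.

    Returns None when both tests are missing (hypothetical convention).
    """
    if wheal_minus_control_mm is None and sige_ku_l is None:
        return None
    spt_positive = (wheal_minus_control_mm is not None
                    and wheal_minus_control_mm >= SPT_THRESHOLD_MM)
    sige_positive = sige_ku_l is not None and sige_ku_l >= SIGE_THRESHOLD_KU_L
    return spt_positive or sige_positive


def is_atopic(panel: List[Result]) -> bool:
    """Conventional atopy at one time point: sensitized to at least one allergen."""
    return any(is_sensitized(wheal, sige) for wheal, sige in panel)


def atopic_ever(visits: List[List[Result]]) -> bool:
    """'Atopic ever': any positive test at any time point."""
    return any(is_atopic(panel) for panel in visits)
```

For example, `atopic_ever([[(1.0, 0.1)], [(None, 0.5)]])` is true because the second visit has a positive sIgE.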
Wheeze: A validated questionnaire was interviewer-administered to collect information
on parentally-reported symptoms, physician-diagnosed illnesses and treatments
received. Current wheeze was defined as wheeze in the past 12 months.
Based on prospectively collected data, children were assigned to the following wheeze
phenotypes: No wheezing–no wheezing ever at any follow-up by age 8 years; Transient
early wheezing–wheezing only during the first 3 years; Late-onset wheezing–wheezing
started after age 3 years; Persistent wheezing–wheezing during the first 3 years,
wheezing in the previous 12 months at ages 5 and 8 years; Intermittent wheezing–
wheezing at one time point during the first 5 years, wheezing at age 8 years.
Lung function: We measured specific airway conductance (sGaw) using whole-body
plethysmography at age 3 and 5 years(15, 17) and FEV1 using spirometry at age 8 years
Airway hyper-reactivity (AHR-methacholine challenge): Assessed at age 8 years in a 5-
step protocol using quadrupling doses of methacholine (Table E1) according to ATS
guidelines(18). A dose-response ratio was calculated and transformed as previously
Asthma: We used a stringent epidemiological definition of asthma at age 8 years as
symptomatic airway hyper-reactivity (i.e. presence of current wheeze and positive
Hospital admission for asthma/wheeze: A trained physician reviewed the written and
computerized primary care medical records and extracted the data on hospitalizations for
wheeze or asthma(21).
We took a machine learning approach to the data analysis. Using a Hidden Markov
Model (HMM)(22), all available SPTs and sIgEs (collected at review clinics at ages 1, 3,
5 and 8 years) were used to infer one multinomial latent variable per child so as to
cluster the children in an unsupervised manner into different sensitization classes (Figure
1). At the core of the model are the 4 dichotomous latent Acquired Sensitization
variables for each allergen which are linked together in a Markov chain across the 4 time
points. We inferred time-dependent transition probabilities (i.e. the probabilities of
gaining and losing sensitization at each age) which were assumed to be shared by all
children in each sensitization class, but were allowed to differ between classes.
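To make the structure of one such Markov chain concrete, the sketch below computes the likelihood of one allergen's test results under one class's parameters. It is an illustration under simplifying assumptions, not the study's implementation: it uses the standard forward recursion rather than Variational Message Passing, it models a single test per time point (the study used both SPT and sIgE), and all names and parameter values are hypothetical.

```python
def sequence_likelihood(obs, p_initial, p_gain, p_loss, tpr, fpr):
    """Likelihood of a sequence of test results for one allergen.

    obs       : list of True / False / None per time point (None = test not done)
    p_initial : P(sensitized at the first time point)
    p_gain    : p_gain[t] = P(sensitized at t+1 | not sensitized at t)
    p_loss    : p_loss[t] = P(not sensitized at t+1 | sensitized at t)
    tpr, fpr  : true and false positive rates of the test
    """
    def emission(result, sensitized):
        if result is None:
            return 1.0  # missing observation contributes no evidence
        p_positive = tpr if sensitized else fpr
        return p_positive if result else 1.0 - p_positive

    # alpha[0], alpha[1] = P(observations so far, state = not sensitized / sensitized)
    alpha = [(1.0 - p_initial) * emission(obs[0], False),
             p_initial * emission(obs[0], True)]
    for t in range(1, len(obs)):
        gain, loss = p_gain[t - 1], p_loss[t - 1]
        not_sens, sens = alpha
        alpha = [
            (not_sens * (1.0 - gain) + sens * loss) * emission(obs[t], False),
            (not_sens * gain + sens * (1.0 - loss)) * emission(obs[t], True),
        ]
    return sum(alpha)


# Hypothetical parameters for one class and four time points (ages 1, 3, 5, 8)
likelihood = sequence_likelihood(
    obs=[False, None, True, True],
    p_initial=0.05, p_gain=[0.2, 0.3, 0.3], p_loss=[0.1, 0.05, 0.05],
    tpr=0.9, fpr=0.02,
)
```

Because the gain and loss probabilities are indexed by time step, a class in which sensitization is acquired early and one in which it is acquired late differ only in these parameter vectors.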
Inference: Inference was performed using Infer.NET
(http://research.microsoft.com/infernet), a Microsoft-owned library for large-scale
Bayesian inference, which is now freely available for research purposes. We used
Infer.NET to infer the false and true positive rates of the SPTs and sIgEs, missing values,
the class-specific state-transition probabilities, the observation (emission) probabilities,
the acquired sensitization variables and also the sensitization class for each child. An
approximate Bayesian inference method (Variational Message Passing-VMP)(23) was
used to perform the inference in an efficient manner.
Robustness and reproducibility: Robust and reproducible clustering was achieved by
training the sensitization HMM multiple times on different subsets of children and
selecting the clustering which both gave good predictions on the remaining children and
which was robust to the subset of children selected. Reproducibility was confirmed by
computing confusion matrices between different replications of the clustering process
(see detailed description and confusion matrices in Online supplement).
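A confusion matrix between two clustering runs simply counts how many children received each pair of labels. The sketch below shows one way to compute it; the function names and the example labels are hypothetical, and the agreement score (the largest entry in each row, summed) is a crude stand-in for a proper matching of class labels, which are arbitrary across runs.

```python
from collections import Counter


def confusion_matrix(labels_a, labels_b, n_classes):
    """matrix[i][j] = number of children in class i of run A and class j of run B."""
    counts = Counter(zip(labels_a, labels_b))
    return [[counts[(i, j)] for j in range(n_classes)] for i in range(n_classes)]


def agreement(matrix):
    """Fraction of children falling in the dominant cell of each row.

    Class indices are arbitrary in unsupervised clustering, so the
    dominant cell need not lie on the diagonal.
    """
    total = sum(sum(row) for row in matrix)
    return sum(max(row) for row in matrix) / total


# Hypothetical labels from two replications; classes 0 and 1 are swapped in run B
run_a = [0, 0, 1, 1, 2]
run_b = [1, 1, 0, 0, 2]
matrix = confusion_matrix(run_a, run_b, n_classes=3)
```

Here every row of `matrix` has a single non-zero cell, so the two runs agree perfectly up to a relabelling of the classes.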
Handling the missing data: Variables corresponding to missing data values were
included in the model but treated as unobserved. Distributions over these missing data
values were computed using VMP based on the available measurements.
Sensitization Class: This is a multinomial variable indicating to which sensitization class
each child belongs (out of between 2 and 5 classes). The model assumes that each
child belongs to one of these classes. We investigated a two-class and a five-class
model (see Online supplement). No assumptions were made about the nature of each
class. During inference, a distribution was computed for each child giving the probability
of their belonging in each class. For further analysis, we assumed the child belonged to
the highest probability class.
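The hard assignment described above amounts to taking the most probable class from each child's class distribution. The following sketch illustrates this with exact Bayes' rule on hypothetical numbers; in the study the distribution came from approximate inference (VMP), and the function names here are assumptions.

```python
def class_posterior(prior, likelihoods):
    """P(class | data) from class priors and per-class data likelihoods."""
    joint = [p * lik for p, lik in zip(prior, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]


def most_probable_class(posterior):
    """Index of the class with the highest posterior probability."""
    return max(range(len(posterior)), key=posterior.__getitem__)


# Hypothetical two-class example
posterior = class_posterior(prior=[0.5, 0.5], likelihoods=[0.1, 0.3])
assigned = most_probable_class(posterior)
```

Collapsing the distribution to a single label discards the uncertainty of children whose posterior is spread across classes, which is worth keeping in mind when interpreting downstream associations.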
We then investigated the association between the classes we had inferred in a
completely unsupervised manner and the clinical outcomes using appropriate statistical
methods (chi-squared test, logistic regression, Kaplan-Meier univariate estimates and
Cox regression multivariate estimates of survival/clinical status). Results are presented
as the main effect with 95% confidence intervals (CI).
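As an illustration of how an odds ratio and its 95% CI are obtained, the sketch below computes an unadjusted odds ratio from a 2x2 table with a Wald (Woolf) interval on the log scale. This is a simplification: the estimates reported in this article come from regression models with covariate adjustment, and the counts used here are hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    All four cells must be non-zero.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, not taken from the study
estimate, lower, upper = odds_ratio_ci(20, 10, 10, 20)
```

Small cell counts widen the interval sharply, which is why odds ratios for rare outcomes in small classes tend to have wide confidence intervals.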
Of the 1186 participants with any evaluable data, 133 who were randomized into the
primary prevention study(24, 25) were excluded from the analysis of the association
between clinical outcomes and inferred sensitization class. All remaining children with
available clinical outcomes were included at each time point (Table E2). There was no
difference in parental history of allergic disease between children with or without missing
data on clinical outcomes (data available on request).
At age 8 years, 18% (163/905) of children had current wheeze; 13.7% (124/905) were
persistent wheezers and 8.1% (45/555) had asthma (symptomatic AHR). Data collected
from primary care records revealed that 16.7% (136/814) of children had been admitted
to hospital with wheezing/asthma on at least one occasion during the first eight years of
life. Using conventional definitions, of 827 children who had either SPT or sIgE
measured at age 8 years, 322 (38.9%) were considered atopic; 1029 children had at
least one assessment of atopic status throughout the duration of the follow-up, of whom
441 (42.9%) were considered to be atopic ever.
The structure of the classes was inferred in a completely unsupervised manner using all
data (SPT and sIgE) from all four time points with missing data inferred using Variational
Message Passing(23) (i.e. we did not assume beforehand how the children would be
clustered; the unsupervised learning algorithm automatically discovered the latent
structure) under the assumption that data were missing at random.
We present the results with the sensitization state being considered to have two classes
(best reflecting a conventional assignment to atopy/no atopy), and five classes (which
better captured the underlying structure of the data).
Two-class model: The children were assigned as having either a Latent atopic
vulnerability (280/1053, 26.6%) or No latent atopic vulnerability (773/1053, 73.4%)
(Figure E1); 161 of 440 children (36.6%) who were sensitized on at least one occasion
were classified as not vulnerable. Compared with conventional definitions, there was
complete agreement in 86.0% (Atopy age 8 years) and 84.0% of cases (Atopy ever).
Five-class model: This model indicated a more complex latent structure incorporating
time-varying probabilities of the gain and the loss of sensitization (Figure E2). The
children with latent atopic vulnerability were clustered into four distinct sensitization
classes, which, based on our interpretation of the characteristics of each class, we
assigned as the following:
(1) Non-dust Mite Atopic Vulnerability (100/1053, 9.5%)
(2) Dust Mite Atopic Vulnerability (47/1053, 4.5%)
(3) Multiple Late Atopic Vulnerability (171/1053, 16.2%)
(4) Multiple Early Atopic Vulnerability (112/1053, 10.6%)
The final class comprised children with No Latent Vulnerability (623/1053, 59.2%). In
this model, 61/440 (13.9%) children who were atopic ever were classified as having No
Latent Vulnerability; amongst 322 children who were atopic at age 8, 36 (11.2%) were
classified as having No Latent Vulnerability. All but one child in the Multiple Early class
were atopic at age 8 years using the conventional definition, but the Multiple Early class
comprised only 28.0% of those atopic at age 8 years (Table E3).
To determine the appropriate number of classes, differing numbers of clusters were
tested as to their ability to predict the sensitization state of children where that state was
artificially made missing. This imputation process suggested that between 3 and 5 clusters
were justified (Figure E3 in the Online supplement) and so a 5-class model was selected
since it exposed the most information about the structure of the data set. The choice of 5
classes was also validated when considering the confusion matrices found when the
clustering process was replicated (see Tables E4 and E5 in the Online supplement). For
the 5-class case, there was very little confusion between different clusterings, indicating
that the 5-class clustering is robust. For example, for the Multiple-Early class 111 of the
112 children assigned to this class in the reference clustering were repeatedly assigned
to the same class in other 5-class clusterings.
Sensitization class and clinical outcomes
We went on to ascertain relationships between atopy defined conventionally (atopic ever,
atopic at age 8 years), the novel latent classes (two-class and five-class models) and
clinical phenotypes associated with asthma (current wheeze, persistent wheeze,
symptomatic AHR, hospital admission with asthma/wheeze), adjusting for gender. The
results are presented in Figures 2 and E4 and Table E6. The relationships with clinical
outcomes for ever atopic, atopic at age 8 years and the two-class model were not
materially different. However, for the five-class model, it was apparent that there were
marked differences between the four classes of atopic vulnerability, in that the
associations with clinical outcomes were considerably stronger for Multiple Early
compared to other classes, the two-class model and conventionally defined atopy (e.g.
for symptomatic AHR, odds ratio [95% CI]: 29.3 [11.1-77.2] vs. 12.4 [4.8-32.2] vs. 11.6
[4.8-27.9] vs. 9.2 [4.5-18.9] for Multiple Early class vs. Ever Atopic vs. Atopic age 8 vs.
Latent Atopic vulnerability-two-class model; Table E6). There was a very strong
association between Multiple Early class and persistent wheezing (12.9 [6.8-24.4]).
These findings indicated that IgE antibody responses do not reflect a single phenotype of
atopy, but several atopic vulnerabilities which differ in their relationship with asthma. To
further test this, we proceeded to investigate the relationship between markers of asthma
severity (objective measures of lung function and airway reactivity, hospital admissions)
within the five-class model.
Lung function, airway reactivity and hospital admissions in the five-class model
In the univariate analysis we found a significant association between sGaw at age 3 and 5
years, FEV1, FEV1/FVC ratio and DRR at age 8 years and the five-class latent variable
(Table E7). A multiple comparison test (Tukey) revealed that all measures of lung
function and airway reactivity were significantly poorer amongst children in the
Multiple Early class compared to those with No Latent Atopic Vulnerability, with little
difference between the other three classes and the No Latent Vulnerability class (Table
E7, Figures E5-E9).
In the multiple ANOVA models adjusted for gender, maternal smoking and wheezing
(sGaw, FEV1/FVC ratio and DRR) and gender, wheezing, maternal smoking and height
(FEV1), children in the Multiple Early class had significantly poorer lung function
compared to those in the No Latent Vulnerability class (sGaw age 3, p=0.02; sGaw age 5,
p=0.01; age 8 FEV1, FEV1/FVC ratio and DRR: p<0.001; Table 1). There were no
significant differences in lung function between the other three classes and the No Latent
Vulnerability class, apart from airway reactivity (DRR) being significantly higher in the
Multiple Late class (p=0.05, Table 1).
Kaplan-Meier plots demonstrating the age of the first hospital admission with
wheeze/asthma in relation to the five-class model are presented in Figure 3A. The
results of a Cox regression that included the five classes, gender and maternal smoking
indicated a highly significant association between the risk of hospital admission and the
five-class model (P<0.0001), with the risk of hospital admission increased amongst children in
the Multiple Early class (hazard ratio (HR) [95% CI], 5.1 [2.8-9.3], p<0.001), Dust Mite
class (3.4 [1.4-8.2.7], p=0.004) and Non-dust Mite class (2.5 [1.2-5.3], p=0.02), but not those in
the Multiple Late class (1.3 [0.6-2.9], p=0.4). In order to remove the effect of hospital
admission for wheeze caused only by early-life virus infections, we have reanalyzed the
data on the time to the first hospital admission with wheeze/asthma amongst children
who had a hospital admission after age 3 years (Kaplan-Meier plot, Figure 3B). Cox
regression demonstrated a highly significant increase in risk only amongst children in the
Multiple Early class (HR 9.2 [3.5-24.0], p<0.001).
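For readers unfamiliar with the Kaplan-Meier plots referred to above, the sketch below shows the product-limit estimator in its simplest form. It is an illustration only: the function name and the follow-up data are hypothetical, and it does not reproduce the Cox regression used for the hazard ratios.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the probability of remaining admission-free.

    times  : follow-up time for each child (e.g. age in years)
    events : True if a first admission was observed at that time,
             False if the child was censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all children sharing the same follow-up time
        while i < len(data) and data[i][0] == t:
            n_events += 1 if data[i][1] else 0
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1.0 - n_events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= n_leaving
    return curve


# Hypothetical follow-up data for four children
curve = kaplan_meier(times=[1, 2, 3, 4], events=[True, False, True, False])
```

Censored children leave the risk set without lowering the curve, so the second event in this example is divided by two children at risk rather than three.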
We have demonstrated that genuinely novel phenotypes of atopy can be revealed by
adopting a machine learning approach which takes full advantage of the data-intensive
environment provided by a birth cohort study. Machine learning techniques identified
latent structures within the data which may accurately reflect “unbiased” phenotypes of
atopy and avoid constraints of investigator-imposed classifications. Our results suggest
that IgE antibody responses do not reflect a mere presence or absence of atopy, but
instead multiple atopic vulnerability classes. The validity of these classes was tested by
examining their relations to the presence and severity of asthma and measures of lung
function, which demonstrated that different atopic vulnerabilities (i.e. different phenotypes
of atopy) differ markedly in their relationship with asthma. It is not the presence or
absence of specific IgE antibodies, but the pattern of the response (age at development,
type and number of specific allergens involved) that has a fundamental effect on the
clinical expression of asthma. It is of note that less than a third of children who would
have been considered atopic at age 8 years using conventional diagnostic criteria were
in the class most strongly associated with asthma (Multiple Early), whereas there was
little appreciable increase in risk of asthma amongst those in the other classes. We
propose that positive specific IgE or positive skin prick tests do not equate to atopy, but
should be viewed as intermediate phenotypes of a true atopic vulnerability. This may be
analogous to asthma, where a collection of intermediate phenotypes can objectively be
measured (e.g. peak flow variability, airway hyper-reactivity, or an obstructive spirometric
pattern), but individually their presence does not equate to a diagnosis of asthma(26).
Strengths and limitations
We recognize that Bayesian learning applied to a longitudinal dataset is exploratory and
hypothesis generating, rather than confirmatory. However, the classes we identified
seem intuitively correct (i.e. have face validity), and we have demonstrated significant
relationships with asthma, lung function and airway reactivity (i.e. have content validity).
We acknowledge the computational complexity and intensity of this analysis. It is
important to emphasise that this is not a simple “black box” or “data-mining”
approach; the analysis is informed by and capitalizes on the wealth of knowledge which
already exists on the problem. Once determined, the classes may become clinical
outcome variables in their own right and can be used in further analyses. Such
dimensionality reduction reduces the need for repeated cross-sectional analysis, as often
seen in longitudinal datasets, and reduces the need for multiple testing.
A strength of our model is that it is generative, enabling missing measurements to be
handled meaningfully. A further strength of the study is that the prevalence of atopic
sensitization among the parents of the children in our cohort(27, 28) is similar to that of
young adults in the UK(29), suggesting that the cohort is representative of the general
population. However, it would be of great value and importance to explore similar
approaches in other large birth cohort studies. We recognize that the number of
relevant classes might be different to the five reported here, and further replications
would be desirable.
We acknowledge that our findings do not have an immediate impact on clinical practice.
However, we argue that our approach to data analysis will advance our understanding of
the etiology of asthma.
Interpretation of the study
The study of asthma at the population level to date has been predominantly hypothesis
driven, often focussing on ill-defined, over-simplified phenotypes, using reductionist
approaches to causality. Whilst identifying some major independent determinants of
disease, this approach does not fully reflect the complexity of disease. Furthermore, it
fails to take full advantage of the richness of the available datasets collected in birth
cohort studies. We propose that one of the reasons for contradictory findings reported by
a number of genetic and environmental studies aiming to elucidate the mechanisms of
asthma is phenotypic heterogeneity and poor phenotype definition.
In epidemiological studies of allergic diseases investigators collect large volumes of
information, often at multiple time points. Data on sensitization collected over a time
series may be used to assign a phenotype based on distinctive patterns of results (e.g.
early, late or very late IgE sensitization(30), mono- or poly-sensitization(30), remission or
persistence(30), declining, flat or increasing pattern(31)). These categories are often
imposed by the investigators, and do not necessarily reflect the substructure within the
dataset. Ideally, one should aim to model all the data to identify a single multinomial
latent variable which best describes the structure of the data. By using a machine
learning approach, we have demonstrated that the diagnostic label of “atopy” encompasses
several different phenotypes which may have different etiologies.
Since these classes better reflect the presence and severity of disease, we propose that
further efforts be made to develop new diagnostic tests that will allow clinicians to better
differentiate between the true atopic classes than the currently available tests. Current
reagents for skin testing and specific IgE measurement are based on whole extracts
containing multiple proteins, many of which are recognized by IgE antibodies(32) (e.g.,
for dust mite D pteronyssinus there are >20 recognized allergens(33)). We speculate
that response to different individual proteins within an allergen may be associated with
different atopic classes (and consequently different clinical phenotypes). Utilization of this
component-based approach may offer novel diagnostic possibilities and improve the
value of allergy diagnosis, allowing practicing physicians more accurate diagnosis based
on a single measurement at the time of presentation.
We have previously extended the observation that sensitization to inhalant allergens is a
risk factor for wheezing by demonstrating that the level of specific IgE antibodies offers