
Predictive modelling & data science

Abstract

International Summer School lecture
International Summer School
Data Science
Tom Kelsey
School of Computer Science
University of St Andrews
https://tom.host.cs.st-andrews.ac.uk/
twk@cst-andrews.ac.uk
Tom Kelsey ISS-2016 2016-06-27 1 / 23
Data Science
Each branch of Science has a quantity or unit that underpins
research activity
Physicists investigate energy
Chemists investigate molecules
Biologists investigate organisms
Economists investigate money
...
Computer Scientists investigate data
This talk is about making predictions from data
Motivating example: Titanic
On April 15, 1912, during her maiden voyage, the Titanic sank
after colliding with an iceberg, killing 1502 out of 2224
passengers and crew. Real data about the passengers
Type of passenger
Ticket class, ticket price, cabin, port of embarkation, ...
The families of passengers
Age, gender, number of siblings/spouses aboard, number of
parents/children aboard, ...
Did they survive?
Titanic - class of ticket
Titanic - gender of passenger
Titanic - importance of predictor variables
Titanic Kaggle Competition
Given some of the data, can we accurately predict survival for
passengers we don’t know about?
https://www.kaggle.com/c/titanic
Kaggle provide data that can be used to derive predictive
models
some entries are missing
Kaggle provide data for testing candidate models
Kaggle keep back some data for assessing model
performance
Titanic - a decision tree model
Why is any of this important?
We use predictive models to make decisions
Buy/sell, offer/decline credit, raise/lower prices, focus on
"good" customers, ...
If my predictive model is better than yours, I will be more
successful
In business, science, medicine, ...
Data and energy firms can have similar worth
Better predictions mean more success
Model quality
All models are wrong but some models are useful.
George E. P. Box, "Robustness in the Strategy of Scientific Model Building" (May 1979)
No expectation of the same model arising from the same
data by two researchers...
...or the same researcher doing it twice
This isn’t a problem
We get to use models, and we can then easily classify them:
poor predictive performance – terrible model, scrap it
reasonable predictive performance – OK model, use it for
now but maybe think about looking for improvements
good predictive performance – good model, keep it
We now derive a simple predictive model
Example: binary class prediction
Suppose we have a new observation we wish to predict for: a
young student with a medium income and a fair credit rating.
x = [Age = Youth, Income = Medium, Student = Yes, Credit = Fair]
We use probabilities: for example P(Friday) = 1/7
We also use conditional probabilities:
P(Friday|TomorrowIsSaturday) = 1
Seek the class $C_i$ that makes $P(x \mid C_i)\,P(C_i)$ as big as possible
The $P(C_i)$ are straightforward: $P(\text{Yes}) = 9/14$ and $P(\text{No}) = 5/14$
We now calculate $P(x_j \mid C_i)$ for $j = 1, \dots, 4$ and $i = 1, 2$
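The reason for maximising $P(x \mid C_i)\,P(C_i)$ can be written out with Bayes' rule. This is the standard textbook step, added here as a reminder rather than taken from the slides; $\hat{C}$ is a symbol introduced only for this note:

```latex
P(C_i \mid x) = \frac{P(x \mid C_i)\,P(C_i)}{P(x)}
\quad\Longrightarrow\quad
\hat{C} = \arg\max_{i} P(C_i \mid x) = \arg\max_{i} P(x \mid C_i)\,P(C_i)
```

The denominator $P(x)$ is the same for every class, so it can be dropped when the classes are compared.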
Example: binary class prediction
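We now calculate $P(x_j \mid C_i)$ for $j = 1, \dots, 4$ and $i = 1, 2$
1. $P(\text{Age} = \text{Youth} \mid \text{Yes}) = \frac{2}{9}$, $P(\text{Age} = \text{Youth} \mid \text{No}) = \frac{3}{5}$.
2. $P(\text{Income} = \text{Med} \mid \text{Yes}) = \frac{4}{9}$, $P(\text{Income} = \text{Med} \mid \text{No}) = \frac{2}{5}$.
3. $P(\text{Student} = \text{Yes} \mid \text{Yes}) = \frac{6}{9}$, $P(\text{Student} = \text{Yes} \mid \text{No}) = \frac{1}{5}$.
4. $P(\text{Credit} = \text{Fair} \mid \text{Yes}) = \frac{6}{9}$, $P(\text{Credit} = \text{Fair} \mid \text{No}) = \frac{2}{5}$.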
Example: binary class prediction
Assuming conditional independence:
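$P(x \mid \text{Yes}) = \frac{2 \times 4 \times 6 \times 6}{9 \times 9 \times 9 \times 9} = 0.044$
$P(x \mid \text{No}) = \frac{3 \times 2 \times 1 \times 2}{5 \times 5 \times 5 \times 5} = 0.019$
Ultimately giving us:
$P(x \mid \text{Yes})\,P(\text{Yes}) = \frac{9}{14} \times 0.044 \approx 0.028$
$P(x \mid \text{No})\,P(\text{No}) = \frac{5}{14} \times 0.019 \approx 0.007$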
So choose the class C to be Buy = Yes as our prediction: it is
more likely given x. So spending on advertising to this type of
customer is likely to lead to higher profits.
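The arithmetic above can be checked with a short script. This is a minimal sketch, not the lecture's own code: the names `priors`, `likelihoods` and `score` are illustrative, and the fractions are copied from the slides.

```python
from fractions import Fraction
from math import prod

# Class priors P(C_i), as given on the slides.
priors = {"Yes": Fraction(9, 14), "No": Fraction(5, 14)}

# Conditional probabilities P(x_j | C_i) for the observation
# x = [Age = Youth, Income = Medium, Student = Yes, Credit = Fair].
likelihoods = {
    "Yes": [Fraction(2, 9), Fraction(4, 9), Fraction(6, 9), Fraction(6, 9)],
    "No": [Fraction(3, 5), Fraction(2, 5), Fraction(1, 5), Fraction(2, 5)],
}


def score(label):
    """Return P(x | C) * P(C), assuming conditional independence."""
    return prod(likelihoods[label]) * priors[label]


scores = {label: score(label) for label in priors}
prediction = max(scores, key=scores.get)

for label, value in scores.items():
    print(f"P(x|{label})P({label}) = {float(value):.3f}")
print("Prediction: Buy =", prediction)
```

Exact fractions are used so that rounding happens only when the results are printed.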
Example: predict gender from given name
x1 is a binary variable: Y for European, N for not
x2 is number of vowels: 2 for two or fewer, 3 for three or
more
x3 is length of given name: S for 4 or fewer, M for 5 through
7, L for 8 or more
x4 is a binary variable: Y if the given name ends in a vowel,
N otherwise
y is our gender variable: M or F (for this lecture)
If we collect some data, what is our prediction based on your
name?
Worked example 2: Our name data set
ID    x1   x2   x3   x4   y
D1    N    2    S    N    M
D2    N    2    M    Y    M
D3    N    2    M    Y    M
D4    N    3    L    N    M
D5    Y    3    L    Y    F
D6    Y    3    M    N    F
D7    Y    3    L    Y    F
D8    N    2    L    Y    M
D9    N    2    S    N    F
D10   N    3    L    N    F
D11   Y    2    S    N    M
Get x-values for your name, then do the calculations
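The calculation can also be sketched in code. The rows below are copied from the table; the function name `naive_bayes_scores` and the query `x` are hypothetical choices for illustration, not part of the lecture. Probabilities are raw frequency counts with no smoothing, so a feature value never seen with a class gives that class a score of zero.

```python
from collections import Counter

# Rows from the name data set: (x1, x2, x3, x4, y).
data = [
    ("N", 2, "S", "N", "M"),  # D1
    ("N", 2, "M", "Y", "M"),  # D2
    ("N", 2, "M", "Y", "M"),  # D3
    ("N", 3, "L", "N", "M"),  # D4
    ("Y", 3, "L", "Y", "F"),  # D5
    ("Y", 3, "M", "N", "F"),  # D6
    ("Y", 3, "L", "Y", "F"),  # D7
    ("N", 2, "L", "Y", "M"),  # D8
    ("N", 2, "S", "N", "F"),  # D9
    ("N", 3, "L", "N", "F"),  # D10
    ("Y", 2, "S", "N", "M"),  # D11
]


def naive_bayes_scores(rows, x):
    """Return P(x | c) * P(c) for each class c, using raw frequency counts."""
    class_counts = Counter(row[-1] for row in rows)
    total = len(rows)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total  # prior P(c)
        for j, value in enumerate(x):
            matches = sum(
                1 for row in rows if row[-1] == label and row[j] == value
            )
            score *= matches / n_label  # P(x_j | c)
        scores[label] = score
    return scores


# Hypothetical x-values: not European, two or fewer vowels,
# medium length, ends in a vowel.
x = ("N", 2, "M", "Y")
scores = naive_bayes_scores(data, x)
prediction = max(scores, key=scores.get)
print(scores, "->", prediction)
```

Replace `x` with the values for your own name to repeat the exercise.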
Confusion Matrix
                       Prediction outcome
                       f              m             total
actual value   f'      True Female    False Male    F'
               m'      False Female   True Male     M'
               total   F              M
Confusion Matrix for a good predictive model
                       Prediction outcome
                       f      m      total
actual value   f'      78     4      F'
               m'      3      83     M'
               total   F      M
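One way to put a number on "good" is to summarise the matrix. The sketch below uses the four counts from the matrix above; the choice of accuracy and per-class recall as summary measures is an assumption for illustration, since the slides do not name a metric.

```python
# Counts from the confusion matrix: keys are (actual, predicted).
confusion = {
    ("f", "f"): 78,  # true female
    ("f", "m"): 4,   # false male
    ("m", "f"): 3,   # false female
    ("m", "m"): 83,  # true male
}

total = sum(confusion.values())
correct = confusion[("f", "f")] + confusion[("m", "m")]
accuracy = correct / total


def recall(label):
    """Fraction of actual `label` cases that were predicted as `label`."""
    actual_total = sum(
        n for (actual, _), n in confusion.items() if actual == label
    )
    return confusion[(label, label)] / actual_total


print(f"accuracy   = {accuracy:.3f}")
print(f"recall (f) = {recall('f'):.3f}")
print(f"recall (m) = {recall('m'):.3f}")
```

Here 161 of the 168 cases lie on the diagonal, so the model is right about 96% of the time.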
Predictive Models
The key thing is that you can’t cheat!
We get a candidate model from known data
We optimise and select models on known data
Our final model is used on unseen data
Performance will be good, bad or indifferent
The Kaggle website has many competitions – the winners
are the best models for predicting based on unseen data
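The "no cheating" workflow can be sketched as a hold-out split. Everything here is a toy assumption: the labelled rows are invented for illustration, the "model" just predicts the majority class, and the helper names are hypothetical. The point is only that the model is derived from the known rows and scored on rows it has never seen.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and hold back a fraction as unseen test data."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_majority_class(train_rows):
    """Toy 'model': always predict the most common label seen in training."""
    labels = [label for _, label in train_rows]
    return max(set(labels), key=labels.count)


def accuracy(predicted_label, rows):
    """Fraction of rows whose true label equals the constant prediction."""
    return sum(1 for _, label in rows if label == predicted_label) / len(rows)


# Hypothetical labelled data: (features, label) pairs invented for this sketch.
rows = [((i,), "Yes" if i % 3 else "No") for i in range(20)]

train, test = train_test_split(rows)
model = fit_majority_class(train)  # derived from known data only
print("held-out accuracy:", accuracy(model, test))  # measured on unseen data
```

Kaggle applies the same idea at a larger scale by keeping back data that competitors never see.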
Summary
I do predictive modelling using medical data – see the references
at the end of this presentation
Many other important areas of Data Science
Networks
Visualisation
Algorithms
Security & privacy (note that no first names were given
explicitly in the example)
...
Thanks for listening.
References
1. A Validated Normative Model for Human Uterine Volume from Birth to Age 40 Years. Kelsey TW, Ginbey E, et al. PLoS One 11(6):e0157375; 2016
2. A Normative Model of Serum Inhibin B in Young Males. Kelsey TW, Miles A, et al. PLoS One 11(4):e0153843; 2016
3. Accuracy of circulating adiponectin for predicting gestational diabetes: a systematic review and meta-analysis. Iliodromiti S, Sassarini J, et al. Diabetologia 59(4):692-9; 2016
4. An externally validated age-related model of mean follicle density in the cortex of the human ovary. McLaughlin M, Kelsey TW, et al. J Assist Reprod Genet 32(7):1089-95; 2015
5. The Relationship Between Variation in Size of the Primordial Follicle Pool and Age at Natural Menopause. Depmann M, Faddy MJ, et al. J Clin Endocrinol Metab 100(6):E845-51; 2015
6. A validated age-related normative model for male total testosterone shows increasing variance but no decline after age 40 years. Kelsey TW, Li LQ, et al. PLoS One 9(10):e109346; 2014
7. Ovarian volume throughout life: a validated normative model. Kelsey TW, Dodwell SK, et al. PLoS One 8(9):e71465; 2013