All content in this area was uploaded by Thomas W Kelsey on Jun 27, 2016
Data Science
Each branch of Science has a quantity or unit that underpins
research activity
Physicists investigate energy
Chemists investigate molecules
Biologists investigate organisms
Economists investigate money
...
Computer Scientists investigate data
This talk is about making predictions from data
Tom Kelsey ISS-2016 2016-06-27 2 / 23
Motivating example: Titanic
On April 15, 1912, during her maiden voyage, the Titanic sank
after colliding with an iceberg, killing 1502 out of 2224
passengers and crew. We have real data about the passengers:
Type of passenger
Ticket class, ticket price, cabin, port of embarkation, ...
The families of passengers
Age, gender, number of siblings/spouses aboard, number of
parents/children aboard, ...
Did they survive?
Titanic Kaggle Competition
Given some of the data, can we accurately predict survival for
passengers we don’t know about?
https://www.kaggle.com/c/titanic
Kaggle provide data that can be used to derive predictive
models
some entries are missing
Kaggle provide data for testing candidate models
Kaggle keep back some data for assessing model
performance
Model quality
All models are wrong but some models are useful.
George E. P. Box, "Robustness in the Strategy of
Scientific Model Building" (May 1979)
No expectation of the same model arising from the same
data by two researchers...
...or the same researcher doing it twice
This isn’t a problem
We get to use models, and we can then easily classify them:
poor predictive performance – terrible model, scrap it
reasonable predictive performance – OK model, use it for
now but maybe think about looking for improvements
good predictive performance – good model, keep it
We now derive a simple predictive model
Example: binary class prediction
Suppose we have a new observation we wish to predict for,
which is a medium-income student-youth with fair credit rating.
x = [Age = Youth, Income = Medium, Student = Yes, Credit = Fair]
We use probabilities: for example P(Friday) = 1/7
We also use conditional probabilities:
P(Friday|TomorrowIsSaturday) = 1
Seek the class C_i that makes P(x | C_i) P(C_i) as big as possible
The P(C_i) are straightforward: P(Yes) = 9/14 and P(No) = 5/14
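The reason for maximising P(x | C_i) P(C_i) is left implicit on the slide. As a short sketch of the missing step: the quantity we actually care about is the probability of the class given the observation, and Bayes' theorem relates it to the product above. The denominator P(x) is the same for every class, so it can be dropped when we only want to compare classes.

```latex
P(C_i \mid x) = \frac{P(x \mid C_i)\,P(C_i)}{P(x)}
\quad\Longrightarrow\quad
\arg\max_i P(C_i \mid x) = \arg\max_i P(x \mid C_i)\,P(C_i)
```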
We now calculate P(x_j | C_i) for j = 1, ..., 4 and i = 1, 2:
1. P(Age = Youth | Yes) = 2/9, P(Age = Youth | No) = 3/5.
2. P(Income = Medium | Yes) = 4/9, P(Income = Medium | No) = 2/5.
3. P(Student = Yes | Yes) = 6/9, P(Student = Yes | No) = 1/5.
4. P(Credit = Fair | Yes) = 6/9, P(Credit = Fair | No) = 2/5.
Assuming conditional independence:
P(x | Yes) = (2 × 4 × 6 × 6) / (9 × 9 × 9 × 9) ≈ 0.044
P(x | No) = (3 × 2 × 1 × 2) / (5 × 5 × 5 × 5) ≈ 0.019
Ultimately giving us:
P(x | Yes) P(Yes) = 9/14 × 0.044 ≈ 0.028
P(x | No) P(No) = 5/14 × 0.019 ≈ 0.007
So choose the class C to be Buy = Yes as our prediction: it is
the more likely class given x. Spending on advertising to this type of
customer is therefore likely to lead to higher profits.
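The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch, not the code used for the talk: the priors and conditional probabilities are copied from the slides, while the names `priors`, `likelihoods` and `naive_bayes_scores` are illustrative choices. Exact fractions are used so that rounding only happens when printing.

```python
from fractions import Fraction
from math import prod

# Class priors P(C_i), as given in the example.
priors = {"Yes": Fraction(9, 14), "No": Fraction(5, 14)}

# Conditional probabilities P(x_j | C_i) for the observation x,
# in the order Age, Income, Student, Credit (values from the example).
likelihoods = {
    "Yes": [Fraction(2, 9), Fraction(4, 9), Fraction(6, 9), Fraction(6, 9)],
    "No": [Fraction(3, 5), Fraction(2, 5), Fraction(1, 5), Fraction(2, 5)],
}


def naive_bayes_scores(priors, likelihoods):
    """Return P(x | C) * P(C) per class, assuming conditional independence."""
    return {c: priors[c] * prod(likelihoods[c]) for c in priors}


scores = naive_bayes_scores(priors, likelihoods)
prediction = max(scores, key=scores.get)  # class with the largest score

for c, s in scores.items():
    print(f"P(x|{c}) P({c}) = {float(s):.3f}")
print("Prediction:", prediction)
```

The two scores do not sum to 1 because the common factor P(x) has been left out; only their relative size matters for the prediction.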
Example: predict gender from given name
x1 is a binary variable: Y for European, N for not
x2 is number of vowels: 2 for two or fewer, 3 for three or
more
x3 is length of given name: S for 4 or fewer, M for 5 through
7, L for 8 or more
x4 is a binary variable: Y if the given name ends in a vowel,
N otherwise
y is our gender variable: M or F (for this lecture)
If we collect some data, what is our prediction based on your
name?
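The encoding of a given name into x1, ..., x4 could be sketched as below. The function name `encode_name` and the example names are hypothetical, and two details are assumptions the slides do not settle: only a, e, i, o, u count as vowels (not y), and x1 has to be supplied by hand because it cannot be computed from the spelling alone.

```python
VOWELS = set("aeiou")  # assumption: 'y' is not counted as a vowel


def encode_name(given_name: str, european: bool) -> dict:
    """Encode a given name as the features x1..x4 described above."""
    name = given_name.strip().lower()
    n_vowels = sum(ch in VOWELS for ch in name)
    length = len(name)

    # x3: S for 4 or fewer letters, M for 5 through 7, L for 8 or more.
    if length <= 4:
        size = "S"
    elif length <= 7:
        size = "M"
    else:
        size = "L"

    return {
        "x1": "Y" if european else "N",           # supplied, not computed
        "x2": 2 if n_vowels <= 2 else 3,          # 2 = two or fewer vowels
        "x3": size,
        "x4": "Y" if name[-1:] in VOWELS else "N",  # ends in a vowel?
    }


print(encode_name("Ada", european=True))
print(encode_name("Alexander", european=True))
```

Once names are encoded this way, each row has four categorical features and a class label y, which is the same shape as the binary class prediction example.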
Predictive Models
The key thing is that you can’t cheat!
We get a candidate model from known data
We optimise and select models on known data
Our final model is used on unseen data
Performance will be good, bad or indifferent
The Kaggle website has many competitions – the winners
are the best models for predicting based on unseen data
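One common way to avoid cheating is to set part of the known data aside before any modelling starts. The sketch below is an illustration under stated assumptions rather than a prescribed procedure: the function name `three_way_split`, the 60/20/20 proportions and the fixed seed are all arbitrary choices. The training part is used to build candidate models, the validation part to optimise and select among them, and the test part plays the role of unseen data for the final model.

```python
import random


def three_way_split(rows, valid_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle rows and split them into train, validation and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed so the split is repeatable

    n_test = int(len(rows) * test_frac)
    n_valid = int(len(rows) * valid_frac)

    test = rows[:n_test]                     # held back, used once at the end
    valid = rows[n_test:n_test + n_valid]    # used to compare candidate models
    train = rows[n_test + n_valid:]          # used to fit candidate models
    return train, valid, test


train, valid, test = three_way_split(range(100))
print(len(train), len(valid), len(test))
```

The important property is that the three parts never overlap, so performance on the test part is not flattered by data the model has already seen.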
Summary
I do predictive modelling using medical data – see the references
at the end of this presentation
Many other important areas of Data Science
Networks
Visualisation
Algorithms
Security & privacy (note that no first names were given
explicitly in the example)
...
Thanks for listening.
References
1. A Validated Normative Model for Human Uterine Volume from Birth to Age 40
Years. Kelsey TW, Ginbey E, et al. PLoS One 11(6):e0157375; 2016
2. A Normative Model of Serum Inhibin B in Young Males. Kelsey TW, Miles A, et
al. PLoS One 11(4):e0153843; 2016
3. Accuracy of circulating adiponectin for predicting gestational diabetes: a
systematic review and meta-analysis. Iliodromiti S, Sassarini J, et al. Diabetologia
59(4):692-9; 2016
4. An externally validated age-related model of mean follicle density in the cortex
of the human ovary. McLaughlin M, Kelsey TW, et al. J Assist Reprod Genet
32(7):1089-95; 2015
5. The Relationship Between Variation in Size of the Primordial Follicle Pool and
Age at Natural Menopause. Depmann M, Faddy MJ, et al. J Clin Endocrinol
Metab 100(6):E845-51; 2015
6. A validated age-related normative model for male total testosterone shows
increasing variance but no decline after age 40 years. Kelsey TW, Li LQ, et al.
PLoS One 9(10):e109346; 2014
7. Ovarian volume throughout life: a validated normative model. Kelsey TW,
Dodwell SK, et al. PLoS One 8(9):e71465; 2013