Data Science for Business

  • Silicon Valley Data Science
“A must-read resource for anyone who is serious
about embracing the opportunity of big data.”
Craig Vaughan
Global Vice President at SAP
“This timely book says out loud what has finally become apparent: in the modern world,
Data is Business, and you can no longer think business without thinking data. Read this
book and you will understand the Science behind thinking data.”
Ron Bekkerman
Chief Data Officer at Carmel Ventures
“A great book for business managers who lead or interact with data scientists, who wish to
better understand the principles and algorithms available without the technical details of
single-disciplinary books.”
Ronny Kohavi
Partner Architect at Microsoft Online Services Division
“Provost and Fawcett have distilled their mastery of both the art and science of real-world
data analysis into an unrivalled introduction to the field.”
Geoff Webb
Editor-in-Chief of Data Mining and Knowledge Discovery Journal
“I would love it if everyone I had to work with had read this book.”
Claudia Perlich
Chief Scientist of Dstillery and Advertising Research
Foundation Innovation Award Grand Winner (2013)
“A foundational piece in the fast developing world of Data Science.
A must read for anyone interested in the Big Data revolution.”
Justin Gapper
Business Unit Analytics Manager
at Teledyne Scientific and Imaging
“The authors, both renowned experts in data science before it had a name, have taken a
complex topic and made it accessible to all levels, but mostly helpful to the budding data
scientist. As far as I know, this is the first book of its kind—with a focus on data science
concepts as applied to practical business problems. It is liberally sprinkled with
compelling real-world examples outlining familiar, accessible problems in the business
world: customer churn, targeted marketing, even whiskey analytics!
The book is unique in that it does not give a cookbook of algorithms, rather it helps the
reader understand the underlying concepts behind data science, and most importantly
how to approach and be successful at problem solving. Whether you are looking for a
good comprehensive overview of data science or are a budding data scientist in need of
the basics, this is a must-read.”
Chris Volinsky
Director of Statistics Research at AT&T Labs and Winning
Team Member for the $1 Million Netflix Challenge
“This book goes beyond data analytics 101. It’s the essential guide for those of us (all of
us?) whose businesses are built on the ubiquity of data opportunities and the new
mandate for data-driven decision-making.”
Tom Phillips
CEO of Dstillery and Former Head of
Google Search and Analytics
“Intelligent use of data has become a force powering business to new levels of
competitiveness. To thrive in this data-driven ecosystem, engineers, analysts, and
managers alike must understand the options, design choices, and tradeoffs before them.
With motivating examples, clear exposition, and a breadth of details covering not only the
“hows” but the “whys,” Data Science for Business is the perfect primer for those wishing to
become involved in the development and application of data-driven systems.”
Josh Attenberg
Data Science Lead at Etsy
“Data is the foundation of new waves of productivity growth, innovation, and richer
customer insight. Only recently viewed broadly as a source of competitive advantage,
dealing well with data is rapidly becoming table stakes to stay in the game.
The authors’ deep applied experience makes this a must read—a window into your
competitor’s strategy.”
Alan Murray
Serial Entrepreneur; Partner at Coriolis Ventures; Co-Founder Neuehouse
“One of the best data mining books, which helped me think through various ideas on
liquidity analysis in the FX business. The examples are excellent and help you take a deep
dive into the subject! This one is going to be on my shelf for lifetime!”
Nidhi Kathuria
Vice President of FX at Royal Bank of Scotland
“An excellent and accessible primer to help businessfolk better appreciate the concepts,
tools and techniques employed by data scientists... and for data scientists to better
appreciate the business context in which their solutions are deployed.”
Joe McCarthy
Director of Analytics and Data Science at Atigeo, LLC
“In my opinion it is the best book on Data Science and Big Data for a professional
understanding by business analysts and managers who must apply these techniques in the
practical world.”
Ira Laefsky
MS Engineering (Computer Science)/MBA Information Technology and Human
Computer Interaction Researcher formerly on the Senior Consulting Staff
of Arthur D. Little, Inc. and Digital Equipment Corporation
“With motivating examples, clear exposition and a breadth of details covering not only
the “hows” but the “whys,” Data Science for Business is the perfect primer for those
wishing to become involved in the development and application of data driven systems.”
Ted O’Brien
Co-Founder / Director of Talent Acquisition at Starbridge
Partners and Publisher of the Data Science Report
Foster Provost and Tom Fawcett
Special Edition for Data Science for Business Analytics,
Stern School, NYU
Data Science for Business
by Foster Provost and Tom Fawcett
Copyright © 2013 Foster Provost and Tom Fawcett. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles. For more information, contact our corporate/institutional
sales department: 800-998-9938.
Editors: Mike Loukides and Meghan Blanchette
Production Editor: Christopher Hearse
Proofreader: Kiel Van Horn
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Mark Paglietti
Illustrator: Rebecca Demarest
July 2013: First Edition
Revision History for the First Edition
2013-07-25: First Release
2013-12-19: Second Release
yyyy-mm-dd: Third Release
See for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Science for Business, the cover
image, and related trade dress are trademarks of O’Reilly Media, Inc. Data Science for Business is a
trademark of Foster Provost and Tom Fawcett.
While the publisher and the authors have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
For our fathers.
Table of Contents
Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
1. Introduction: Data-Analytic Thinking. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
The Ubiquity of Data Opportunities 1
Example: Hurricane Frances 3
Example: Predicting Customer Churn 4
Data Science, Engineering, and Data-Driven Decision Making 5
Data Processing and “Big Data” 8
From Big Data 1.0 to Big Data 2.0 8
Data and Data Science Capability as a Strategic Asset 9
Data-Analytic Thinking 12
This Book 14
Data Mining and Data Science, Revisited 14
Chemistry Is Not About Test Tubes: Data Science Versus the Work of the Data
Scientist 16
Summary 17
2. Business Problems and Data Science Solutions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Fundamental concepts: A set of canonical data mining tasks; The data mining process;
Supervised versus unsupervised data mining.
From Business Problems to Data Mining Tasks 19
Supervised Versus Unsupervised Methods 24
Data Mining and Its Results 26
The Data Mining Process 27
Business Understanding 28
Data Understanding 28
Data Preparation 30
Modeling 31
Evaluation 31
Deployment 33
Implications for Managing the Data Science Team 34
Other Analytics Techniques and Technologies 35
Statistics 36
Database Querying 38
Data Warehousing 39
Regression Analysis 39
Machine Learning and Data Mining 40
Answering Business Questions with These Techniques 41
Summary 42
3. Introduction to Predictive Modeling: From Correlation to Supervised Segmentation. 43
Fundamental concepts: Identifying informative attributes; Segmenting data by
progressive attribute selection.
Exemplary techniques: Finding correlations; Attribute/variable selection; Tree induction.
Models, Induction, and Prediction 45
Supervised Segmentation 48
Selecting Informative Attributes 49
Example: Attribute Selection with Information Gain 56
Supervised Segmentation with Tree-Structured Models 62
Visualizing Segmentations 69
Trees as Sets of Rules 72
Probability Estimation 72
Example: Addressing the Churn Problem with Tree Induction 75
Summary 80
4. Fitting a Model to Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Fundamental concepts: Finding “optimal” model parameters based on data; Choosing
the goal for data mining; Objective functions; Loss functions.
Exemplary techniques: Linear regression; Logistic regression; Support-vector machines.
Classification via Mathematical Functions 85
Linear Discriminant Functions 87
Optimizing an Objective Function 90
An Example of Mining a Linear Discriminant from Data 91
Linear Discriminant Functions for Scoring and Ranking Instances 93
Support Vector Machines, Briefly 94
Regression via Mathematical Functions 97
Class Probability Estimation and Logistic “Regression” 99
* Logistic Regression: Some Technical Details 102
Example: Logistic Regression versus Tree Induction 105
Nonlinear Functions, Support Vector Machines, and Neural Networks 110
Summary 113
5. Overfitting and Its Avoidance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Fundamental concepts: Generalization; Fitting and overfitting; Complexity control.
Exemplary techniques: Cross-validation; Attribute selection; Tree pruning; Regularization.
Generalization 115
Overfitting 117
Overfitting Examined 117
Holdout Data and Fitting Graphs 117
Overfitting in Tree Induction 120
Overfitting in Mathematical Functions 122
Example: Overfitting Linear Functions 123
* Example: Why Is Overfitting Bad? 128
From Holdout Evaluation to Cross-Validation 130
The Churn Dataset Revisited 134
Learning Curves 135
Overfitting Avoidance and Complexity Control 138
Avoiding Overfitting with Tree Induction 138
A General Method for Avoiding Overfitting 139
* Avoiding Overfitting for Parameter Optimization 141
Summary 146
6. Similarity, Neighbors, and Clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Fundamental concepts: Calculating similarity of objects described by data; Using
similarity for prediction; Clustering as similarity-based segmentation.
Exemplary techniques: Searching for similar entities; Nearest neighbor methods;
Clustering methods; Distance metrics for calculating similarity.
Similarity and Distance 148
Nearest-Neighbor Reasoning 150
Example: Whiskey Analytics 151
Nearest Neighbors for Predictive Modeling 153
How Many Neighbors and How Much Influence? 156
Geometric Interpretation, Overfitting, and Complexity Control 158
Issues with Nearest-Neighbor Methods 161
Some Important Technical Details Relating to Similarities and Neighbors 164
Heterogeneous Attributes 164
* Other Distance Functions 165
* Combining Functions: Calculating Scores from Neighbors 168
Clustering 170
Example: Whiskey Analytics Revisited 171
Hierarchical Clustering 171
Nearest Neighbors Revisited: Clustering Around Centroids 177
Example: Clustering Business News Stories 182
Understanding the Results of Clustering 186
* Using Supervised Learning to Generate Cluster Descriptions 188
Stepping Back: Solving a Business Problem Versus Data Exploration 191
Summary 194
7. Decision Analytic Thinking I: What Is a Good Model?. . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Fundamental concepts: Careful consideration of what is desired from data science
results; Expected value as a key evaluation framework; Consideration of appropriate
comparative baselines.
Exemplary techniques: Various evaluation metrics; Estimating costs and benefits;
Calculating expected profit; Creating baseline methods for comparison.
Evaluating Classifiers 196
Plain Accuracy and Its Problems 197
The Confusion Matrix 197
Problems with Unbalanced Classes 198
Problems with Unequal Costs and Benefits 202
Generalizing Beyond Classification 202
A Key Analytical Framework: Expected Value 203
Using Expected Value to Frame Classifier Use 204
Using Expected Value to Frame Classifier Evaluation 206
Evaluation, Baseline Performance, and Implications for Investments in Data 214
Summary 217
8. Visualizing Model Performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
Fundamental concepts: Visualization of model performance under various kinds of
uncertainty; Further consideration of what is desired from data mining results.
Exemplary techniques: Profit curves; Cumulative response curves; Lift curves; ROC curves.
Ranking Instead of Classifying 219
Profit Curves 222
ROC Graphs and Curves 224
The Area Under the ROC Curve (AUC) 230
Cumulative Response and Lift Curves 230
Example: Performance Analytics for Churn Modeling 234
Summary 242
9. Evidence and Probabilities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
Fundamental concepts: Explicit evidence combination with Bayes’ Rule; Probabilistic
reasoning via assumptions of conditional independence.
Exemplary techniques: Naive Bayes classification; Evidence lift.
Example: Targeting Online Consumers With Advertisements 245
Combining Evidence Probabilistically 247
Joint Probability and Independence 248
Bayes’ Rule 249
Applying Bayes’ Rule to Data Science 251
Conditional Independence and Naive Bayes 253
Advantages and Disadvantages of Naive Bayes 255
A Model of Evidence “Lift” 257
Example: Evidence Lifts from Facebook “Likes” 258
Evidence in Action: Targeting Consumers with Ads 260
Summary 260
10. Representing and Mining Text. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
Fundamental concepts: The importance of constructing mining-friendly data
representations; Representation of text for data mining.
Exemplary techniques: Bag of words representation; TFIDF calculation; N-grams;
Stemming; Named entity extraction; Topic models.
Why Text Is Important 264
Why Text Is Difficult 264
Representation 265
Bag of Words 266
Term Frequency 266
Measuring Sparseness: Inverse Document Frequency 269
Combining Them: TFIDF 270
Example: Jazz Musicians 271
* The Relationship of IDF to Entropy 275
Beyond Bag of Words 277
N-gram Sequences 277
Named Entity Extraction 278
Topic Models 278
Example: Mining News Stories to Predict Stock Price Movement 280
The Task 280
The Data 282
Data Preprocessing 284
Results 285
Summary 289
11. Decision Analytic Thinking II: Toward Analytical Engineering. . . . . . . . . . . . . . . . . . . . 291
Fundamental concept: Solving business problems with data science starts with
analytical engineering: designing an analytical solution, based on the data, tools, and
techniques available.
Exemplary technique: Expected value as a framework for data science solution design.
Targeting the Best Prospects for a Charity Mailing 292
The Expected Value Framework: Decomposing the Business Problem and
Recomposing the Solution Pieces 292
A Brief Digression on Selection Bias 295
Our Churn Example Revisited with Even More Sophistication 295
The Expected Value Framework: Structuring a More Complicated Business
Problem 296
Assessing the Influence of the Incentive 297
From an Expected Value Decomposition to a Data Science Solution 299
Summary 302
12. Other Data Science Tasks and Techniques. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
Fundamental concepts: Our fundamental concepts as the basis of many common data
science techniques; The importance of familiarity with the building blocks of data
science.
Exemplary techniques: Association and co-occurrences; Behavior profiling; Link
prediction; Data reduction; Latent information mining; Movie recommendation; Bias-
variance decomposition of error; Ensembles of models; Causal reasoning from data.
Co-occurrences and Associations: Finding Items That Go Together 304
Measuring Surprise: Lift and Leverage 305
Example: Beer and Lottery Tickets 306
Associations Among Facebook Likes 307
Profiling: Finding Typical Behavior 310
Link Prediction and Social Recommendation 315
Data Reduction, Latent Information, and Movie Recommendation 316
Bias, Variance, and Ensemble Methods 320
Data-Driven Causal Explanation and a Viral Marketing Example 323
Summary 324
13. Data Science and Business Strategy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
Fundamental concepts: Our principles as the basis of success for a data-driven
business; Acquiring and sustaining competitive advantage via data science; The
importance of careful curation of data science capability.
Thinking Data-Analytically, Redux 327
Achieving Competitive Advantage with Data Science 329
Sustaining Competitive Advantage with Data Science 330
Formidable Historical Advantage 331
Unique Intellectual Property 332
Unique Intangible Collateral Assets 332
Superior Data Scientists 332
Superior Data Science Management 334
Attracting and Nurturing Data Scientists and Their Teams 335
Examine Data Science Case Studies 337
Be Ready to Accept Creative Ideas from Any Source 338
Be Ready to Evaluate Proposals for Data Science Projects 339
Example Data Mining Proposal 339
Flaws in the Big Red Proposal 340
A Firm’s Data Science Maturity 342
14. Conclusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
The Fundamental Concepts of Data Science 345
Applying Our Fundamental Concepts to a New Problem: Mining Mobile
Device Data 348
Changing the Way We Think about Solutions to Business Problems 351
What Data Can’t Do: Humans in the Loop, Revisited 352
Privacy, Ethics, and Mining Data About Individuals 355
Is There More to Data Science? 356
Final Example: From Crowd-Sourcing to Cloud-Sourcing 357
Final Words 358
A. Proposal Review Guide. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
B. Another Sample Proposal. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
Glossary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
Bibliography. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
Preface
Data Science for Business is intended for several sorts of readers:
  • Business people who will be working with data scientists, managing data science–oriented
    projects, or investing in data science ventures,
  • Developers who will be implementing data science solutions, and
  • Aspiring data scientists.
This is not a book about algorithms, nor is it a replacement for a book about algorithms.
We deliberately avoided an algorithm-centered approach. We believe there is
a relatively small set of fundamental concepts or principles that underlie techniques
for extracting useful knowledge from data. These concepts serve as the foundation for
many well-known algorithms of data mining. Moreover, these concepts underlie the
analysis of data-centered business problems, the creation and evaluation of data science
solutions, and the evaluation of general data science strategies and proposals.
Accordingly, we organized the exposition around these general principles rather than
around specific algorithms. Where necessary to describe procedural details, we use a
combination of text and diagrams, which we think are more accessible than a listing
of detailed algorithmic steps.
The book does not presume a sophisticated mathematical background. However, by
its very nature the material is somewhat technical—the goal is to impart a significant
understanding of data science, not just to give a high-level overview. In general, we
have tried to minimize the mathematics and make the exposition as “conceptual” as
possible.
Colleagues in industry comment that the book is invaluable for helping to align the
understanding of the business, technical/development, and data science teams. That
observation is based on a small sample, so we are curious to see how general it truly is
(see Chapter 5!). Ideally, we envision a book that any data scientist would give to his
collaborators from the development or business teams, effectively saying: if you really
want to design/implement top-notch data science solutions to business problems, we
all need to have a common understanding of this material.
Colleagues also tell us that the book has been quite useful in an unforeseen way: for
preparing to interview data science job candidates. The demand from business for
hiring data scientists is strong and increasing. In response, more and more job seekers
are presenting themselves as data scientists. Every data science job candidate
should understand the fundamentals presented in this book. (Our industry colleagues
tell us that they are surprised how many do not. We have half-seriously discussed a
follow-up pamphlet “Cliff’s Notes to Interviewing for Data Science Jobs.”)
Our Conceptual Approach to Data Science
In this book we introduce a collection of the most important fundamental concepts of
data science. Some of these concepts are “headliners” for chapters, and others are
introduced more naturally through the discussions (and thus they are not necessarily
labeled as fundamental concepts). The concepts span the process from envisioning
the problem, to applying data science techniques, to deploying the results to improve
decision-making. The concepts also undergird a large array of business analytics
methods and techniques.
The concepts fit into three general types:
1. Concepts about how data science fits in the organization and the competitive
landscape, including ways to attract, structure, and nurture data science teams;
ways for thinking about how data science leads to competitive advantage; and
tactical concepts for doing well with data science projects.
2. General ways of thinking data-analytically. These help in identifying appropriate
data and considering appropriate methods. The concepts include the data mining
process as well as the collection of different high-level data mining tasks.
3. General concepts for actually extracting knowledge from data, which undergird
the vast array of data science tasks and their algorithms.
For example, one fundamental concept is that of determining the similarity of two
entities described by data. This ability forms the basis for various specific tasks. It
may be used directly to find customers similar to a given customer. It forms the core
of several prediction algorithms that estimate a target value such as the expected
resource usage of a client or the probability that a customer will respond to an offer. It is
also the basis for clustering techniques, which group entities by their shared features
without a focused objective. Similarity forms the basis of information retrieval, in
which documents or webpages relevant to a search query are retrieved. Finally, it
underlies several common algorithms for recommendation. A traditional algorithm-oriented
book might present each of these tasks in a different chapter, under different
names, with common aspects buried in algorithm details or mathematical propositions.
In this book we instead focus on the unifying concepts, presenting specific
tasks and algorithms as natural manifestations of them.
¹ Of course, each author has the distinct impression that he did the majority of the work on the book.
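To make the idea of similarity concrete, here is a minimal sketch in Python of finding the customers most similar to a given customer. It assumes each customer is described by a small numeric feature vector and uses Euclidean distance as the similarity measure; the customer names, the features, and the function names are all hypothetical and chosen only for illustration.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_similar(target, candidates, k=2):
    """Return the names of the k candidates closest to the target vector."""
    ranked = sorted(
        candidates,
        key=lambda name: euclidean_distance(target, candidates[name]),
    )
    return ranked[:k]

# Hypothetical customers: (age, monthly spend in hundreds, support calls)
customers = {
    "alice": (34, 2.0, 5),
    "bob": (36, 2.5, 4),
    "carol": (61, 9.0, 1),
    "dave": (23, 0.5, 12),
}

query = (35, 2.2, 5)
print(most_similar(query, customers, k=2))
```

The same distance function could serve as the core of a nearest-neighbor predictor or a clustering routine, which is the sense in which one concept underlies several tasks. In practice, attributes measured on very different scales usually need to be rescaled before distances are meaningful.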
As another example, in evaluating the utility of a pattern, we see a notion of lift (how
much more prevalent a pattern is than would be expected by chance) recurring
broadly across data science. It is used to evaluate very different sorts of patterns in
different contexts. Algorithms for targeting advertisements are evaluated by computing
the lift one gets for the targeted population. Lift is used to judge the weight of
evidence for or against a conclusion. Lift helps determine whether a co-occurrence
(an association) in data is interesting, as opposed to simply being a natural consequence
of popularity.
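One common way to quantify lift for a co-occurrence is to divide the observed rate at which two items appear together by the rate that would be expected if they were independent. The sketch below assumes that formulation; the counts are invented for illustration and the function name is our own.

```python
def lift(n_total, n_a, n_b, n_both):
    """Lift of A and B co-occurring: p(A, B) / (p(A) * p(B)).

    A value near 1 suggests the co-occurrence is about what chance would
    produce; values well above 1 suggest the pair occurs together more
    often than the popularity of each item alone would explain.
    """
    p_a = n_a / n_total
    p_b = n_b / n_total
    p_both = n_both / n_total
    return p_both / (p_a * p_b)

# Hypothetical counts: 1,000 baskets, 300 contain A, 400 contain B,
# and 200 contain both.
print(lift(1000, 300, 400, 200))

# If only 120 baskets contained both, the pair would co-occur at roughly
# the rate expected by chance.
print(lift(1000, 300, 400, 120))
```

With these made-up counts the first pair co-occurs about 1.67 times as often as chance would predict, while the second has a lift of about 1.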
We believe that explaining data science around such fundamental concepts not only
aids the reader, it also facilitates communication between business stakeholders and
data scientists. It provides a shared vocabulary and enables both parties to understand
each other better. The shared concepts lead to deeper discussions that may
uncover critical issues otherwise missed.
To the Instructor
This book has been used successfully as a textbook for a very wide variety of data science
and business analytics courses. Historically, the book arose from the development
of Foster’s multidisciplinary Data Science and Business Analytics classes at the
Stern School at NYU, starting in the fall of 2005.¹ The original class was nominally for
MBA students and MSIS students, but drew students from schools across the university.
The most interesting aspect of the class was not that it appealed to MBA and
MSIS students, for whom it was designed. More interestingly, students with strong
backgrounds in machine learning and other technical disciplines also found it
very valuable. Part of the reason seemed to be that the focus on fundamental
principles and other issues besides algorithms was missing from their curricula.
At NYU we now use the book in support of a variety of data science–related programs:
the original MBA and MSIS programs, undergraduate business analytics,
NYU/Stern’s MS in Business Analytics program, executive education, and as the
Introduction to Data Science for NYU’s MS in Data Science. In addition, the book has
been adopted by well over 100 other universities for programs in at least 22 countries
(and counting), in business schools, in data science programs, in computer science
programs, and for more general introductions to data science.
The book’s website gives pointers on how to obtain helpful instructional material,
including lecture slides, sample homework questions and problems, example project
instructions based on the frameworks from the book, exam questions, and more.
We keep an up-to-date list of known adopters on the book’s website.
Click Who’s Using It at the top.
Other Skills and Concepts
There are many other concepts and skills that a practical data scientist needs to know
besides the fundamental principles of data science. These skills and concepts will be
discussed in Chapter 1 and Chapter 2. The interested reader is encouraged to visit the
book’s website for pointers to material for learning these additional skills and concepts
(for example, scripting in Python, Unix command-line processing, datafiles,
common data formats, databases and querying, big data architectures and systems
like MapReduce and Hadoop, data visualization, and other related topics).
Sections and Notation
In addition to occasional footnotes, the book contains boxed “sidebars.” These are
essentially extended footnotes. We reserve these for material that we consider interesting
and worthwhile, but too long for a footnote and too much of a digression for
the main text.
Technical Details Ahead — A note on the starred sections
The occasional mathematical details are relegated to optional “starred”
sections. These section titles will have asterisk prefixes, and
they will be preceded by a paragraph rendered like this one. Such
“starred” sections contain more detailed mathematics and/or more
technical details than elsewhere, and this introductory paragraph
explains its purpose. The book is written so that these sections may
be skipped without loss of continuity, although in a few places we
remind readers that details appear there.
Constructions in the text like (Smith and Jones, 2003) indicate a reference to an entry
in the bibliography (in this case, the 2003 article or book by Smith and Jones); “Smith
and Jones (2003)” is a similar reference. A single bibliography for the entire book
appears in the endmatter.
In this book we try to keep math to a minimum, and what math there is we have simplified
as much as possible without introducing confusion. For our readers with technical
backgrounds, a few comments may be in order regarding our simplifying
choices.
1. We avoid Sigma (Σ) and Pi (Π) notation, commonly used in textbooks to indicate
sums and products, respectively. Instead we simply use equations with ellipses
like this:
$f(\mathbf{x}) = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$
In the technical, “starred” sections we sometimes adopt Sigma and Pi notation
when this ellipsis approach is just too cumbersome. We assume people reading
these sections are somewhat more comfortable with math notation and will not
be confused.
2. Statistics books are usually careful to distinguish between a value and its estimate
by putting a “hat” on variables that are estimates, so in such books you’ll typically
see a true probability denoted $p$ and its estimate denoted $\hat{p}$. In this book we are
almost always talking about estimates from data, and putting hats on everything
makes equations verbose and ugly. Everything should be assumed to be an estimate
from data unless we say otherwise.
3. We simplify notation and remove extraneous variables where we believe they are
clear from context. For example, when we discuss classifiers mathematically, we
are technically dealing with decision predicates over feature vectors. Expressing
this formally would lead to equations like:
$f_R(\mathbf{x}) = x_{\text{Age}} + 0.7 \times x_{\text{Balance}} + 60$
Instead we opt for the more readable:
$f(\mathbf{x}) = \text{Age} + 0.7 \times \text{Balance} + 60$
with the understanding that x is a vector and Age and Balance are components of
it.
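For readers who think more easily in code than in notation, the ellipsis-style sum and the simplified Age/Balance function can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function names are hypothetical, and the example instance (Age = 40, Balance = 100) is invented rather than taken from the book.

```python
def linear_function(weights, features, intercept=0.0):
    """Compute intercept + w1*x1 + w2*x2 + ... + wn*xn."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return intercept + sum(w * x for w, x in zip(weights, features))

def f(instance):
    """The simplified example function: Age + 0.7 * Balance + 60."""
    weights = [1.0, 0.7]
    features = [instance["Age"], instance["Balance"]]
    return linear_function(weights, features, intercept=60.0)

# Hypothetical instance; the attribute values are made up.
example = {"Age": 40, "Balance": 100}
print(f(example))  # approximately 170
```

Here the vector x is represented as a dictionary keyed by attribute name, which mirrors the convention of writing Age and Balance directly instead of subscripted components.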
We have tried to be consistent with typography, reserving fixed-width typewriter
fonts like sepal_width to indicate attributes or keywords in data. For example, in the
text-mining chapter, a word like 'discussing' designates a word in a document while
discuss might be the resulting token in the data.
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements
such as variable or function names, databases, data types, environment
variables, statements, and keywords.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined
by context.
Throughout the book we have placed special inline tips and warnings relevant to the
material. They will be rendered differently depending on whether you’re reading
paper, PDF, or an ebook, as follows:
A sentence or paragraph typeset like this signifies a tip or a suggestion.
This text and element signifies a general note.
Text rendered like this signifies a warning or caution. These are
more important than tips and are used sparingly.
Using Examples
In addition to being an introduction to data science, this book is intended to be useful
in discussions of and day-to-day work in the field. Answering a question by citing
this book and quoting examples does not require permission. We appreciate, but do
not require, attribution. Formal attribution usually includes the title, author, publisher,
and ISBN. For example: Data Science for Business by Foster Provost and Tom
Fawcett (O’Reilly). Copyright 2013 Foster Provost and Tom Fawcett,
If you feel your use of examples falls outside fair use or the permission given above,
feel free to contact us at
Safari® Books Online
Safari Books Online is an on-demand digital library that delivers
expert content in both book and video form from the
world’s leading authors in technology and business.
Technology professionals, software developers, web designers, and business and creative
professionals use Safari Books Online as their primary resource for research,
problem solving, learning, and certification training.
Safari Books Online offers a range of product mixes and pricing programs for organizations,
government agencies, and individuals. Subscribers have access to thousands
of books, training videos, and prepublication manuscripts in one fully searchable
database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley
Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco
Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt,
Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett,
Course Technology, and dozens more. For more information about Safari Books
Online, please visit us online.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have two web pages for this book, where we list errata, examples, and any additional
information. You can access the publisher’s page at
and the authors’ page at
To comment or ask technical questions about this book, send email to bookques
For more information about O’Reilly Media’s books, courses, conferences, and news,
see their website at
Find us on Facebook:
Follow us on Twitter:
Watch us on YouTube:
Acknowledgments
Thanks to all the many colleagues and others who have provided invaluable ideas,
feedback, criticism, suggestions, and encouragement based on discussions and many
prior draft manuscripts. At the risk of missing someone, let us thank in particular:
Panos Adamopoulos, Manuel Arriaga, Josh Attenberg, Solon Barocas, Ron Bekkerman,
Enrico Bertini, Josh Blumenstock, Ohad Brazilay, Aaron Brick, Jessica Clark,
Nitesh Chawla, Brian d’Alessandro, Peter Devito, Vasant Dhar, Jan Ehmke, Theos
Evgeniou, Justin Gapper, Tomer Geva, Daniel Gillick, Shawndra Hill, Nidhi Kathuria,
Ronny Kohavi, Marios Kokkodis, Tom Lee, Philipp Marek, David Martens, Sophie
Mohin, Lauren Moores, Alan Murray, Nick Nishimura, Balaji Padmanabhan, Jason
Pan, Claudia Perlich, Gregory Piatetsky-Shapiro, Tom Phillips, Kevin Reilly, Maytal
Saar-Tsechansky, Evan Sadler, Galit Shmueli, Roger Stein, Nick Street, Kiril Tsemekhman,
Akhmed Umyarov, Craig Vaughan, Chris Volinsky, Wally Wang, Geoff Webb,
Debbie Yuster, and Rong Zheng. We would also like to thank more generally the students
from Foster’s classes, Data Mining for Business Analytics, Practical Data Science,
Data Analytics, Introduction to Data Science, and the Data Science Research
Seminar. Questions and issues that arose when using prior drafts of this book provided
substantive feedback for improving it.
Thanks to all the colleagues who have taught us about data science and about how to
teach data science over the years. Thanks especially to Maytal Saar-Tsechansky, Claudia
Perlich, Shawndra Hill, and Vasant Dhar. Maytal graciously shared with Foster
her notes for her data mining class many years ago. The classification tree example in
Chapter 3 (thanks especially for the “bodies” visualization) is based mostly on her
idea and example. Her ideas and example were also the genesis for the visualization
comparing the partitioning of the instance space with trees and linear discriminant
functions in Chapter 4, and the “Will David Respond” example in Chapter 6 is based on her
example, as are probably other things long forgotten. Claudia has taught companion
sections of Data Mining for Business Analytics/Introduction to Data Science along
with Foster for the past few years, and has taught him much about data science in the
process (and beyond). Shawndra helped Foster with putting together his new kind of
data mining class over a decade ago. And way back in the 1990s Vasant taught the
first data mining course for a business audience, and invited Foster (then an industry
data scientist) to guest lecture about real-world data mining applications.
Thanks to David Stillwell, Thore Graepel, and Michal Kosinski for providing the
Facebook Like data for some of the examples. Thanks to Nick Street for providing the
cell nuclei data and for letting us use the cell nuclei image in Chapter 4. Thanks to
David Martens for his help with the mobile locations visualization. Thanks to Chris
Volinsky for providing data from his work on the Netflix Challenge. Thanks to Sonny
Tambe for early access to his results on big data technologies and productivity.
Thanks to Patrick Perry for pointing us to the bank call center example used in Chapter 12.
Thanks to Geoff Webb for the use of the Magnum Opus association mining system.
Thanks especially to our editor Mike Loukides, who shared our vision for a different
sort of book, and the entire O’Reilly team for helping us to make it a reality.
Most of all we thank our families for their love, patience and encouragement.
A great deal of open source software was used in the preparation of this book and its
examples. The authors wish to thank the developers and contributors of:
Python and Perl
Scipy, Numpy, Matplotlib, and Scikit-Learn
The Machine Learning Repository at the University of California at Irvine (Bache & Lichman, 2013)
Finally, we encourage readers to check our website for updates to this material, new
chapters, errata, addenda, and accompanying slide sets.
—Foster Provost and Tom Fawcett