Journal of Sports Analytics 4 (2018) 85–93
DOI 10.3233/JSA-170171
IOS Press
Using multi-class classification methods
to predict baseball pitch types
Glenn Sidleand Hien Tran
Department of Applied Mathematics, North Carolina State University, Raleigh, NC, USA
Abstract. Since the introduction of PITCHf/x in 2006, there has been a plethora of data available to anyone who wants access to the minute details of every baseball pitch thrown over the past nine seasons. Everything from the initial velocity
and release point to the break angle and strike zone placement is tracked, recorded, and used to classify the pitch according
to an algorithm developed by MLB Advanced Media (MLBAM). Given these classifications, we developed a model that
would predict the next type of pitch thrown by a given pitcher, using only data that would be available before he even stepped
to the mound. We used data from three recent MLB seasons (2013-2015) to compare individual pitcher predictions based
on multi-class linear discriminant analysis, support vector machines, and classification trees, leading to the development of
a real-time, live-game predictor. Using training data from the 2013, 2014, and part of the 2015 season, our best method
achieved a mean out-of-sample predictive accuracy of 66.62%, and a real-time success rate of over 60%.
Keywords: Baseball, pitch prediction, machine learning, PITCHf/x
1. Introduction
Ever since Bill James published his work on saber-
metrics four decades ago, Major League Baseball
(MLB) has been on the forefront of sports analyt-
ics. While the massive amount of statistical data is
generally used to examine historical player perfor-
mance on the field in an effort to make coaching and
personnel decisions, there is great potential for predictive models that has largely gone untapped. The implementation of the PITCHf/x system and the public distribution of large amounts of pitching data over the internet has sparked the use of machine learning methods for prediction, not just analysis.
As shown in Fig. 1, pitchers have been getting bet-
ter and better at preventing hits, lowering the average
ERA and batting average across the league. While
hitting a major league pitch will always be an incred-
ibly difficult task, in this work we hypothesize that
knowing what type of pitch is coming next may help
the batter decide to swing or not and work to get
him on base.

Corresponding author: Glenn Sidle, Department of Applied Mathematics, North Carolina State University, 2108 SAS Hall, Box 8205, Raleigh, NC 27695, USA. Tel.: +1 860 941 5928; E-mail: gsidle.math@gmail.com.

In this paper, we compare three different machine learning techniques and their predictive
abilities, seeking to find what feature inputs are the
most informative and to develop a blind prediction
in an attempt to anticipate the next pitch type. By
comparing different techniques, we were able to find
which would work best in a live game environment, attempting to predict each pitch's type before it is thrown.
1.1. Literature review
Previous research has mostly focused on a binary
prediction, commonly a fastball vs. non-fastball split.
This prediction has resulted in accuracy around
70% (Ganeshapillai and Guttag, 2012), with varying
degrees of success for individual pitchers and differ-
ent methods. This binary classification using support
vector machines was expanded upon using dynamic
feature selection by Hoang (2015), improving the
results by approximately 8 percent. Because many
classification methods were originally designed for
binary classification, this prediction method makes
sense, but we wish to expand it further into predicting
multiple pitch types.
2215-020X/18/$35.00 © 2018 – IOS Press and the authors. All rights reserved
This article is published online with Open Access and distributed under the terms of the Creative Commons Attribution Non-Commercial License (CC BY-NC 4.0).
Fig. 1. League-wide Batting Average and ERA over the past
decade.
Very limited attempts have been made at multi-
class prediction prior to our work. Bock (2015) used
a support vector machine approach with a linear ker-
nel function, focusing mostly on finding a measure
of predictability for a pitcher and comparing that
predictability against long-term pitcher performance.
The author only examined four pitch types and had a very limited out-of-sample testing set, evaluating the method's accuracy on 14 pitchers during a single World Series, with an average accuracy of just under 61%. Woodward (2014) used decision trees to
again only predict up to four different types of pitches,
and only shows results for two individual pitchers.
We expand on their work by considering up to seven different pitch types and by applying machine learning to many more pitchers over a longer period of time.
Along with a much larger test set, we also investi-
gate which methods perform the best, if our results
share any correlation with standard pitching statistics,
what variables are most important for prediction, and
finally our ability to predict the next pitch type in a
live game situation.
2. Methods
2.1. Data
The introduction of PITCHf/x was a revolutionary
development in baseball data collection. Cameras,
installed in all 30 MLB stadiums, record the speed
and location of each pitch at 60 Hz, and data is made
available to the public through a variety of online
resources. MLB Advanced Media uses a neural-
network based algorithm to classify those pitches,
giving a confidence in the classification along with
the type of pitch (Fast, 2010). This information is
added to the PITCHf/x database, along with the measured characteristics of the pitch and details about the game situation. Using the pitch data provided at
gd2.mlb.com, we were able to retrieve every pitch
from the 2013, 2014, and 2015 seasons. Using 22 fea-
tures from each pitch, we created data sets for every
individual pitcher, adding up to 81 additional features
to each data set, depending on how many types of
pitches the individual threw. Here we consider seven pitch categories (with given integer values): fastball (FF, 1), cutter (CT, 2), sinker (SI, 3), slider (SL, 4), curveball (CU, 5), changeup (CH, 6), and knuckleball (KN, 7). We kept only pitches with a type confidence (the MLBAM algorithm's confidence that its classification is correct) greater than 80%.
We restricted our data set only to pitchers who
threw at least 500 pitches in both the 2014 and 2015
seasons, which left us with 287 total unique pitchers,
150 starters and 137 relievers as designated by ESPN.
The average size of the data set for each pitcher was
4,682 pitches, with the largest 10,343 pitches and the
smallest 1,108 pitches. Because each pitcher threw
a different number of unique pitch types, not all the
datasets are the same size. At the most, a pitcher could
have 103 features associated with each pitch and at
the minimum he could have 63. The average pitcher
had 81 features.
Table 1 gives a list of the features used, both those
that can be taken from the immediate game situation
and the features we generated using the historical data
for both pitcher and batter. Because of the size of
the feature set, similar features are grouped together
in the table, i.e. group 16 contains the previous pitch’s
type, result, break angle, break length, break height,
and the zone where it crossed the plate. Groups 19–26
have variable sizes due to the number of types of
pitches each pitcher throws. Groups 27–29 are unique
for each batter, containing the percent of each type
of pitch he puts in play, has a strike on, or takes a
ball on.
2.2. Model development
We decided to employ three different classification-based methods to compare and contrast results from all 287 pitchers in our data set. First, we used multi-class Linear Discriminant Analysis because of its speed and efficiency. Then, to compare our results to Bock (2015) and Woodward (2014), we used support vector machines and classification trees. In an effort to reduce variance between models and increase accuracy, we employed a committee method, using ten of each model type and taking the majority vote as output.

Table 1
Feature groups for each pitch. Tendency refers to the percentage of each pitch type
Number Feature Group Type of Variable
1 Inning Categorical
2 Top or Bottom Binary
3 Outs Categorical
4 Order Position Categorical
5 Total At-Bat Categorical
6 Score Spread Categorical
7 Time of Day Categorical
8 Batter Handedness Binary
9 Strikes Categorical
10 Balls Categorical
11 On Base Binary
12 Base Score Categorical
13 Previous At-Bat Result Categorical
14 Previous Pitch Result Categorical
15 Previous Pitch Type Categorical
16 Previous Pitch Location Categorical
17 Pitch Number Categorical
18 Previous Pitch Speed, Break Angle, Break Length, Break Height Continuous
19 Previous 5 Pitch Tendency Continuous
20 Previous 10 Pitch Tendency Continuous
21 Previous 20 Pitch Tendency Continuous
22 Previous 5 Pitch Strike Tendency Continuous
23 Previous 10 Pitch Strike Tendency Continuous
24 Previous 20 Pitch Strike Tendency Continuous
25 Pitcher Historical Tendency Continuous
26 Pitcher Tendency vs. Batter Continuous
27 Batter Strike Tendency Continuous
28 Batter In-Play Tendency Continuous
29 Batter Ball Tendency Continuous
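As an illustration of this committee step (our models are MATLAB ensembles; the stand-in models below are hypothetical), a majority vote can be sketched in Python as:

```python
from collections import Counter

def committee_predict(models, x):
    """Predict a pitch type by majority vote over a committee of models.

    Each model is any object with a .predict(x) method returning a
    pitch-type label; ties are broken by the first label counted.
    """
    votes = [m.predict(x) for m in models]
    label, _ = Counter(votes).most_common(1)[0]
    return label

# Toy stand-in models that always return a fixed label.
class Stub:
    def __init__(self, label):
        self.label = label
    def predict(self, x):
        return self.label

committee = [Stub("FF")] * 6 + [Stub("SL")] * 4
print(committee_predict(committee, x=None))  # majority of 10 votes -> "FF"
```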
2.3. Linear Discriminant Analysis
Linear Discriminant Analysis (LDA) is a method
descended from R.A. Fisher’s linear discriminant
first introduced by Fisher (1936). Assuming two classes of observations have respective mean-covariance pairs $(\mu_0, \Sigma_0)$ and $(\mu_1, \Sigma_1)$, the linear combinations $w \cdot x$ have mean-covariance pairs $(w \cdot \mu_i, w^T \Sigma_i w)$ for $i = 0, 1$. Fisher defined the separation between the two distributions (and therefore classes) as the ratio of the variance between the two classes to the variance within each class, i.e.

$$S = \frac{\sigma^2_{\mathrm{between}}}{\sigma^2_{\mathrm{within}}} = \frac{(w \cdot \mu_1 - w \cdot \mu_0)^2}{w^T \Sigma_1 w + w^T \Sigma_0 w} = \frac{(w \cdot (\mu_1 - \mu_0))^2}{w^T (\Sigma_0 + \Sigma_1) w},$$

where the maximum separation between the classes is found when $w \propto (\Sigma_0 + \Sigma_1)^{-1} (\mu_1 - \mu_0)$.
To go from the linear discriminant to LDA, we use the assumption that the class covariances are the same, i.e. $\Sigma_0 = \Sigma_1 = \Sigma$; the proportionality above then leads to the decision rule $w \cdot x > c$, where

$$w = \Sigma^{-1}(\mu_1 - \mu_0), \qquad c = \frac{1}{2}\left(\mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0\right),$$

and so the decision of which class $x$ belongs to depends on whether the linear combination satisfies the inequality. To extend LDA to multi-class classification, the same assumption is made that each of $N$ classes has a unique mean $\mu_i$ but the same covariance $\Sigma$. The between-class covariance is found by

$$\Sigma_b = \frac{1}{N} \sum_{i=1}^{N} (\mu_i - \mu)(\mu_i - \mu)^T,$$

where $\mu$ is the mean of the class means, and then class separation is determined by

$$S = \frac{w^T \Sigma_b w}{w^T \Sigma w}.$$
We use a regularization parameter to shrink the covariance matrix $\Sigma$ toward the average eigenvalue of $\Sigma$. In the MATLAB implementation of LDA, $N(N-1)/2$ pairwise separations are made, making it comparable to the one-vs-one (OvO) technique for support vector machines. The final class decision is made by minimizing the sum of the misclassification error (MathWorks, 2016a).
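As an illustrative numeric check of the two-class rule above (a pure-Python sketch with made-up two-feature data, not our MATLAB implementation), we can compute $w = \Sigma^{-1}(\mu_1 - \mu_0)$ and the threshold $c$ directly:

```python
def mean(pts):
    n = len(pts)
    return [sum(p[i] for p in pts) / n for i in (0, 1)]

def pooled_cov(pts0, pts1, mu0, mu1):
    # Pooled 2x2 covariance under the shared-covariance assumption.
    def add(acc, pts, mu):
        for x, y in pts:
            dx, dy = x - mu[0], y - mu[1]
            acc[0][0] += dx * dx; acc[0][1] += dx * dy
            acc[1][0] += dy * dx; acc[1][1] += dy * dy
    acc = [[0.0, 0.0], [0.0, 0.0]]
    add(acc, pts0, mu0); add(acc, pts1, mu1)
    n = len(pts0) + len(pts1)
    return [[v / n for v in row] for row in acc]

def inv2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def lda_rule(pts0, pts1):
    mu0, mu1 = mean(pts0), mean(pts1)
    s_inv = inv2(pooled_cov(pts0, pts1, mu0, mu1))
    # w = Sigma^{-1} (mu1 - mu0)
    d = [mu1[0] - mu0[0], mu1[1] - mu0[1]]
    w = [s_inv[0][0] * d[0] + s_inv[0][1] * d[1],
         s_inv[1][0] * d[0] + s_inv[1][1] * d[1]]
    # c = (mu1^T S^-1 mu1 - mu0^T S^-1 mu0) / 2
    def quad(mu):
        return (mu[0] * (s_inv[0][0] * mu[0] + s_inv[0][1] * mu[1])
                + mu[1] * (s_inv[1][0] * mu[0] + s_inv[1][1] * mu[1]))
    c = 0.5 * (quad(mu1) - quad(mu0))
    return w, c

class0 = [(0, 0), (1, 0), (0, 1), (1, 1)]
class1 = [(4, 4), (5, 4), (4, 5), (5, 5)]
w, c = lda_rule(class0, class1)
# Classify a point x by testing w . x > c.
x = (4, 4)
print(w[0] * x[0] + w[1] * x[1] > c)  # True: assigned to class 1
```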
2.4. Multi-class Support Vector Machines
Support vector machines (SVM) were first
designed as a method of binary classification. The
extension of binary SVMs to a multi-class method has
led to two common approaches to multi-class classification. Unlike Bock (2015), however, we employed a C-SVM formulation with the radial basis kernel function. For a problem with $N$ distinct classes, the one-vs-all (OVA) method creates $N$ SVMs and places the unknown value in whichever class has the largest decision function value.
For our model, we used the one-vs-one (OVO) method, which creates $N(N-1)/2$ SVMs for classification. For a set of $p$ training points $(x_1, y_1), \ldots, (x_p, y_p)$, where $x_l \in \mathbb{R}^n$, $l = 1, \ldots, p$, and $y_l \in \{1, \ldots, N\}$ is the class label for $x_l$, the SVM trained on the $i$th and $j$th classes solves

$$\min_{w^{ij}, b^{ij}, \xi^{ij}} \; \frac{1}{2}(w^{ij})^T w^{ij} + C \sum_l \xi_l^{ij}$$

subject to

$$(w^{ij})^T \phi(x_l) + b^{ij} \ge 1 - \xi_l^{ij} \quad \text{if } y_l = i,$$
$$(w^{ij})^T \phi(x_l) + b^{ij} \le -1 + \xi_l^{ij} \quad \text{if } y_l = j,$$
$$\xi_l^{ij} \ge 0,$$

where $\phi$ maps the training data to a higher-dimensional space in which the kernel is the radial basis function $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j) = e^{-\gamma \|x_i - x_j\|^2}$, and $C$ is the penalty parameter. After all comparisons have been done, the unknown value is classified according to whichever class has the most votes from the assembled SVMs.
Figure 2 shows a basic representation of the differences between the OVA and OVO methods. The middle triangle formed by the OVA method leaves a gap where the classification algorithm can fail to place an unknown value; since the OVO method does not have any blind spots, we used it for our classification. We used a modified grid search from Finkel (2003) to optimize both parameters $C$ and $\gamma$, using the five-fold cross-validation accuracy of the training set to find the optimal values.
Fig. 2. A visual representation of the one-vs-all method (thin
lines) compared to the one-vs-one method (thick line) from (Aisen,
2006).
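The OVO voting scheme itself is easy to sketch; in the Python illustration below, hypothetical pairwise deciders stand in for trained SVMs:

```python
from collections import Counter
from itertools import combinations

def ovo_predict(classes, pairwise_decide, x):
    """One-vs-one voting over the N(N-1)/2 pairwise classifiers.

    pairwise_decide(i, j, x) returns whichever of the two class labels
    i, j the (i, j) classifier prefers for input x.
    """
    votes = Counter(pairwise_decide(i, j, x)
                    for i, j in combinations(classes, 2))
    return votes.most_common(1)[0][0]

# Stand-in decider: pick the class whose "center" is nearer to x.
centers = {"FF": 0.0, "SL": 5.0, "CH": 9.0}
def decide(i, j, x):
    return i if abs(x - centers[i]) <= abs(x - centers[j]) else j

print(ovo_predict(list(centers), decide, x=4.2))  # -> "SL"
```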
2.5. Classification trees
In order to compare our results to Woodward
(2014), we also implemented classification trees,
using random forests. We used the MATLAB CART (Classification And Regression Trees) implementation that creates aggregate random forests of trees, known as TreeBagger.
Classification Trees are a specific type of binary
decision tree that give a categorical classification
as an output. The input set is used in the nodes of
the tree, determining which feature to split at each
node and what criterion to base that decision on.
To determine node impurity, MATLAB classification trees use the Gini diversity index (gdi), given by

$$I = 1 - \sum_{i=1}^{N} p^2(i),$$

where $N$ is the number of classes and $p(i)$ is the fraction of class $i$ that reaches the node. The gdi is a measure of the expected error rate at the node if the class is randomly selected according to the distribution of the classes at that node.
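As a quick sketch of the gdi computation (pure Python; the label lists are made-up examples):

```python
def gini_impurity(labels):
    """Gini diversity index I = 1 - sum_i p(i)^2 over class fractions."""
    n = len(labels)
    fractions = [labels.count(c) / n for c in set(labels)]
    return 1.0 - sum(p * p for p in fractions)

# A pure node has impurity 0; an even two-class split has impurity 0.5.
print(gini_impurity(["FF", "FF", "FF", "FF"]))  # 0.0
print(gini_impurity(["FF", "SL", "FF", "SL"]))  # 0.5
```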
For each node, the tree first computes the impurity at the node, then sorts the observations to determine which features can be used as splitting criteria, or cut points. For all splitting points, the tree seeks to maximize the decrease in impurity, $\Delta I$, by splitting the observations at the node into two child nodes and measuring $\Delta I$ for different cut points. Once a feature is chosen as the best splitting candidate, the feature is split using the best cut point, and the process is repeated until the total impurity is minimized and the final leaf nodes are found. To combat overfitting, classification trees are then pruned by merging leaves that share the most common class per leaf (MathWorks, 2016b).
Table 2
100 CT pitch-specific model predictions for Odrisamer Despaigne, overall accuracy 53.30%
Predicted Pitch Type
FF CT SI SL CU CH KN % Thrown % of Each Correct
Actual Type
FF 330 5 77 2 17 2 0 28.60 76.21
CT 55 42 19 2 1 8 3 8.59 32.31
SI 108 5 312 2 19 10 0 30.12 68.42
SL 31 2 23 10 2 6 2 5.02 13.16
CU 36 2 43 3 54 3 1 9.38 38.03
CH 72 1 34 0 2 33 2 9.51 22.92
KN 63 3 30 3 5 3 26 8.78 19.55
Table 3
Average values to compare support vector machines (SVM), random forests of 100 classification trees
(100 CT), and linear discriminant analysis (LDA), with average naive accuracy 54.38%.
Value LDA SVM 100 CT
Prediction Accuracy (%) 65.08 64.49 66.62
# of Predictions >Naive 263 251 282
% of Predictions >Naive 91.64 87.46 98.26
P̄_I (%) 10.70 10.11 12.24
P̄_B (%) 13.26 12.38 12.52
P̄_W (%) –9.08 –5.40 –1.15
Range of Committee (%) 1.52 3.02 2.22
Time (s) 22.75 2,383.8 72.05
3. Results
3.1. Overall prediction accuracy
To establish a value for comparison, we found
the best “naive” guess accuracy, similar to that used
by Ganeshapillai and Guttag (2012). We define this
naive best guess to be the percent of time each
pitcher throws his preferred pitch from the training
set in the testing set. Consider a pitcher who throws pitch types 1, 2, 4, and 5 with distribution $P = \{p_1, p_2, p_4, p_5\}$, where $\sum_i p_i = 1$; if his preferred training-set pitch is $\max(P_{\mathrm{train}}) = p_2$, then the naive guess for that pitcher is $P_{\mathrm{test}}(p_2)$. For example, since Jake Arrieta threw his sinker the most in the training set (26.31% of his pitches), we take as the naive guess the percentage of the time he threw a sinker in the testing set, which gives 34.03%. With the random forest method, we predicted 48.33% of his pitches correctly, so we beat the naive guess by 14.30% in his case. Averaged over all 287 pitchers, the naive guess was 54.38%.
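The naive baseline can be sketched in a few lines (the pitch lists below are made-up examples, not Arrieta's actual data):

```python
from collections import Counter

def naive_guess(train_pitches, test_pitches):
    """Naive accuracy: the test-set frequency of the pitch type the
    pitcher threw most often in the training set."""
    preferred, _ = Counter(train_pitches).most_common(1)[0]
    return test_pitches.count(preferred) / len(test_pitches)

train = ["SI"] * 40 + ["FF"] * 35 + ["SL"] * 25  # sinker favored in training
test = ["SI"] * 34 + ["FF"] * 40 + ["SL"] * 26
print(naive_guess(train, test))  # 0.34
```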
Table 2 shows a breakdown of each type of pitch
predicted for Odrisamer Despaigne by the random
forest method, as well as the accuracy for each specific pitch, showing that for each pitch type the model's accuracy improves on the naive guess. We take this style of comparison from Woodward (2014), who gives an outline of a decision tree
based prediction model, but does not go into detail or
use more than a handful of examples, so we cannot
fully compare to his results.
Table 3 shows the prediction results from each
individual method. On average accuracy alone, Clas-
sification Trees had the best prediction accuracy of
66.62%. The number of pitchers we predicted better
than naive is given, as well as the percentage of the
287 total pitchers that number represents. The average prediction accuracy is shown in Table 3, along with the overall average improvement over the naive guess, denoted $\bar{P}_I$, the average improvement for those pitchers who beat the naive guess, denoted $\bar{P}_B$, and the average amount by which the pitchers who did not beat the naive guess fell short, denoted $\bar{P}_W$. Given $N$ pitchers with respective prediction accuracies $P_i$ and naive guesses $G_i$, where $N_B$ is the number who did better than the naive guess and $N_W$ the number who did worse, we find

$$\bar{P}_I = \frac{1}{N} \sum_{i}^{N} (P_i - G_i), \qquad \bar{P}_B = \frac{1}{N_B} \sum_{i}^{N_B} (P_i - G_i), \qquad \bar{P}_W = \frac{1}{N_W} \sum_{i}^{N_W} (P_i - G_i).$$
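These three averages can be computed directly from paired accuracy and naive-guess lists; a sketch with toy values:

```python
def improvement_stats(accuracies, naive_guesses):
    """Return (P_I, P_B, P_W): mean improvement over the naive guess for
    all pitchers, for those who beat it, and for those who did not."""
    diffs = [p - g for p, g in zip(accuracies, naive_guesses)]
    beat = [d for d in diffs if d > 0]
    worse = [d for d in diffs if d <= 0]
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return mean(diffs), mean(beat), mean(worse)

# Toy percent accuracies for three hypothetical pitchers.
p_i, p_b, p_w = improvement_stats([66.0, 60.0, 50.0], [54.0, 52.0, 56.0])
print(round(p_i, 2), p_b, p_w)  # 4.67 10.0 -6.0
```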
We also give the average range of accuracy between the most and least accurate members of each committee, as well as the average time for each pitcher's model to be trained and tested.

Table 4
Average prediction accuracy for each pitch count for the Classification Tree method. Pitcher-favored counts are shown in bold, batter-favored counts in italics.
Count (B-S) 100 CT
0-0 71.48
0-1 64.77
0-2 62.27
1-0 70.01
1-1 61.15
1-2 58.94
2-0 74.78
2-1 67.43
2-2 59.84
3-0 83.00
3-1 75.62
3-2 67.53

As shown
in Table 3, the random forests of classification trees
outperformed both LDA and SVM by a wide margin.
Judging solely by how many pitchers were predicted better than the naive guess, the random forests were near-perfect, which also leads to their higher average prediction accuracy and improvement. LDA outperforms the random forests only on the average improvement for those pitchers for whom we beat the naive guess, but conversely performs much worse for the pitchers for whom we do not. At this stage, we undertook
further comparative analysis to determine if the
random forests were the best method overall.
3.2. Prediction by count
A common analysis in any pitch prediction is
breaking down prediction success rate by each pitch
count. There are twelve possible counts for any at-
bat, three where the batter and pitcher are even, three
where the pitcher is ahead in the count (more strikes
than balls), and six where the batter is ahead (more
balls than strikes). Consistent with other works, on the heavily batter-favored count of 3-0 we were able to predict 104 pitchers (36.24% of the 287) perfectly, i.e. for every pitch they threw on a 3-0 count, we predicted the type exactly. The total counts and average
success rates for the random forest classification tree
method are given in Table 4. Pitcher ahead counts are
bolded, batter-favored counts are italicized.
The high success rate on counts in which the batter
is ahead is not surprising, given that a pitcher is more
likely to throw a controllable pitch in order to even
the count or to avoid a walk. Batter-behind counts
give the pitcher much more freedom, which explains
the lower average predictability.
3.3. Comparison with standard statistics
In an effort to determine if the prediction suc-
cess correlated with any standard measure of pitcher
success, we ran a linear regression analysis to find
the correlations of the random forest model's prediction accuracy, and its improvement over the naive guess, with the pitchers' wins-above-replacement (WAR) and fielding-independent-pitching (FIP) statistics. FIP is an extension of a pitcher's earned run average (ERA) that examines only outcomes over
which the pitcher had control. To find a measure of the diversity of a pitcher's pitch selection, we also compared the prediction accuracy, improvement, WAR, and FIP to the Herfindahl-Hirschman Index (HHI).
HHI is inversely proportional to the diversity of pitch selection, and can be expressed as

$$H = \sum_{i=1}^{N} p_i^2,$$

where $p_i$ is the percentage of each unique pitch type a pitcher throws. As a measure of pitch diversity, the HHI ranges from 0.1428 (if a pitcher throws all 7 pitches an equal amount) to 1 (if a pitcher throws a single pitch all the time). The correlation coefficients,
along with the intercepts for each best-fit regression
line, are shown below in Table 5 alongside the R2
scores. We include the T-statistics and P-values for
all the coefficients as a measure of confidence in the
values.
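The HHI itself reduces to a one-line sum over the pitch-type shares; a sketch with made-up shares:

```python
def hhi(shares):
    """Herfindahl-Hirschman Index H = sum_i p_i^2 over pitch-type shares p_i."""
    assert abs(sum(shares) - 1.0) < 1e-9, "shares must sum to 1"
    return sum(p * p for p in shares)

print(hhi([1 / 7] * 7))         # 7 equally used pitches -> 1/7, about 0.1428
print(hhi([1.0]))               # a single pitch -> 1.0
print(hhi([0.55, 0.25, 0.20]))  # a hypothetical three-pitch mix -> about 0.405
```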
Overall, given the very low R² values for most pairs of variables we examined, the only correlation from which we can draw any conclusions is between the HHI and overall model accuracy, shown in Fig. 3. The higher the HHI (and therefore the less diverse the pitch selection), the more accurate the model, most likely because the pitcher is throwing a single pitch type a large fraction of the time. While the regression slopes of the other pairings might suggest interesting correlations, the excessively small R² values leave little to no statistical strength in any conclusions we could draw.
Table 5
Linear regression R² values, slopes, intercepts, p-values, and t-statistics for the given random forest model accuracy (Acc) and improvement (Imp) values and standard statistics
X Var Y Var R² Slope Intercept Slope P-val Slope T-stat Int. P-val Int. T-stat
HHI Acc 0.777 0.747 0.332 0.000 31.323 0.000 29.897
HHI Imp 0.069 –0.279 0.248 6.752e-6 –4.588 0.000 8.744
HHI FIP 0.039 –1.437 4.512 0.001 –3.370 0.000 22.699
Acc FIP 0.039 –1.703 5.002 0.001 –3.382 0.000 14.711
Acc WAR 0.014 –1.824 2.358 0.042 –2.039 1.191e–4 3.903
Imp WAR 0.012 –1.335 1.305 0.063 –1.865 0.000 10.023
HHI WAR 0.001 –0.399 1.321 0.602 –0.523 2.455e–4 3.714
Imp FIP 7.997e–6 0.019 3.865 0.962 0.047 0.000 51.733
Imp FIP 0.041 –0.363 4.602 0.368 –0.902 0.000 20.526
HHI –1.539 0.001 –3.479
Acc FIP 0.041 –0.927 4.819 0.385 –0.870 0.000 11.870
HHI –0.745 0.410 –0.825
Acc WAR 0.040 –6.311 3.416 0.001 –3.371 2.699e–6 4.791
HHI 4.313 0.007 2.719
Imp WAR 0.016 –1.541 1.703 0.039 –2.077 2.645e–5 4.273
HHI –0.829 0.293 –1.054

Fig. 3. Linear regression fit line between the Herfindahl-Hirschman Index and the random forest model prediction accuracy, with intercept 0.332 and correlation 0.746.

4. Variable importance

Post-processing techniques can be used to determine what features are the most important in a model, so we used the models created for the results
previously discussed to find measures of variable
importance with the permuted variable delta error
(PVDE) for the random forests of classification trees.
The PVDE is found during the construction of each random forest: for each variable, the expected error $E_{O_i}$ is first found against a hold-out validation set, similar to the cross-validation used for the parameter optimization. The values of a particular variable $x_i$ are then randomly permuted across every observation in the subset of the training data used for the tree construction, and the expected error $E_{P_i}$ is found against the same hold-out set; the delta error for the variable is the difference between the two.
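The idea can be sketched with a toy model and a deterministic stand-in for the random permutation (a real implementation permutes randomly; the data and model here are made up):

```python
def error_rate(model, X, y):
    return sum(model(row) != label for row, label in zip(X, y)) / len(y)

def permuted_delta_error(model, X, y, col):
    """PVDE sketch: error after scrambling one feature column, minus the
    baseline error on the same hold-out set."""
    baseline = error_rate(model, X, y)
    # Deterministic stand-in for a random permutation: reverse the column.
    permuted_col = [row[col] for row in X][::-1]
    X_perm = [row[:col] + [v] + row[col + 1:]
              for row, v in zip(X, permuted_col)]
    return error_rate(model, X_perm, y) - baseline

# Toy hold-out set: the label depends only on feature 0; feature 1 is noise.
X = [[0, 9], [1, 3], [0, 7], [1, 1], [0, 5], [1, 2]]
y = ["FF", "SL", "FF", "SL", "FF", "SL"]
model = lambda row: "SL" if row[0] == 1 else "FF"

print(permuted_delta_error(model, X, y, col=0))  # 1.0: feature 0 matters
print(permuted_delta_error(model, X, y, col=1))  # 0.0: feature 1 is ignored
```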
Table 6 gives the ranks of the permuted variable delta error for each of the 29 input feature groups. The ranks were found by first averaging the values for each pitcher, then sorting those averages by magnitude, and then averaging each rank across each variable in the group.

Table 6
Variable importance by permuted variable delta error for all pitchers. 1 means highest importance, 29 means lowest importance
Feature Group PVDE
Inning 16
Top or Bottom 29
Outs 27
Order Position 18
Total At-Bat 6
Score Spread 21
Time of Day 25
Batter Handedness 7
Strikes 2
Balls 3
On Base 28
Base Score 19
Previous At-Bat Result 24
Previous Pitch Result 10
Previous Pitch Type 4
Previous Pitch Location 8
Pitch Number 1
Previous Pitch Stats 5
Previous 5 Pitch Tendency 13
Previous 10 Pitch Tendency 17
Previous 20 Pitch Tendency 14
Previous 5 Pitch Strikes 11
Previous 10 Pitch Strikes 15
Previous 20 Pitch Strikes 20
Pitcher Historical Tendency 26
Pitcher Tendency vs. Batter 23
Batter Strike Tendency 22
Batter In-Play Tendency 12
Batter Ball Tendency 9
Once the group ranks were found, we sorted the averaged group ranks to find the overall importance.

Table 7
Python live pitch predictions for September 1st through October 2nd, 2016, with overall accuracy 59.07%
Predicted Pitch Type
FF CT SI SL CU CH KN % Thrown % of Each Correct
Actual Type
FF 52774 329 871 2124 1300 681 22 52.13 90.83
CT 2688 1353 203 24 190 127 3 4.12 29.49
SI 693 171 5381 417 211 101 0 6.26 77.16
SL 12243 59 1475 3110 90 98 0 15.32 18.21
CU 9775 340 650 183 2092 232 0 11.91 15.76
CH 8539 165 982 318 307 911 0 10.07 8.12
KN 9 0 0 0 0 0 183 0.17 95.31
The results of the analysis are mostly unsurprising,
but again are helpful to show that the model works
the way we would expect it to. The pitch number
can be a measure of how tired a pitcher is, which
would greatly affect his pitch choice. As shown by the results by count, the numbers of balls and strikes also affect pitch selection, so it is unsurprising that those inputs rank second and third in importance.
5. Live pitch prediction
At the start of this research, one of the reasons
we examined different machine learning methods of
prediction was to determine what would work best
in real time in a live game environment. The previ-
ous experiments were all done in a “bulk” setting,
i.e. predicting all of the testing set all at once. While
this gives a way to measure the effectiveness of each
method, the construction of the testing datasets was
not reflective of the way a dataset would be built
during an actual baseball season. Any live-prediction training set could only be updated after each game, and would only reflect historical pitcher or batter tendencies up to the day before a game was played.
The data for the live predictions was parsed
appropriately, creating pitcher preferences and batter
performance measures up until the day being pre-
dicted. We examined the games in the regular season
of September and October 2016, creating models
for each pitcher for not only predicting the type of
pitch thrown, but also the speed of the pitch and
the location of the pitch. Models were created for
every pitcher who pitched in September and October,
as long as he had pitched at some point after the
All-Star break (mid-July) and before September 1st.
There was a large amount of data available to test
on, and the characteristics of the data are shown in
Table 7.
While the overall prediction accuracy for all the
pitches thrown was 59.07%, the average accuracy
across each pitcher in each game was 60.69%.
6. Conclusion and future work
Because Bock (2015) and Woodward (2014) are
the only examples of multi-class pitch prediction we
have found, they are the standard for comparison. An
example of pitch prediction using Markov Chains was
done by Malter (2016), but it is not a situational-based
model. Our model takes data that is available in the moments before the next pitch is thrown and gives the batter and manager better knowledge of what is coming than they would otherwise have had. Our results are better than those of any other purely predictive multi-class pitch-type model thus far.
Moving forward, we plan to employ a feature selection method similar to the one used by Hoang (2015) to find which inputs are most important to the prediction, and to examine whether reducing the size of the feature vectors may improve the prediction by helping to avoid the curse of dimensionality. Due to the construction of the multi-class problem, implementing
pre-processing techniques such as F-score or ROC
curve analysis may require the introduction of classi-
fication using a Directed Acyclic Graph. Using these
pre-processing techniques along with the informa-
tion learned from the variable importance may help
improve the live pitch predictors as well.
References
Aisen, B., 2006, A Comparison of Multi-Class SVM Methods. URL: http://courses.media.mit.edu/2006fall/mas622j/Projects/aisen-project/.
Bock, J.R., 2015, Pitch Sequence Complexity and Long-Term
Pitcher Performance, Sports, 40-55.
Fast, M., 2010, The Internet cried a little when you wrote that on
it, The Hardball Times.
Finkel, D., 2003, DIRECT Optimization Algorithm User Guide,
Center for Research in Scientific Computation, NCSU.
Fisher, R., 1936, The use of multiple measurements in taxonomic problems, Annals of Eugenics 7, 179-188.
Ganeshapillai, G. and Guttag, J., 2012, Predicting the Next Pitch,
Proceedings of the MIT Sloan Sports Analytics Conference.
Hoang, P., 2015, Supervised Learning in Baseball Pitch Prediction
and Hepatitis C Diagnosis, NC State University, Ph.D. Thesis.
Malter, D., 2016, Using Markov Chains to Predict Pitches. URL: http://danmalter.github.io/r/2016/03/28/Markov-chains.html.
MathWorks, 2016a, Discriminant Analysis. URL: http://www.mathworks.com/help/stats/discriminant-analysis.html.
MathWorks, 2016b, fitctree.m documentation. URL: http://www.mathworks.com/help/stats/fitctree.html.
Woodward, N., 2014, A Decision Tree Approach to Pitch Predic-
tion, The Hardball Times.
... WAR represents the indicator of how many victories a player can bring to the team, and it is then converted into the market value of the player. Sidle and Tran [2] collected pitcher competition data from 2013 to 2015 and used multi-class linear discriminant analysis, support vector machines (SVM), and decision trees (DTs) to predict next type of pitch; they developed a real-time and live-game predictor and finally achieved a real-time success rate of more than 60%. ...
Article
Full-text available
(1) Background and Objective: Major League Baseball (MLB) is one of the most popular international sport events worldwide. Many people are very interest in the related activities, and they are also curious about the outcome of the next game. There are many factors that affect the outcome of a baseball game, and it is very difficult to predict the outcome of the game precisely. At present, relevant research predicts the accuracy of the next game falls between 55% and 62%. (2) Methods: This research collected MLB game data from 2015 to 2019 and organized a total of 30 datasets for each team to predict the outcome of the next game. The prediction method used includes one-dimensional convolutional neural network (1DCNN) and three machine-learning methods, namely an artificial neural network (ANN), support vector machine (SVM), and logistic regression (LR). (3) Results: The prediction results show that, among the four prediction models, SVM obtains the highest prediction accuracies of 64.25% and 65.75% without feature selection and with feature selection, respectively; and the best AUCs are 0.6495 and 0.6501, respectively. (4) Conclusions: This study used feature selection and optimized parameter combination to increase the prediction performance to around 65%, which surpasses the prediction accuracies when compared to the state-of-the-art works in the literature.
... In essence, baseball pitch prediction is a binary classification problem. Previous studies have analyzed whether historical pitches indicate the type of a future pitch [14], [15]. ...
... Discussions of pitching uncertainty rely on the assumption that the pitcher's unpredictability keeps the hitter guessing at bat. A pitcher's performance, including strikeout rate, is explained by pitch-type uncertainty [15] and pitch-type sequence [16][17][18][19][20]. ...
Article
Baseball is a game of numbers. Large-scale baseball data have been utilized by team managers in the decision-making process. Recent studies with PITCHf/x, a system that tracks every pitch, provide new insights on the role of the pitch-type sequence in the pitcher's performance. These studies are based on the assumption that pitch-type uncertainty puts the hitter at a disadvantage. However, pitch-zone uncertainty, another factor of pitching uncertainty, is neglected in many cases. Here, we introduce normalized mutual information between pitch type and pitch zone as an indicator of pitching uncertainty. A pitcher with less of a repertoire with respect to type-zone combinations has high normalized mutual information in pitching. We calculate the pitch type-zone uncertainties for Major League Baseball starting pitchers and compare the results with fielding independent pitching, a metric of pitcher performance. Our analysis shows that normalized mutual information is uncorrelated with performance in nine distinct subgroups extracted from revealed comparative advantage in pitch type. This result underlines the importance of the pitcher's repertoire and ability to be competitive in professional baseball.
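The type-zone indicator described in this abstract can be sketched in a few lines: normalized mutual information is the mutual information between the pitch-type and pitch-zone label sequences, scaled so that 1 means the zone is fully determined by the pitch type. The pitch labels below are illustrative, and the geometric-mean normalization is one common convention (the paper's exact normalization may differ):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_info(xs, ys):
    """Mutual information (in nats) between two paired label sequences."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    joint = Counter(zip(xs, ys))
    # I(X;Y) = sum over (x, y) of p(x,y) * log(p(x,y) / (p(x) p(y)))
    return sum((c / n) * math.log((c * n) / (px[x] * py[y]))
               for (x, y), c in joint.items())

def normalized_mutual_info(xs, ys):
    """I(X;Y) / sqrt(H(X) H(Y)); defined as 0 when either side is constant."""
    hx, hy = entropy(xs), entropy(ys)
    if hx == 0.0 or hy == 0.0:
        return 0.0
    return mutual_info(xs, ys) / math.sqrt(hx * hy)

# A pitcher whose zone is fully determined by pitch type (small repertoire):
types = ["FF", "CU", "FF", "CU", "FF", "CU"]
zones = [1, 9, 1, 9, 1, 9]
print(normalized_mutual_info(types, zones))   # close to 1.0: zone predictable from type

# Types and zones mixed independently: mutual information vanishes.
print(normalized_mutual_info(["FF", "FF", "CU", "CU"], [1, 9, 1, 9]))  # 0.0
```

Under this convention, a high score flags a pitcher whose type-zone combinations are easy to anticipate, which is the uncertainty the abstract contrasts with performance.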
Article
The big data revolution has had an impact on sports analytics as well. Many large corporations have begun to see the financial benefits of integrating sports analytics with big data. Relying on central processing systems to aggregate and analyze large amounts of sport data from many sources compromises the accuracy and timeliness of the data. In response to these issues, distributed systems come to the rescue, and the MapReduce paradigm holds promise for large-scale data analytics. In this paper, we describe a big data architecture based on Docker containers with Apache Spark. We evaluate the architecture on four data-intensive case studies in sports analytics, including structured analysis, streaming, machine-learning approaches, and graph-based analysis.
Article
In the past decade, many data mining studies have been conducted in the sports field. In particular, baseball has become an important subject of data mining due to the wide availability of massive game data. Many researchers have conducted studies to predict pitch types, i.e., fastball, cutter, sinker, slider, curveball, changeup, knuckleball, or a subset of them. In this research, we also develop a system that makes predictions related to pitches in baseball. The major difference between our research and previous work is that our system predicts pitch types and pitch locations at the same time. Pitch location is the place where the pitched ball arrives among imaginary grids drawn in front of the catcher. Another difference is the number of classes to predict. In previous research on predicting pitch types, the number of classes was 2∼7. However, since we also predict pitch locations, our number of classes is 34. We build our prediction system using an ensemble of deep neural networks and describe in detail the process of building it while avoiding overfitting. In addition, the performance of our prediction system in various game situations, such as loss/draw/win, count, and baserunner situations, is presented.
Purpose Predictive analytics and artificial intelligence are perceived as significant drivers to improve organizational performance and managerial decision-making. Hiring employees and contract renewals are instances of managerial decision-making problems that can incur high financial costs and long-term impacts on organizational performance. The primary goal of this study is to identify the Major League Baseball (MLB) free agents who are likely to receive a contract. Design/methodology/approach This study used the design science research paradigm and the cognitive analytics management (CAM) theory to develop the research framework. A dataset on MLB's free agents between 2013 and 2017 was collected. A decision support tool was built using artificial neural networks. Findings There are clear links between a player's statistical performance and the decision of the player to sign a new offered contract. “Age,” “Wins above Replacement” and “the team on which a player last played” are the most significant factors in determining if a player signs a new contract. Originality/value This paper applied analytical modeling to personnel decision-making using the design science paradigm and guided by CAM as the kernel theory. The study employed machine learning techniques, producing a model that predicts the probability of free agents signing a new contract. Also, a web-based tool was developed to help decision-makers in baseball front offices so they can determine which available free agents to offer contracts.
Thesis
HOANG, PHUONG. Supervised Learning in Baseball Pitch Prediction and Hepatitis C Diagnosis. (Under the direction of Hien T. Tran.) Machine learning is so ubiquitous nowadays that one probably uses it multiple times a day without realizing it. For example, it is used in web search engines to improve efficiency, by email providers to identify junk email, and in voice recognition, among others. Machine learning is a powerful tool that can analyze large amounts of data to make actionable predictions. Since machine learning uses algorithms that iterate on data, the quality and quantity of training data are important factors for accurate predictions. In particular, the data available for baseball pitch prediction is huge: millions of observations (pitches), each containing more than fifty features. However, the prediction task restricts researchers to working only with the less-than-ideal features that were measured before the target pitch is thrown. In addition, the presence of noise in pitch-type labels makes it even harder to train classifiers. Meanwhile, the dataset for Hepatitis C is fairly small, with fewer than two hundred observations and 20 features. This disadvantage prevents researchers from removing low-quality observations when building reliable diagnosis models. Hence, prediction problems in the presence of missing features are pervasive in machine learning. This thesis focuses on a number of classification methods and other machine learning tools, and tailors them to address the above issues. First, in the pitch prediction problem, unlike the current method, which suggests a static feature selection algorithm for each pitcher, we propose a novel dynamic feature selection procedure that is shown to be more adaptive for each pitcher in each count. The tradeoff is that the size of the training data is reduced dramatically by pitcher-count data segmentation. Thus, we propose a simple heuristic approach for constructing and selecting features during training that is shown to overcome this tradeoff, which in turn yields considerable improvement in prediction accuracy. In the second part of the thesis, we propose a new learning algorithm for Hepatitis C diagnosis that addresses the important issue of class imbalance. Most existing learning algorithms simply ignore the presence of class imbalance due to the lucrative high accuracy that can easily be attained. The current method suggests combining over-sampling of the minority class with a weighted cost in the support vector machine. Through our research, however, we show that doing both is unnecessary. We employ only the latter, but add a parameter optimization procedure to improve classification performance. Our experimental results show that our proposed method is more accurate and reliable than existing learning methods.
Article
Winning one or two games during a Major League Baseball (MLB) season is often the difference between a team advancing to post-season play or "waiting until next year". Technology advances have made it feasible to augment historical data with in-game contextual data to provide managers immediate insights regarding an opponent's next move, thereby providing a competitive edge. We developed statistical models of pitcher behavior using pitch sequences thrown during three recent MLB seasons (2011–2013). The purpose of these models was to predict the next pitch type, for each pitcher, based on data available at the immediate moment in each at-bat. Independent models were developed for each player's four most frequent pitches. The overall predictability of the next pitch type is 74.5%. Additional analyses of pitcher predictability within specific game situations are discussed. Finally, using linear regression analysis, we show that an index of pitch-sequence predictability may be used to project player performance in terms of Earned Run Average (ERA) and Fielding Independent Pitching (FIP) over a longer term. On a restricted range of the independent variable, reducing complexity in the selection of pitches is correlated with higher values of both FIP and ERA for the players in the sample. Both models were significant at the α = 0.05 level (ERA: p = 0.022; FIP: p = 0.0114). With further development, such models may reduce the risk faced by management in evaluating potential trades, or by scouts assessing unproven emerging talent. Pitchers themselves might benefit from awareness of their individual statistical tendencies and adapt their behavior on the mound accordingly. To our knowledge, the predictive model relating pitch-wise complexity and long-term performance appears to be novel.
Article
If a batter can correctly anticipate the next pitch type, he is in a better position to attack it. That is why batteries worry about having signs stolen or becoming too predictable in their pitch selection. In this paper, we present a machine-learning-based predictor of the next pitch type. This predictor incorporates information that is available to a batter, such as the count, the current game state, and the pitcher's tendency to throw a particular type of pitch. We use a linear support vector machine with a soft margin to build a separate predictor for each pitcher, and use the weights of the linear classifier to interpret the importance of each feature. We evaluated our method using the STATS Inc. pitch dataset, which contains a record of each pitch thrown in both the regular and post seasons. Our classifiers predict the next pitch more accurately than a naïve classifier that always predicts the pitch most commonly thrown by that pitcher. When our classifiers were trained on data from 2008 and tested on data from 2009, they provided a mean improvement in predicting fastballs of 18% and a maximum improvement of 311%. The most useful features in predicting the next pitch were the Pitcher/Batter prior, the Pitcher/Count prior, the previous pitch, and the score of the game.
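As a rough illustration of this approach (with toy features, not the STATS Inc. data), the sketch below trains a soft-margin linear SVM by Pegasos-style subgradient descent on the hinge loss and then reads the learned weights as feature importances. The feature names and pitcher tendency are invented for the example; a production system would use an off-the-shelf solver:

```python
def train_linear_svm(X, y, lam=0.01, lr=0.05, epochs=300):
    """Soft-margin linear SVM via subgradient descent on the hinge loss.
    X: list of feature vectors; y: labels in {-1, +1}
    (+1 = fastball, -1 = off-speed in this toy setup)."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) < 1:
                # Margin violated: step along hinge + regularizer subgradient.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Margin satisfied: only the regularizer shrinks the weights.
                w = [wj * (1 - lr * lam) for wj in w]
    return w, b

# Toy features: [balls - strikes, previous pitch was a fastball (0/1)].
# Hypothetical tendency: this pitcher throws fastballs when behind in the count.
X = [[3, 0], [2, 1], [1, 0], [-1, 1], [-2, 0], [-2, 1]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else -1 for xi in X]
print(preds)   # should reproduce the labels on this separable toy set
print(w)       # a large first weight says the count drives the pitch call
```

Inspecting `w` is the same trick the paper describes: because the classifier is linear, each weight's magnitude and sign directly indicate how strongly that feature pushes the prediction toward fastball or off-speed.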
Article
The purpose of this brief user guide is to introduce the reader to the DIRECT optimization algorithm, describe the type of problems it solves and how to use the accompanying MATLAB program, direct.m, and provide a synopsis of how it searches for the global minimum. An example of DIRECT being used on a test problem is provided, and the motivation for the algorithm is also discussed. The Appendix provides formulas and data for the 7 test functions that were used in (3).
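For readers without MATLAB, the core divide-and-sample idea behind DIRECT can be caricatured in a few lines of Python. This toy 1-D version keeps a set of intervals, each sampled at its center, and greedily trisects the interval with the best center; the real algorithm instead trisects every "potentially optimal" rectangle (balancing interval size against center value under all Lipschitz constants) and works in n dimensions:

```python
def direct_1d(f, lo, hi, iters=60):
    """Toy 1-D cousin of DIRECT: sample each interval at its center and
    repeatedly trisect the interval with the lowest center value.
    Returns (best center, best value).  The full DIRECT algorithm refines
    every potentially optimal rectangle, not just the greedy best one."""
    intervals = [(lo, hi, f((lo + hi) / 2))]
    for _ in range(iters):
        # Greedy choice: refine the interval whose center scored best.
        best = min(range(len(intervals)), key=lambda k: intervals[k][2])
        a, b, _ = intervals.pop(best)
        w = (b - a) / 3
        intervals.extend((a + j * w, a + (j + 1) * w, f(a + j * w + w / 2))
                         for j in range(3))
    a, b, v = min(intervals, key=lambda t: t[2])
    return (a + b) / 2, v

# Minimize (x - 2)^2 on [0, 5]; the sampled centers home in on x = 2
# without ever needing a derivative, the property that motivates DIRECT.
x_best, f_best = direct_1d(lambda x: (x - 2) ** 2, 0.0, 5.0)
print(x_best, f_best)
```

The greedy selection here can stall on multimodal functions, which is exactly why the full algorithm keeps the size-vs-value tradeoff; the sketch only shows the sampling and trisection mechanics.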
A Comparison of Multi-Class SVM Methods
  • B Aisen
Aisen, B., 2006, A Comparison of Multi-Class SVM Methods. URL: http://courses.media.mit.edu/2006fall/mas622j/Projects/aisen-project/.
The Internet cried a little when you wrote that on it, The Hardball Times
  • M Fast
Fast, M., 2010, The Internet cried a little when you wrote that on it, The Hardball Times.
Using Markov Chains to Predict Pitches
  • D Malter
Malter, D., 2016, Using Markov Chains to Predict Pitches. URL: