Content uploaded by Chenwi Neba Cyril
Author content
All content in this area was uploaded by Chenwi Neba Cyril on Nov 03, 2023
Content may be subject to copyright.
Comparative Analysis of Stock Price Prediction Models: Generalized
Linear Model (GLM), Ridge Regression, Lasso Regression,
Elasticnet Regression, and Random Forest –
A Case Study on Netflix
1Cyril Neba C., 2Gillian Nsuh, 3Gerard Shu F., 4Philip Amouda A., 5Adrian Neba F., 6Aderonke Adebisi, 7P. Kibet,
8F. Webnda
1,4,5,6,7,8 Department of Mathematics and Computer Science, Austin Peay State University, Clarksville, Tennessee, USA
2 School of Business Analytics, University of Quinnipiac, Hamden, Connecticut
3 Montana State University, Gianforte School of Computing, Bozeman, Montana, USA
Abstract
The primary objective was to develop a robust model for predicting the adjusted closing price of
Netflix, leveraging historical stock price data sourced from Kaggle. Through in-depth Exploratory
Data Analysis, we examined a dataset of essential daily metrics for 1009 trading days beginning in
February 2018, including opening price, highest price, lowest price, closing price, adjusted closing
price, and trading volume.
Our research aims to provide valuable insights and predictive tools that can assist investors and
market analysts in making informed decisions. The dataset presented a unique challenge, featuring
a diverse mix of quantitative and categorical variables, making it an ideal candidate for a
Generalized Linear Model (GLM). To address the characteristics of the data, we employed a GLM
with a Gamma family and a log link function, a suitable choice for modeling positive
continuous data with right-skewed distributions. The study also expands beyond the GLM
framework by incorporating Ridge Regression, Lasso Regression, Elasticnet Regression, and
Random Forest models, enabling a comprehensive comparison of their predictive capabilities.
Based on the RMSE values, including the Volume variable did not significantly improve the
performance of the model in predicting Netflix stock prices. However, the difference between the
RMSE values of the two models was small and may not be practically significant. Therefore, it was
reasonable to keep the Volume variable in the model as it could potentially be a useful predictor in
other scenarios. The analysis of the five models used for predicting the Netflix stock price based on
the Root Mean Squared Error (RMSE) showed that the Lasso model performed the best. The Elastic Net
model had the second-best performance, then the Ridge model, followed by the Random Forest
Model and finally the GLM model. Overall, all five models demonstrated some level of accuracy in
predicting the stock price, but the Lasso and Elastic Net models stood out with the best
performance. These findings can be useful in guiding investment decisions and risk management
strategies in the stock market.
Keywords:- Stock Price Prediction, Generalized Linear Model (GLM), Ridge Regression, Lasso
Regression, Elasticnet Regression, Random Forest, RMSE, Netflix
1. Background
The stock market plays a pivotal role in the United States' economy, acting as both a barometer
of economic health and a vital driver of economic growth [4]. It serves as a mechanism for
companies to raise capital for expansion, innovation, and job creation. Additionally, it offers
opportunities for individuals to invest and grow their wealth. The stock market is integral to
various aspects of the economy, influencing interest rates, investment decisions, and overall
economic stability [5].
Moreover, the stock market reflects investor sentiment and economic conditions, with indices like
the Dow Jones Industrial Average and the S&P 500 providing insights into market performance
and economic prospects. A thriving stock market often correlates with a robust economy,
increasing consumer confidence and fostering economic growth [6].
However, predicting stock prices in this dynamic environment is challenging. Researchers have
explored various methods, including machine learning techniques, to forecast stock prices
accurately. These efforts aim to provide investors, financial institutions, and policymakers with
valuable insights into market trends and potential risks [7]. Similar machine learning models have
been applied in other domains, such as credit card fraud detection [12] and the prediction of deaths
caused by ambient ozone pollution in the United States [13].
Stock price prediction is a multifaceted task involving the analysis of historical data, market
sentiment, and macroeconomic factors. Machine learning models, such as artificial neural
networks and support vector machines, have been employed to capture complex patterns in
stock price movements [8], [9]. Additionally, models like regime-switching GARCH have been
used to forecast market volatility [10].
The importance of accurate stock price prediction cannot be overstated. Investors rely on
forecasts to make informed decisions regarding buying, selling, or holding stocks. Financial
institutions use these predictions to manage portfolios and assess risk. Moreover, policymakers
monitor stock market trends as part of their economic policymaking.
The stock market therefore holds a central position in the United States' economic landscape,
influencing economic growth, investor sentiment, and economic policies. Predicting stock prices
is a crucial endeavor, and machine learning techniques have emerged as valuable tools for
providing insights into market behavior. These predictions empower investors, financial
institutions, and policymakers to navigate the complex world of stock markets with greater
confidence.
The stock market has consistently held the attention of investors, traders, and analysts due to its
significant influence on financial matters. Gaining insights into the intricacies of stock market
dynamics and formulating forecasts about its future performance are essential for making well-
informed investment choices. Recent years have witnessed a transformation in this arena,
thanks to the availability of extensive datasets and the advancement of sophisticated statistical
models. These developments have not only simplified the process of analyzing stock market
data but have also paved the way for the creation of predictive models that hold the potential to
optimize investment strategies and risk mitigation.
2. Methodology
Our project revolves around the development and comparison of predictive models for
forecasting the adjusted closing price of Netflix, drawing from historical stock data available on
Kaggle [11]. The dataset provides daily records beginning in February 2018, including
opening and closing prices, high and low prices, adjusted closing
prices, and trading volumes for each trading day.
At the heart of our exploration lie several sophisticated regression models and a formidable
machine learning technique, each poised to reveal insights into Netflix's stock price dynamics.
i. Generalized Linear Model (GLM):
The GLM accommodates both quantitative and categorical predictors, offering a
comprehensive view of Netflix's stock price movements. Fitted in the R programming
language with the glm function, the GLM serves as the foundation of our
predictive analysis. Its performance is evaluated using established metrics
such as Mean Squared Error and R-squared. The insights derived from the GLM offer
investors and market analysts valuable tools for understanding stock price behavior.
ii. Ridge Regression:
Ridge Regression, a variant of linear regression, introduces regularization to the model. It is
particularly useful when dealing with multicollinearity, a common issue in financial datasets. By
adding a penalty term, Ridge Regression helps prevent overfitting and provides a more stable
model.
iii. Lasso Regression:
Lasso Regression, another member of the linear regression family, is renowned for its feature
selection capabilities. It can identify the most influential predictors in the dataset and assign them
appropriate weights, promoting a simpler and more interpretable model.
iv. Elastic Net Regression:
Elastic Net Regression combines the strengths of Ridge and Lasso Regression. It provides a
balance between feature selection and regularization, making it adaptable to a wide range of
datasets. In our project, it aids in creating a model that is both interpretable and robust.
v. Random Forest:
Random Forest, a powerful ensemble learning technique, stands as a formidable addition to our
arsenal. Comprising a multitude of decision trees, it harnesses collective wisdom to deliver highly
accurate predictions. Its ability to capture complex interactions and nonlinear relationships within
the data adds depth and adaptability to our predictive modeling efforts.
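The three penalized regressions described above differ only in the penalty added to the loss function. As a sketch, assuming a squared-error loss and the parameterization used by the glmnet package (the paper does not state which parameterization was used), with penalty strength \lambda \ge 0 and mixing weight \alpha \in [0, 1]:

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
  \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2
  + \lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\right]
```

Setting \alpha = 0 gives Ridge Regression (coefficients are shrunk but stay non-zero), \alpha = 1 gives Lasso Regression (some coefficients can be set exactly to zero), and intermediate values give Elastic Net Regression.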
By subjecting these diverse models to rigorous analysis and comparison, our project aims to
unravel the forces governing Netflix's stock price. These predictive tools, including GLM, Ridge
Regression, Lasso Regression, Elastic Net Regression, and Random Forest, are poised to
illuminate Netflix's future stock performance, offering invaluable insights to investors and
analysts alike.
Our dataset combines quantitative and categorical predictors. The
quantitative predictors are the opening price, the high and low prices, and the trading volume,
while the date is treated as a categorical predictor, introducing a temporal dimension to our dataset.
In terms of the response distribution, we use the Gamma family coupled with a log link function.
This choice, grounded in statistical theory and affirmed by financial practice,
holds relevance for modeling positively skewed continuous data, a characteristic trait often
exhibited in financial data, including stock prices, asset returns, and exchange rates
[3].
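Written out, a Gamma GLM with a log link takes the following form. This is a sketch that assumes the four quantitative predictors listed in the Model 1 summary; \mu_i denotes the expected adjusted closing price on day i and \phi the dispersion parameter.

```latex
\begin{aligned}
Y_i &\sim \mathrm{Gamma}(\mu_i, \phi), \qquad \operatorname{Var}(Y_i) = \phi\,\mu_i^{2},\\
\log \mu_i &= \beta_0 + \beta_1\,\mathrm{Open}_i + \beta_2\,\mathrm{High}_i
             + \beta_3\,\mathrm{Low}_i + \beta_4\,\mathrm{Volume}_i .
\end{aligned}
```

Because the linear predictor acts on \log \mu_i, fitted prices are always positive, and the variance grows with the square of the mean, which suits right-skewed price data.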
2.1. Data preparation
The dataset was loaded into RStudio and explored to inspect its structure
and dimensions, which revealed that it is composed of 7 variables (columns) and 1009
observations (rows). Inspecting the dataset also revealed that there are no missing values, as
shown in Figure 1 below.
Figure 1: Plot showing Missing Values in the Dataset
2.2. Exploratory Data Analysis (EDA)
i. Checking the data for normality and linearity
Scatterplots
Figure 2: Scatterplot Showing Linearity
Scatterplots serve as valuable tools to assess whether a linear relationship exists between each
predictor variable and the response variable. When the data points on the plot fall
close to a straight line, the relationship is linear. Conversely, if the points
follow a curved pattern, the relationship is non-linear. The scatterplots in Figure 2
show a predominantly linear association between each
predictor variable and the response variable.
Normal Probability Plot of the Residuals
Figure 3: Plot of Normality
The normal probability plot of residuals helps assess whether the
residuals from the linear model are normally distributed. A straight-line pattern in the plot
suggests that the residuals are normally distributed. Conversely, if the residuals deviate
systematically from the line, their distribution is non-normal. In the normal probability plot
in Figure 3, the residuals approximately follow a normal distribution,
albeit with some departure from the line at the extremes. Taken together with the scatterplots,
this suggests that the linearity condition is satisfied, while the normality condition is
somewhat violated, a common occurrence in stock price analysis.
Histogram Plots for Each Variable
Figure 4: Histogram plot for each variable
Upon examining the above histogram plots, it becomes apparent that they display a mild to
moderate right skewness, a common attribute observed in stock price datasets.
2.3. Predicting Netflix Adjusted Closing Price Using a GLM Model
Initially, we divided the dataset into training and testing subsets and then
fitted a GLM using the Gamma family and a log link function.
In this model, we departed from the assumption of normality because of the right-skewed
nature of the data, as evident in the histograms shown previously. To address this
departure, we modeled the Netflix stock price data using a Gamma
distribution, which suits data that are strictly positive and asymmetric.
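The split-and-fit step can be sketched in base R as follows. The data here are synthetic stand-ins (the real file comes from Kaggle), the column names follow those used in the paper, and the 80/20 split ratio and random seed are assumptions, since the paper does not report them.

```r
# Synthetic stand-in for the Kaggle data (assumption: not the real prices)
set.seed(42)
n <- 300
Open <- runif(n, 250, 700)
High <- Open + runif(n, 0, 15)
Low <- Open - runif(n, 0, 15)
Volume <- round(runif(n, 1e6, 2e7))
Adj.Close <- (High + Low) / 2 + rnorm(n, sd = 3)
stock <- data.frame(Open, High, Low, Volume, Adj.Close)

# 80/20 train/test split (the ratio is an assumption)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- stock[train_idx, ]
test <- stock[-train_idx, ]

# Gamma family with a log link, using all four quantitative predictors
model_1 <- glm(Adj.Close ~ Open + High + Low + Volume,
               family = Gamma(link = "log"), data = train)

# Predict on the response (price) scale, not the link (log) scale
pred <- predict(model_1, newdata = test, type = "response")
rmse_test <- sqrt(mean((test$Adj.Close - pred)^2))
print(rmse_test)
```

Requesting type = "response" matters here: the default for predict on a glm object is the link scale, which for a log link returns log prices, and error metrics computed on that scale are not comparable with prices.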
i. Model 1
The summary output provides the estimated coefficient for each predictor variable,
together with its standard error, t-value, and p-value. The intercept has an
estimated value of 5.034, which is statistically significant at the 0.001 level. The
estimated coefficient for "Open" is -0.001447, significant at the 0.01 level.
The coefficients for "High" and "Low" are positive, with estimated values of
0.002271 and 0.001507, and are statistically significant at the 0.001 and 0.01 levels,
respectively. However, the coefficient for "Volume" is not significant, with an
estimated value of -1.000e-09 and a p-value of 0.23470.
This model summary gives the coefficient of each variable, along with its
standard error, t-value, and p-value. These values help in interpreting
the relationship between each variable and the Netflix adjusted closing price. Because the
model uses a log link, each coefficient acts on the logarithm of the expected price. For
instance, the negative coefficient on the "Open" variable indicates that an increase in the Open
price is associated with a decrease in the expected Adj.Close price, assuming all other variables
remain constant. Similarly, the positive coefficient on the "High" variable implies that as
the High price rises, the expected Adj.Close price also rises, holding other variables steady.
Because the "Volume" coefficient is not significant, we exclude the Volume variable
and construct a second model to assess potential
improvements.
ii. Model 2
2.4. Comparing Model 1 and Model 2
Upon examining the above outputs, both the first model (Model 1) and
the second model (Model 2) have an identical AIC value of 1600. However, Model 2 has a
lower BIC of 1616, compared with 1620 for Model 1. Since lower BIC values are preferred,
Model 2 offers the better trade-off between goodness of fit and model complexity.
A model with a higher log-likelihood (loglik) fits the observed data more closely than a model
with a lower log-likelihood. The log-likelihood is a measure of goodness of fit: it quantifies
how probable the observed data are under the model's assumptions. Here, Model 1 has a
log-likelihood of -794, marginally higher than the -795 of Model 2. This is expected, because
Model 1 contains every predictor in Model 2 plus Volume, and adding a predictor generally does
not lower the maximized log-likelihood. The gain of about one unit is too small to offset the
penalty for the extra parameter, which is why the AIC values coincide and the BIC favors
Model 2.
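The reported AIC values can be checked against the reported log-likelihoods. In this sketch, \ell is the maximized log-likelihood, n the number of training observations, and k the number of estimated parameters. The counts k = 6 for Model 1 and k = 5 for Model 2 are an assumption: they include the intercept, the slopes, and one dispersion parameter, which is how R counts parameters for a Gamma GLM.

```latex
\begin{aligned}
\mathrm{AIC} &= 2k - 2\ell, \qquad \mathrm{BIC} = k\ln n - 2\ell,\\
\mathrm{AIC}_{1} &\approx 2(6) - 2(-794) = 1600, \qquad
\mathrm{AIC}_{2} \approx 2(5) - 2(-795) = 1600 .
\end{aligned}
```

Both values agree with the reported AIC of 1600, up to rounding of the log-likelihoods. Since \ln n > 2 for any n > 7, BIC charges more than AIC for each additional parameter, which is consistent with BIC separating two models that AIC ranks as equal.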
Comparing the RMSE Values for the Two Models
The RMSE for the model excluding the Volume variable (Model 2) is 423.45864012155,
marginally lower than the RMSE of 423.45864568846 for the model including the
Volume variable (Model 1). This difference is extremely small and likely lacks
practical significance. We therefore conclude that omitting the Volume variable
does not yield a substantial improvement in performance. As a result, we continue to
use the model with all variables.
2.5. Predicted Netflix Stock Prices
2.6. Calculating the RMSE, R-squared Value, and MAE by Means of
Cross-Validation
i. Perform cross-validation
The RMSE (root mean squared error) is relatively small, at 15.27758. This
implies that the model's forecasts align closely with the actual values, with a typical
error of approximately 15 price units.
An R-squared value of 0.9834965 underscores the model's adeptness in conforming to the
dataset. R-squared serves as an indicator of how well the model elucidates the variability in
the outcome variable, with values converging toward 1 denoting a superior fit. In this instance,
the R-squared figure nearly approaches 1, signifying that the model expounds upon a
substantial portion of the variability within the outcome variable.
The MAE (mean absolute error) also registers as relatively modest, measuring 10.71222. This
metric reflects the average distinction between predicted and actual values, and a lower MAE
signifies that the model's predictions exhibit a greater degree of precision.
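The three metrics and a k-fold cross-validation loop can be sketched in base R as follows. This is an illustration only: the data are synthetic, the number of folds (5) is an assumption, and R-squared is computed as the squared correlation between observed and predicted values, which is one common definition and may differ from the one used by the authors.

```r
# Error metrics computed on the response (price) scale
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae <- function(actual, predicted) mean(abs(actual - predicted))
r2 <- function(actual, predicted) cor(actual, predicted)^2

# Synthetic stand-in data (assumption: not the real Netflix prices)
set.seed(1)
n <- 200
Open <- runif(n, 250, 700)
stock <- data.frame(
  Open = Open,
  High = Open + runif(n, 0, 15),
  Low = Open - runif(n, 0, 15)
)
stock$Adj.Close <- (stock$High + stock$Low) / 2 + rnorm(n, sd = 3)

# Assign each row to one of k folds at random
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# Fit on k - 1 folds, evaluate on the held-out fold
cv <- sapply(seq_len(k), function(i) {
  fit <- glm(Adj.Close ~ Open + High + Low,
             family = Gamma(link = "log"), data = stock[fold != i, ])
  pred <- predict(fit, newdata = stock[fold == i, ], type = "response")
  actual <- stock$Adj.Close[fold == i]
  c(RMSE = rmse(actual, pred), Rsquared = r2(actual, pred), MAE = mae(actual, pred))
})

# Average each metric over the folds
cv_summary <- rowMeans(cv)
print(cv_summary)
```

MAE can never exceed RMSE on the same predictions, because squaring gives large errors more weight. The reported pair of 10.71 and 15.28 is consistent with that ordering.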
2.7. Analyzing the Model Parameters Using Exponentiated Coefficients (Odds Ratios) and
Calculating Their 95% Confidence Intervals
Interpreting the Exponentiated Coefficients
Because the model uses a log link rather than a logit link, the exponentiated coefficients
reported here as odds ratios are best read as multiplicative effects on the expected Adj.Close
price, holding the other predictors constant.
A one-unit increase in the Open price multiplies the expected Adj.Close price by 0.9985536,
a decrease of about 0.14%.
A one-unit increase in the High price multiplies the expected Adj.Close price by 1.0022734,
an increase of about 0.23%.
A one-unit increase in the Low price multiplies the expected Adj.Close price by 1.0015079,
an increase of about 0.15%.
A one-unit increase in the Volume of Netflix shares multiplies the expected Adj.Close price
by 1.0000000 (to seven decimal places), that is, it leaves it essentially unchanged.
3. Applying Regularized GLM Models (Ridge, Lasso, and Elastic Net Regression) for
Forecasting Netflix Stock Prices and Assessing Their Performance in Comparison to
GLM Model_1.
Before building the models, we remove the date column. Although it does not
directly contribute to predicting Netflix stock prices in this dataset, it can still hold value for time
series analysis [14] or the generation of temporal features. For prediction purposes, however, we
exclude it from the dataset.
Developing Regularized GLM Models (Ridge Regression, Lasso Regression, and
Elasticnet Regression)
i. Ridge Regression
ii. Lasso regression
iii. Elasticnet regression
• Developing Random Forest
library(randomForest)
rf_model <- randomForest(Adj.Close ~ Open + High + Low + Volume, data = train, ntree = 100)
[1] "Root Mean Squared Error (RMSE): 6.4001203741829"
3.1. Comparing RMSE of All the Models
i. GLM 14.303384
ii. Ridge 5.704350
iii. Lasso 3.249638
iv. Elastic Net 3.663382
v. Random Forest 6.4001203741829
The Lasso model stands out with the lowest RMSE of 3.249638, signifying its superior
predictive performance among the five models.
Following closely, the Elastic Net model also exhibits a low RMSE of 3.663382, securing
its position as the second-best performer among the five models.
In contrast, the Ridge model lags with a higher RMSE of 5.704350, indicating
comparatively weaker predictive performance when compared to the Lasso and Elastic
Net models.
Similarly, the Random Forest model presents a relatively high RMSE of
6.4001203741829.
Lastly, the GLM model trails behind with the highest RMSE of 14.303384, suggesting the
least effective predictive performance among the five models.
Consequently, based on the RMSE values, the Lasso model emerges as the top-
performing model for forecasting the Netflix stock price, followed by the Elastic Net model,
the Ridge model, the Random Forest model, and lastly the GLM model.
Overall, all five models demonstrate accurate predictions, albeit with varying degrees of
precision.
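The ranking can be reproduced directly from the reported RMSE values. The following sketch sorts the five models and expresses each RMSE as a fraction of the GLM's; the variable names are illustrative.

```r
# RMSE values as reported for the five models
rmse_values <- c(GLM = 14.303384, Ridge = 5.704350, Lasso = 3.249638,
                 `Elastic Net` = 3.663382, `Random Forest` = 6.4001203741829)

# Sort from best (lowest RMSE) to worst
ranking <- sort(rmse_values)

# Each model's RMSE as a fraction of the GLM's RMSE
relative_to_glm <- round(ranking / rmse_values["GLM"], 3)

print(ranking)
print(relative_to_glm)
```

On these figures the Lasso's RMSE is roughly 23% of the GLM's, which puts the size of the gap between the best and worst model in perspective.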
4. Conclusion
Considering the RMSE values, it appears that the inclusion of the Volume variable did not
substantially enhance the model's performance when predicting Netflix stock prices.
Nevertheless, the disparity in RMSE values between the two models is minimal and may not hold
practical significance. Therefore, retaining the Volume variable within the model remains
reasonable, as it may serve as a valuable predictor in other contexts.
For instance, in high-frequency trading scenarios, where stocks change hands in seconds,
trading volume can offer crucial insights into market sentiment and influence stock prices, as
highlighted by [1] and [2]. In such cases, incorporating the Volume variable into the prediction
model can effectively capture the impact of trading volume on stock prices, resulting in more
accurate predictions. Moreover, in situations where investors intend to trade substantial stock
blocks, trading volume can impact stock liquidity, subsequently affecting its price. Hence,
preserving the Volume variable in a financial prediction model holds significance, particularly in
scenarios where trading volume plays a pivotal role in stock price dynamics.
Analyzing the five models used to predict Netflix stock prices based on Root Mean Squared
Errors (RMSE), it becomes evident that the Lasso model demonstrated superior performance,
boasting the lowest RMSE. Following closely is the Elastic Net model, followed by the Ridge
model, then the Random Forest model, with the GLM model lagging behind. Overall, all five
models exhibited a degree of accuracy in forecasting stock prices, with the Lasso and Elastic Net
models excelling. These insights can prove valuable in guiding investment decisions and
formulating risk management strategies within the stock market.
References
[1] Charles Schwab (2021). "Trading Volume as a Market Indicator."
https://www.schwab.com/learn/story/trading-volume-as-market-indicator
[2] Fidelity (2022). "Turn Up the Volume on Stocks."
https://www.fidelity.com/viewpoints/active-investor/stock-volume
[3] Kissell, R., & Poserina, J. (2017). Advanced Math and Statistics.
Optimal Sports Math, Statistics, and Fantasy, 103–135.
doi:10.1016/b978-0-12-805163-4.00004-9
[4] Jayachandran, S. (2021). The Importance of the Stock Market to the
U.S. Economy. Journal of Finance and Marketing, 10(5).
[5] Malkiel, B. G. (2003). The Efficient Market Hypothesis and Its
Critics. Journal of Economic Perspectives, 17(1), 59-82.
[6] McMillan, J. (2020). Stock Markets Can Indicate How the Economy Is
Doing. The Balance.
[7] Kim, H., & Han, I. (2000). Genetic algorithms approach to feature
discretization in artificial neural networks for the prediction of
stock price index. Expert Systems with Applications, 19(2), 125-132.
[8] Zhang, G., Patuwo, B. E., & Hu, M. Y. (1998). Forecasting with
artificial neural networks: The state of the art. International
Journal of Forecasting, 14(1), 35-62.
[9] Yao, J., Zhang, L., & Yoo, J. (2005). Forecasting stock market
movement direction with support vector machine. Computers &
Operations Research, 32(10), 2513-2522.
[10] Hong, L., & Yoon, J. (2012). Modeling and forecasting the
volatility of the BRIC stock markets: A regime-switching GARCH
model. Emerging Markets Review, 13(2), 181-198.
[11] Data Source:
https://www.kaggle.com/datasets/jainilcoder/netflix-stock-price-prediction/code
[12] Cyril, N. C., Gerard, S. F., Adrian, N. F., Aderonke, A., P.
Kibet, F. Webnda, and Philip, A. A. (2023). "Enhancing Credit Card
Fraud Detection with Regularized Generalized Linear Models: A
Comparative Analysis of Down-Sampling and Up-Sampling Techniques."
International Journal of Innovative Science and Research
Technology (IJISRT), Volume 8(9), 1841-1866. DOI:
10.5281/zenodo.8413849.
[13] Cyril, N. C., Gerard, S. F., Adrian, N. F., Aderonke, A., P.
Kibet, F. Webnda, and Philip, A. A. (2023). "Using Regression
Models to Predict Death Caused by Ambient Ozone Pollution (AOP) in
the United States." International Journal of Innovative Science
and Research Technology (IJISRT), Volume 8(9), 1867-1884. DOI:
10.5281/zenodo.8414044.
[14] Cyril, N. C., Gerard, S. F., Gillian, N., Philip, A. A., Adrian,
N. F., Aderonke, A., P. Kibet, and F. Webnda. (2023). "Time Series
Analysis and Forecasting of COVID-19 Trends in Coffee County,
Tennessee, United States." International Journal of Innovative
Science and Research Technology (IJISRT), Volume 8(9), 2358-2371.
DOI: 10.5281/zenodo.10005806.