Comparative Analysis of Stock Price Prediction Models: Generalized
Linear Model (GLM), Ridge Regression, Lasso Regression,
Elasticnet Regression, and Random Forest –
A Case Study on Netflix
1 Cyril Neba C., 2 Gillian Nsuh, 3 Gerard Shu F., 4 Philip Amouda A., 5 Adrian Neba F., 6 Aderonke Adebisi, 7 P. Kibet,
8 F. Webnda
1,4,5,6,7,8 Department of Mathematics and Computer Science, Austin Peay State University, Clarksville, Tennessee, USA
2 School of Business Analytics, Quinnipiac University, Hamden, Connecticut, USA
3 Gianforte School of Computing, Montana State University, Bozeman, Montana, USA
Abstract
The primary objective was to develop a robust model for predicting the adjusted closing price of
Netflix, leveraging historical stock price data sourced from Kaggle. Through in-depth Exploratory
Data Analysis, we examined a dataset encompassing essential daily metrics for February 2018,
including opening price, highest price, lowest price, closing price, adjusted closing price, and trading
volume.
Our research aims to provide valuable insights and predictive tools that can assist investors and
market analysts in making informed decisions. The dataset presented a unique challenge, featuring
a diverse mix of quantitative and categorical variables, making it an ideal candidate for a
Generalized Linear Model (GLM). To address the characteristics of the data, we employed a GLM
with a Gamma family and a log link function, a suitable choice for modeling positive
continuous data with right-skewed distributions. The study also expands beyond the GLM
framework by incorporating Ridge Regression, Lasso Regression, Elasticnet Regression, and
Random Forest models, enabling a comprehensive comparison of their predictive capabilities.
Based on the RMSE values, including the Volume variable did not significantly improve the
performance of the model in predicting Netflix stock prices. However, the difference between the
RMSE values of the two models was small and may not be practically significant. Therefore, it was
reasonable to keep the Volume variable in the model as it could potentially be a useful predictor in
other scenarios. The analysis of the five models used for predicting the Netflix stock price based on
the Root Mean Squared Error (RMSE) showed that the Lasso model performed the best. The Elastic Net
model had the second-best performance, then the Ridge model, followed by the Random Forest
Model and finally the GLM model. Overall, all five models demonstrated some level of accuracy in
predicting the stock price, but the Lasso and Elastic Net models stood out with the best
performance. These findings can be useful in guiding investment decisions and risk management
strategies in the stock market.
Keywords: Stock Price Prediction, Generalized Linear Model (GLM), Ridge Regression, Lasso
Regression, Elasticnet Regression, Random Forest, RMSE, Netflix
1. Background
The stock market plays a pivotal role in the United States' economy, acting as both a barometer
of economic health and a vital driver of economic growth [4]. It serves as a mechanism for
companies to raise capital for expansion, innovation, and job creation. Additionally, it offers
opportunities for individuals to invest and grow their wealth. The stock market is integral to
various aspects of the economy, influencing interest rates, investment decisions, and overall
economic stability [5].
Moreover, the stock market reflects investor sentiment and economic conditions, with indices like
the Dow Jones Industrial Average and the S&P 500 providing insights into market performance
and economic prospects. A thriving stock market often correlates with a robust economy,
increasing consumer confidence and fostering economic growth [6].
However, predicting stock prices in this dynamic environment is challenging. Researchers have
explored various methods, including machine learning techniques, to forecast stock prices
accurately. These efforts aim to provide investors, financial institutions, and policymakers with
valuable insights into market trends and potential risks [7]. Similar machine learning models have
been used on other domains such as credit Card Fraud Detection [12] and Prediction of Death
caused by Ambient Ozone Pollution in the United States [13].
Stock price prediction is a multifaceted task involving the analysis of historical data, market
sentiment, and macroeconomic factors. Machine learning models, such as artificial neural
networks and support vector machines, have been employed to capture complex patterns in
stock price movements [8];[9]. Additionally, models like regime-switching GARCH have been
used to forecast market volatility [10].
The importance of accurate stock price prediction cannot be overstated. Investors rely on
forecasts to make informed decisions regarding buying, selling, or holding stocks. Financial
institutions use these predictions to manage portfolios and assess risk. Moreover, policymakers
monitor stock market trends as part of their economic policymaking.
The stock market therefore holds a central position in the United States' economic landscape,
influencing economic growth, investor sentiment, and economic policies. Predicting stock prices
is a crucial endeavor, and machine learning techniques have emerged as valuable tools for
providing insights into market behavior. These predictions empower investors, financial
institutions, and policymakers to navigate the complex world of stock markets with greater
confidence.
The stock market has consistently held the attention of investors, traders, and analysts due to its
significant influence on financial matters. Gaining insights into the intricacies of stock market
dynamics and formulating forecasts about its future performance are essential for making well-
informed investment choices. Recent years have witnessed a transformation in this arena,
thanks to the availability of extensive datasets and the advancement of sophisticated statistical
models. These developments have not only simplified the process of analyzing stock market
data but have also paved the way for the creation of predictive models that hold the potential to
optimize investment strategies and risk mitigation.
2. Methodology
Our project revolves around the development and comparison of predictive models for
forecasting the adjusted closing price of Netflix, drawing from historical stock data available on
Kaggle. This dataset furnishes us with a comprehensive snapshot of February 2018, inclusive of
pivotal indicators such as opening and closing prices, high and low points, adjusted closing
prices, and trading volumes for each trading day.
At the heart of our exploration lie several sophisticated regression models and a formidable
machine learning technique, each poised to reveal insights into Netflix's stock price dynamics.
i. Generalized Linear Model (GLM):
The GLM stands at the crossroads of quantitative and categorical predictors, promising a
comprehensive view of Netflix's stock price movements. Rooted in the versatile R programming
language and powered by the glm function, the GLM model will serve as the foundation of our
predictive analysis. Its performance will be meticulously evaluated using established metrics
such as Mean Squared Error and R-squared. The insights derived from the GLM model offer
investors and market analysts valuable tools for understanding stock price behavior.
ii. Ridge Regression:
Ridge Regression, a variant of linear regression, introduces regularization to the model. It is
particularly useful when dealing with multicollinearity, a common issue in financial datasets. By
adding a penalty term, Ridge Regression helps prevent overfitting and provides a more stable
model.
iii. Lasso Regression:
Lasso Regression, another member of the linear regression family, is renowned for its feature
selection capabilities. It can identify the most influential predictors in the dataset and assign them
appropriate weights, promoting a simpler and more interpretable model.
iv. Elastic Net Regression:
Elastic Net Regression combines the strengths of Ridge and Lasso Regression. It provides a
balance between feature selection and regularization, making it adaptable to a wide range of
datasets. In our project, it aids in creating a model that is both interpretable and robust.
v. Random Forest:
Random Forest, a powerful ensemble learning technique, stands as a formidable addition to our
arsenal. Comprising a multitude of decision trees, it harnesses collective wisdom to deliver highly
accurate predictions. Its ability to capture complex interactions and nonlinear relationships within
the data adds depth and adaptability to our predictive modeling efforts.
By subjecting these diverse models to rigorous analysis and comparison, our project aims to
unravel the forces governing Netflix's stock price. These predictive tools, including GLM, Ridge
Regression, Lasso Regression, Elastic Net Regression, and Random Forest, are poised to
illuminate Netflix's future stock performance, offering invaluable insights to investors and
analysts alike.
Our dataset is a medley of predictors, marrying the realms of quantity and category. The
quantitative predictors encompass opening prices, high and low points, and trading volumes,
while the categorical predictor is the date, introducing a temporal dimension to our dataset.
In terms of the response distribution, the Gamma family, coupled with a log link function,
takes center stage. This choice, grounded in statistical theory and affirmed by financial
practice, is well suited to modeling positively skewed continuous data, a trait commonly
exhibited in financial series such as stock prices, asset returns, and exchange rates [3].
2.1. Data preparation
The dataset was uploaded into the RStudio environment and explored to determine its structure
and dimensions, which revealed that it is composed of 7 variables (columns) and 1009 rows
(observations). Inspecting the dataset also revealed that there are no missing values, as
shown in Figure 1 below.
Figure 1: Plot showing Missing Values in the Dataset
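This inspection step can be sketched as follows (the file name `NFLX.csv` and the data-frame name `netflix` are our assumptions, not taken from the paper):

```r
# Read the Kaggle Netflix file and inspect its structure
netflix <- read.csv("NFLX.csv")

str(netflix)               # types of the 7 columns
dim(netflix)               # 1009 rows x 7 columns

# Missing values per column; all zeros means a complete dataset
colSums(is.na(netflix))
```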
2.2. Exploratory Data Analysis (EDA)
i. Checking the data for normality and linearity
Scatterplots
Figure 2: Scatterplot Showing Linearity
Scatterplots serve as valuable tools to assess the existence of a linear relation between each
predictor variable and the response variable. When the data points on the plot are evenly
distributed along a straight line, it signifies a linear relationship. Conversely, if the points
create a curved pattern, it indicates a non-linear relationship. Upon analyzing the scatterplots
above, it becomes evident that there exists a predominantly linear association between each
predictor variable and the response variable.
Normal Probability Plot of the Residuals
Figure 3: Plot of Normality
The normal probability plot of residuals aids in assessing the normal distribution of
residuals derived from the linear model. A straight-line pattern in the plot suggests that the
residuals exhibit normal distribution. Conversely, if the residuals systematically deviate from
the line, it indicates non-normal distribution. Upon examining the normal probability plot
above, it becomes evident that the residuals approximately adhere to normal distribution,
albeit with some departure from the line at the extremes. This signifies that the conditions
for linearity are satisfied, but the conditions for normality are somewhat violated, a common
occurrence in stock price analysis.
Histograms Plot for each variable
Figure 4: Histogram plot for each variable
Upon examining the above histogram plots, it becomes apparent that they display a mild to
moderate right skewness, a common attribute observed in stock price datasets.
2.3. Predicting Netflix Adjusted Closing Price Using a GLM Model
Initially, we divided the dataset into training and testing subsets and subsequently proceeded to
establish a GLM model employing the gamma family and a log link function.
In this particular model, we deviated from the assumption of normality because of the
right-skewed nature of the dataset, as evident in the previously shown histograms. To
address this departure, we modeled the Netflix stock price data with a Gamma distribution,
which, like the data, is positive and right-skewed.
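A minimal sketch of this step, assuming the data frame is named `netflix` and using an illustrative 80/20 split (the paper does not report its split ratio, and the seed is our choice):

```r
set.seed(123)  # reproducible split
idx   <- sample(seq_len(nrow(netflix)), size = round(0.8 * nrow(netflix)))
train <- netflix[idx, ]
test  <- netflix[-idx, ]

# Model 1: Gamma family with log link, all four quantitative predictors
model1 <- glm(Adj.Close ~ Open + High + Low + Volume,
              data = train, family = Gamma(link = "log"))
summary(model1)
```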
i. Model 1
The summary output furnishes us with estimated coefficients for each predictor variable,
accompanied by their standard errors, t-values, and p-values. The intercept exhibits an
estimated value of 5.034, which holds statistical significance at the 0.001 level. Meanwhile, the
estimated coefficient for "Open" stands at -0.001447, signifying significance at the 0.01 level.
Conversely, the coefficients for "High" and "Low" portray positivity and hold statistical
significance at the 0.001 level and 0.01 level, respectively. Specifically, "High" and "Low"
possess estimated values of 0.002271 and 0.001507, respectively. However, the coefficient
pertaining to "Volume" lacks significance, featuring an estimated value of -1.000e-09 and a
p-value of 0.23470.
This model summary equips us with the coefficients of each variable, their corresponding
standard errors, t-values, and p-values. These values serve as instrumental tools for deciphering
the relationship between each variable and the Netflix stock price. For instance, a negative
coefficient associated with the "Open" variable signifies that an increase in the Open price is
anticipated to result in a decrease in the Close price, assuming all other variables remain
constant. In a similar vein, a positive coefficient attributed to the "High" variable implies that as
the High price ascends, the Close price is expected to rise, holding other variables steady.
Please note that given the lack of significance in the "Volume" coefficient, we will attempt
to exclude the volume variable and construct another model to assess potential
improvements.
ii. Model 2
2.4. Comparing Model 1 and Model 2
Upon examining the above outputs, it becomes evident that both the first model (Model 1) and
the second model (Model 2) exhibit an identical AIC value of 1600. However, Model 2 boasts a
superior performance in terms of BIC, as it registers a lower value of 1616 in contrast to Model 1,
which bears a higher BIC value of 1620. Consequently, we can reasonably deduce that Model 2
surpasses Model 1 in predictive capability.
It is crucial to acknowledge that a model with a higher log-likelihood (loglik) is deemed
more precise than a model with a lower log-likelihood. Log-likelihood is a pivotal
statistical metric for gauging the goodness of fit between a model and the data at hand:
it quantifies the likelihood of observing the provided data under the model's underlying
assumptions, so a heightened value signifies that the model aligns more closely with the
data. Note, however, that by this criterion Model 1, with a log-likelihood of -794, fits
marginally better than Model 2, whose log-likelihood is -795; a gap of one unit is
negligible and does not overturn the BIC-based preference for Model 2.
Comparing the RMSE Values for the Two Models
The RMSE for the model excluding the Volume variable (Model 2) stands at 423.45864012155,
marginally edging out the RMSE of 423.45864568846 observed in the model inclusive of the
Volume variable (Model 1). Nonetheless, this disparity is exceedingly slight and likely lacks
practical significance. Consequently, we can ascertain that the omission of the Volume variable
has failed to yield a substantial enhancement in performance. As a result, we will continue to
employ the model encompassing all variables.
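The comparison above can be reproduced along these lines, assuming `train` and `test` are the subsets described in Section 2.3 (object names are ours):

```r
# Fit both candidates: Model 1 with Volume, Model 2 without it
model1 <- glm(Adj.Close ~ Open + High + Low + Volume,
              data = train, family = Gamma(link = "log"))
model2 <- glm(Adj.Close ~ Open + High + Low,
              data = train, family = Gamma(link = "log"))

# Test-set RMSE on the response scale
rmse <- function(fit, newdata) {
  preds <- predict(fit, newdata = newdata, type = "response")
  sqrt(mean((newdata$Adj.Close - preds)^2))
}

c(AIC(model1), AIC(model2))
c(BIC(model1), BIC(model2))
c(rmse(model1, test), rmse(model2, test))
```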
2.5. Predicted Netflix Stock Prices
2.6. Calculating the RMSE, R-squared Value, and MAE by Means of Cross-Validation
i. Perform cross-validation
The RMSE (root mean squared error) is relatively diminutive, standing at 15.27758. This
implies that the model's forecasts closely align with the actual values, exhibiting an average
disparity of approximately 15 units.
An R-squared value of 0.9834965 underscores the model's adeptness in conforming to the
dataset. R-squared serves as an indicator of how well the model elucidates the variability in
the outcome variable, with values converging toward 1 denoting a superior fit. In this instance,
the R-squared figure nearly approaches 1, signifying that the model expounds upon a
substantial portion of the variability within the outcome variable.
The MAE (mean absolute error) also registers as relatively modest, measuring 10.71222. This
metric reflects the average distinction between predicted and actual values, and a lower MAE
signifies that the model's predictions exhibit a greater degree of precision.
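One way to obtain cross-validated RMSE, R-squared, and MAE in a single call is caret's `train` wrapper; the sketch below assumes a 10-fold scheme and the data-frame name `netflix` (both our assumptions):

```r
library(caret)

# 10-fold cross-validation of the Gamma/log-link GLM
ctrl   <- trainControl(method = "cv", number = 10)
cv_fit <- train(Adj.Close ~ Open + High + Low + Volume,
                data      = netflix,
                method    = "glm",
                family    = Gamma(link = "log"),
                trControl = ctrl)

# Cross-validated performance metrics
cv_fit$results[, c("RMSE", "Rsquared", "MAE")]
```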
2.7. Analyzing the Model Parameters via Exponentiated Coefficients and Calculating Their
95% Confidence Intervals
Analyzing Model Parameters Through Interpretation of the Exponentiated Coefficients
Because the model uses a log link, exponentiating a coefficient yields the multiplicative
effect of a one-unit increase in that predictor on the expected Adj.Close price:
A one-unit increase in the Open Netflix stock price multiplies the expected Adj.Close
price by a factor of 0.9985536, a slight decrease.
A one-unit increase in the High Netflix stock price multiplies the expected Adj.Close
price by a factor of 1.0022734.
A one-unit increase in the Low Netflix stock price multiplies the expected Adj.Close
price by a factor of 1.0015079.
A one-unit increase in the Volume of Netflix shares multiplies the expected Adj.Close
price by a factor of 1.0000000, i.e. it has essentially no effect.
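These figures are the exponentiated GLM coefficients and can be reproduced directly from the fitted model, here assumed to be stored in an object named `model1`:

```r
# Multiplicative effect of a one-unit increase in each predictor
exp(coef(model1))

# 95% confidence intervals on the same multiplicative scale
exp(confint(model1))
```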
3. Applying Regularized GLM Models (Ridge, Lasso, and Elastic Net Regression) for
Forecasting Netflix Stock Prices and Assessing Their Performance in Comparison to
GLM Model_1.
Before building the models, we remove the date column. Although the date does not directly
contribute to predicting Netflix stock prices in this dataset, it can still hold value for
time series analysis [14] or the generation of temporal features. For prediction purposes,
however, we exclude it from the dataset.
Developing Regularized GLM Models (Ridge Regression, Lasso Regression, and
Elasticnet Regression)
i. Ridge Regression
ii. Lasso regression
iii. Elasticnet regression
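All three regularized fits listed above are commonly produced with the glmnet package, where the `alpha` mixing parameter selects the penalty: 0 for Ridge, 1 for Lasso, and an intermediate value (0.5 here, our assumption) for Elastic Net. A sketch under the same train/test split:

```r
library(glmnet)

x_train <- as.matrix(train[, c("Open", "High", "Low", "Volume")])
y_train <- train$Adj.Close
x_test  <- as.matrix(test[, c("Open", "High", "Low", "Volume")])

# Cross-validate lambda for a given alpha, then report test-set RMSE
fit_and_score <- function(a) {
  cv    <- cv.glmnet(x_train, y_train, alpha = a)
  preds <- predict(cv, newx = x_test, s = "lambda.min")
  sqrt(mean((test$Adj.Close - preds)^2))
}

c(Ridge = fit_and_score(0), Lasso = fit_and_score(1),
  ElasticNet = fit_and_score(0.5))
```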
Developing Random Forest
library(randomForest)
rf_model <- randomForest(Adj.Close ~ Open + High + Low + Volume, data = train, ntree = 100)
rf_preds <- predict(rf_model, newdata = test)
rf_rmse  <- sqrt(mean((test$Adj.Close - rf_preds)^2))
print(paste("Root Mean Squared Error (RMSE):", rf_rmse))
[1] "Root Mean Squared Error (RMSE): 6.4001203741829"
3.1. Comparing RMSE of All the Models
i. GLM 14.303384
ii. Ridge 5.704350
iii. Lasso 3.249638
iv. Elastic Net 3.663382
v. Random Forest 6.400120
The Lasso model stands out with the lowest RMSE of 3.249638, signifying its superior
predictive performance among the five models.
Following closely, the Elastic Net model also exhibits a low RMSE of 3.663382, securing
its position as the second-best performer among the five models.
In contrast, the Ridge model lags with a higher RMSE of 5.704350, indicating
comparatively weaker predictive performance when compared to the Lasso and Elastic
Net models.
Similarly, the Random Forest model presents a relatively high RMSE of 6.400120.
Lastly, the GLM model trails behind with the highest RMSE of 14.303384, suggesting the
least effective predictive performance among the five models.
Consequently, based on the RMSE values, the Lasso model emerges as the top-
performing model for forecasting the Netflix stock price, followed by the Elastic Net model,
the Ridge model, the Random Forest model, and lastly the GLM model.
Overall, all five models demonstrate accurate predictions, albeit with varying degrees of
precision.
4. Conclusion
Considering the RMSE values, it appears that the inclusion of the Volume variable did not
substantially enhance the model's performance when predicting Netflix stock prices.
Nevertheless, the disparity in RMSE values between the two models is minimal and may not hold
practical significance. Therefore, retaining the Volume variable within the model remains
reasonable, as it may serve as a valuable predictor in other contexts.
For instance, in high-frequency trading scenarios, where stocks change hands in seconds,
trading volume can offer crucial insights into market sentiment and influence stock prices, as
highlighted by [1] and [2]. In such cases, incorporating the Volume variable into the prediction
model can effectively capture the impact of trading volume on stock prices, resulting in more
accurate predictions. Moreover, in situations where investors intend to trade substantial stock
blocks, trading volume can impact stock liquidity, subsequently affecting its price. Hence,
preserving the Volume variable in a financial prediction model holds significance, particularly in
scenarios where trading volume plays a pivotal role in stock price dynamics.
Analyzing the five models used to predict Netflix stock prices based on Root Mean Squared
Errors (RMSE), it becomes evident that the Lasso model demonstrated superior performance,
boasting the lowest RMSE. Following closely is the Elastic Net model, followed by the Ridge
model, then the Random Forest model, with the GLM model lagging behind. Overall, all five
models exhibited a degree of accuracy in forecasting stock prices, with the Lasso and Elastic Net
models excelling. These insights can prove valuable in guiding investment decisions and
formulating risk management strategies within the stock market.
References
[1] Charles Schwab (2021). "Trading Volume as a Market Indicator."
https://www.schwab.com/learn/story/trading-volume-as-market-indicator
[2] Fidelity (2022). "Turn Up the Volume on Stocks."
https://www.fidelity.com/viewpoints/active-investor/stock-volume
[3] Kissell, R., & Poserina, J. (2017). Advanced Math and Statistics.
Optimal Sports Math, Statistics, and Fantasy, 103–135.
doi:10.1016/b978-0-12-805163-4.00004-9
[4] Jayachandran, S. (2021). The Importance of the Stock Market to the
U.S. Economy. Journal of Finance and Marketing, 10(5).
[5] Malkiel, B. G. (2003). The Efficient Market Hypothesis and Its
Critics. Journal of Economic Perspectives, 17(1), 59-82.
[6] McMillan, J. (2020). Stock Markets Can Indicate How the Economy Is
Doing. The Balance.
[7] Kim, H., & Han, I. (2000). Genetic algorithms approach to feature
discretization in artificial neural networks for the prediction of
stock price index. Expert Systems with Applications, 19(2), 125-
132.
[8] Zhang, G., Patuwo, B. E., & Hu, M. Y. (1998). Forecasting with
artificial neural networks: The state of the art. International
Journal of Forecasting, 14(1), 35-62.
[9] Yao, J., Zhang, L., & Yoo, J. (2005). Forecasting stock market
movement direction with support vector machine. Computers &
Operations Research, 32(10), 2513-2522.
[10] Hong, L., & Yoon, J. (2012). Modeling and forecasting the
volatility of the BRIC stock markets: A regime-switching GARCH
model. Emerging Markets Review, 13(2), 181-198.
[11] Data Source:
https://www.kaggle.com/datasets/jainilcoder/netflix-stock-price-prediction/code
[12] Cyril, N. C., Gerard, S. F., Adrian, N. F., Aderonke, A., P.
Kibet, F. Webnda, and Philip, A. A. (2023). "Enhancing Credit Card
Fraud Detection with Regularized Generalized Linear Models: A
Comparative Analysis of Down-Sampling and Up-Sampling Techniques."
International Journal of Innovative Science and Research
Technology (IJISRT), Volume 8(9), 1841-1866. DOI:
10.5281/zenodo.8413849.
[13] Cyril, N. C., Gerard, S. F., Adrian, N. F., Aderonke, A., P.
Kibet, F. Webnda, and Philip, A. A. (2023). "Using Regression
Models to Predict Death Caused by Ambient Ozone Pollution (AOP) in
the United States." International Journal of Innovative Science
and Research Technology (IJISRT), Volume 8(9), 1867-1884. DOI:
10.5281/zenodo.8414044.
[14] Cyril, N. C., Gerard, S. F., Gillian, N., Philip, A. A., Adrian,
N. F., Aderonke, A., P. Kibet, and F. Webnda. (2023). "Time Series
Analysis and Forecasting of COVID-19 Trends in Coffee County,
Tennessee, United States." International Journal of Innovative
Science and Research Technology (IJISRT), Volume 8(9), 2358-2371.
DOI: 10.5281/zenodo.10005806.
... Moreover, our investigation into COVID-19 trends using time series analysis techniques provided valuable insights into epidemiological forecasting, underscoring the importance of advanced analytics in addressing public health challenges [12]. Furthermore, our comparative analysis of stock price prediction models shed light on the performance of different machine learning algorithms in financial forecasting tasks [13]. ...
... Moreover, our investigation into COVID-19 trends using time series analysis techniques provided valuable insights into epidemiological forecasting, underscoring the importance of advanced analytics in addressing public health challenges [12]. Furthermore, our comparative analysis of stock price prediction models shed light on the performance of different machine learning algorithms in financial forecasting tasks [13]. ...
Article
Full-text available
In the rapidly evolving landscape of retail analytics, the accurate prediction of sales figures holds paramount importance for informed decision-making and operational optimization. Leveraging diverse machine learning methodologies, this study aims to enhance the precision of Walmart sales forecasting, utilizing a comprehensive 2 dataset sourced from Kaggle. Exploratory data analysis reveals intricate patterns and temporal dependencies within the data, prompting the adoption of advanced predictive modeling techniques. Through the implementation of linear regression, ensemble methods such as Random Forest, Gradient Boosting Machines (GBM), eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), this research endeavors to identify the most effective approach for predicting Walmart sales. Comparative analysis of model performance showcases the superiority of advanced machine learning algorithms over traditional linear models. The results indicate that XGBoost emerges as the optimal predictor for sales forecasting, boasting the lowest Mean Absolute Error (MAE) of 1226.471, Root Mean Squared Error (RMSE) of 1700.981, and an exceptionally high R-squared value of 0.9999900, indicating near-perfect predictive accuracy. This model's performance significantly surpasses that of simpler models such as linear regression, which yielded an MAE of 35632.510 and an RMSE of 80153.858. Insights from bias and fairness measurements underscore the effectiveness of advanced models in mitigating bias and delivering equitable predictions across temporal segments. Our analysis revealed varying levels of bias across different models. Linear Regression, Multiple Regression, and GLM exhibited moderate bias, suggesting some systematic errors in predictions. Decision Tree showed slightly higher bias, while Random Forest demonstrated a unique scenario of negative bias, implying systematic underestimation of predictions. 
However, models like GBM, XGBoost, and LGB displayed biases closer to zero, indicating more accurate predictions with minimal systematic errors. Notably, the XGBoost model demonstrated the lowest bias, with an MAE of-7.548432 (Table 4), reflecting its superior ability to minimize prediction errors across different conditions. Additionally, fairness analysis revealed that XGBoost maintained robust performance in both holiday and non-holiday periods, with an MAE of 84273.385 for holidays and 1757.721 for non-holidays. Insights from the fairness measurements revealed that Linear Regression, Multiple Regression, and GLM showed consistent predictive performance across both subgroups. Meanwhile, Decision Tree performed similarly for holiday predictions but exhibited better accuracy for non-holiday sales, whereas, Random Forest, XGBoost, GBM, and LGB models displayed lower MAE values for the non-holiday subgroup, indicating potential fairness issues in predicting holiday sales. The study also highlights the importance of model selection and the impact of advanced machine learning techniques on achieving high predictive accuracy and fairness. Ensemble methods like Random Forest and GBM also showed strong performance, with Random Forest achieving an MAE of 12238.782 and an RMSE of 19814.965, and GBM achieving an MAE of 10839.822 and an RMSE of 1700.981. This research emphasizes the significance of leveraging sophisticated analytics tools to navigate the complexities of retail operations and drive strategic decision-making. By utilizing advanced machine learning models, retailers can achieve more accurate sales forecasts, ultimately leading to better inventory management and enhanced operational efficiency. The study reaffirms the transformative potential of data-driven approaches in driving business growth and innovation in the retail sector.
Research
Full-text available
This study conducts a comprehensive time series analysis and forecasting of COVID-19 trends in Coffee County, Tennessee, aiming to understand the pandemic's progression and its implications for public health policy and resource allocation. Utilizing daily reported cases and deaths from official health sources, we apply various time series forecasting techniques, including ARIMA (AutoRegressive Integrated Moving Average), Seasonal Decomposition of Time Series (STL), and Exponential Smoothing State Space Models (ETS), to model the dynamics of COVID-19 infections in the region. We begin by exploring the historical data to identify trends, seasonality, and potential outliers, employing visualizations and statistical tests to assess data characteristics. Subsequently, we implement the ARIMA model, optimizing parameters through auto-correlation and partial auto-correlation functions, alongside evaluating the model's residuals to ensure adequacy. Additionally, the STL decomposition method is used to extract seasonal and trend components, facilitating a clearer understanding of underlying patterns. To enhance forecasting accuracy, we also leverage ETS models, which adaptively smooth the data, capturing changes in trends and seasonal effects effectively. Our results highlight significant fluctuations in case numbers, influenced by various socioeconomic factors and public health interventions throughout the pandemic. The forecasting outcomes provide valuable insights into potential future trends, aiding local health authorities in decision-making processes regarding resource allocation and public health measures. This study underscores the importance of continuous monitoring and adaptive strategies in response to evolving COVID-19 dynamics, contributing to the broader discourse on pandemic preparedness and response at the community level.
Research
Credit card fraud poses a significant threat to financial institutions, resulting in substantial financial losses and eroding consumer trust. Effective detection of fraudulent transactions is crucial for mitigating these risks. This study investigates the performance of regularized Generalized Linear Models (GLMs) in detecting credit card fraud, focusing on the impact of various down-sampling techniques on model accuracy and efficiency. Given the highly imbalanced nature of credit card transaction data, traditional classification methods often struggle to identify fraudulent transactions due to the overwhelming majority of legitimate cases. To address this challenge, we explore several down-sampling strategies, including random down-sampling, Tomek links, and Edited Nearest Neighbors (ENN). Each technique aims to reduce the dataset's size while retaining essential characteristics, thereby enhancing the performance of the regularized GLMs. The effectiveness of these methods is evaluated based on metrics such as precision, recall, F1 score, and area under the ROC curve (AUC). We conduct a comparative analysis of the GLM performance with and without the application of down-sampling techniques, examining how these methods influence the model's ability to detect fraudulent transactions. The findings demonstrate that employing down-sampling techniques significantly improves the performance of regularized GLMs in fraud detection. The study concludes that a strategic combination of regularization methods and down-sampling techniques can enhance the identification of credit card fraud, thereby contributing to the development of more robust and efficient detection systems. This research offers valuable insights for financial institutions seeking to implement effective fraud detection mechanisms while ensuring minimal disruption to legitimate transactions.
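Random down-sampling, the simplest of the strategies this abstract compares, can be sketched as follows; the labels and counts here are hypothetical, and real pipelines would down-sample only the training split.

```python
import random

# Sketch of random down-sampling for a class-imbalanced dataset: the
# majority (legitimate) class is sampled down to the size of the minority
# (fraud) class before model fitting. Data below are hypothetical.

def downsample(rows, seed=0):
    """rows: list of (features, label) pairs, label 1 = fraud, 0 = legitimate."""
    fraud = [r for r in rows if r[1] == 1]
    legit = [r for r in rows if r[1] == 0]
    rng = random.Random(seed)
    kept = rng.sample(legit, k=len(fraud))  # keep as many legit as fraud rows
    balanced = fraud + kept
    rng.shuffle(balanced)
    return balanced

# 95 legitimate rows, 5 fraud rows -> balanced set of 10 rows
data = [((i,), 0) for i in range(95)] + [((i,), 1) for i in range(5)]
balanced = downsample(data)
print(len(balanced))
```

Tomek links and ENN differ in that they remove majority points near the class boundary rather than uniformly at random.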
Research
This case study investigates the forecasting of Netflix stock prices using various regression and machine learning models, aimed at enhancing predictive accuracy in a dynamic financial environment. As one of the leading streaming services globally, Netflix's stock performance is influenced by numerous factors, including subscriber growth, content investments, and market competition. To analyze these influences, we employ a range of models, including Generalized Linear Models (GLM), Ridge Regression, Lasso Regression, Elastic Net, and advanced machine learning techniques such as Random Forest and Support Vector Regression (SVR). The study begins by preprocessing historical stock price data, extracting relevant features that may impact price movements, including macroeconomic indicators and company-specific metrics. We then implement the selected models and compare their predictive performance using various metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared values. Preliminary results indicate that machine learning models, particularly Random Forest and SVR, outperform traditional regression techniques in terms of predictive accuracy, highlighting their ability to capture complex, non-linear relationships in the data. Furthermore, the study examines the importance of feature selection and engineering, demonstrating how tailored predictors can significantly enhance model performance. This research provides valuable insights into the efficacy of different forecasting methods for stock price prediction in the rapidly evolving entertainment sector. By leveraging advanced analytical techniques, investors and analysts can make more informed decisions regarding Netflix stock, ultimately contributing to more effective investment strategies.
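The evaluation metrics this abstract relies on (MAE, MSE, R-squared) are easy to state in plain Python; the price values below are hypothetical and stand in for actual versus predicted adjusted closes.

```python
# Plain-Python versions of the metrics named in the abstract, applied to
# hypothetical actual/predicted price series.

def mae(y, yhat):
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def mse(y, yhat):
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)

def r_squared(y, yhat):
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot

actual    = [320.0, 325.0, 318.0, 330.0]   # hypothetical adjusted closes
predicted = [322.0, 324.0, 320.0, 327.0]
print(mae(actual, predicted), mse(actual, predicted), round(r_squared(actual, predicted), 4))
```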
Research
This study presents a comparative analysis of various stock price prediction models, specifically focusing on Generalized Linear Models (GLM), Ridge Regression, Lasso Regression, Elastic Net, and Random Forest. As financial markets become increasingly complex, accurate forecasting of stock prices is critical for investors and financial analysts. Each model offers unique advantages and drawbacks, which can significantly impact prediction accuracy. The analysis begins with an overview of the fundamental principles underlying each model. GLM serves as a flexible tool for modeling the relationship between stock prices and predictor variables, while Ridge and Lasso Regression introduce regularization techniques to mitigate overfitting, enhancing model robustness. Elastic Net combines the strengths of both Ridge and Lasso, making it particularly useful for scenarios with highly correlated features. Random Forest, on the other hand, leverages ensemble learning, constructing multiple decision trees to improve prediction accuracy and handle non-linear relationships effectively. The models are evaluated using historical stock price data, with performance metrics including Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared values. A series of experiments are conducted to determine the models' predictive power across different time horizons and market conditions. The findings indicate that while Random Forest consistently outperforms traditional regression models in terms of prediction accuracy, the simpler models (GLM, Ridge, Lasso, and Elastic Net) offer interpretability that is valuable for understanding the underlying market dynamics. Ultimately, this study provides insights into selecting appropriate stock price prediction models based on specific analytical needs, paving the way for future research in financial forecasting methodologies.
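The shrinkage behaviour that distinguishes Ridge from plain GLM/OLS can be shown in closed form for a single centered predictor with no intercept, where the ridge estimate is beta(lambda) = sum(x*y) / (sum(x*x) + lambda). The data below are hypothetical; this is a sketch of the mechanism, not of the study's actual fit.

```python
# Sketch of how an L2 penalty shrinks a coefficient: for one centered
# predictor and no intercept, beta(lambda) = sum(x*y) / (sum(x*x) + lambda),
# so the estimate moves toward zero as lambda grows. Data are hypothetical.

def ridge_beta(x, y, lam):
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)

x = [-2.0, -1.0, 0.0, 1.0, 2.0]    # centered predictor
y = [-4.1, -2.2, 0.1, 2.0, 4.2]    # roughly y = 2x

for lam in (0.0, 1.0, 10.0):
    print(lam, round(ridge_beta(x, y, lam), 4))
```

Lasso and Elastic Net replace or mix the L2 penalty with an L1 term, which can drive coefficients exactly to zero rather than merely shrinking them.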
Article
This article presents a comprehensive study of sales predictions using time series analysis, focusing on a case study of Walmart sales data. The aim of this study is to evaluate the effectiveness of various time series forecasting techniques in predicting weekly sales data for Walmart stores. Leveraging a dataset from Kaggle comprising weekly sales data from various Walmart stores around the United States, this study explores the effectiveness of time series analysis in forecasting future sales trends. Various time series models, including ARIMA (AutoRegressive Integrated Moving Average), Seasonal Autoregressive Integrated Moving Average (SARIMA), Prophet, Exponential Smoothing, and Gaussian Processes, are applied to model and forecast Walmart sales data. By comparing the performance of these models, the study seeks to identify the most accurate and reliable methods for forecasting retail sales, thereby providing valuable insights for improving sales predictions in the retail sector. The study includes an extensive exploratory data analysis (EDA) phase to preprocess the data, detect outliers, and visualize sales trends over time. Additionally, the article discusses the partitioning of data into training and testing sets for model evaluation. Performance metrics such as Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) are utilized to compare the accuracy of different time series models. The results indicate that Gaussian Processes outperform other models in terms of accuracy, with an RMSE of 34,116.09 and an MAE of 25,495.72, significantly lower than the other models evaluated. For comparison, ARIMA and SARIMA models both yielded an RMSE of 555,502.2 and an MAE of 462,767.3, while the Prophet model showed an RMSE of 567,509.2 and an MAE of 474,990.8. Exponential Smoothing also performed well with an RMSE of 555,081.7 and an MAE of 464,110.5. These findings suggest the potential of Gaussian Processes for accurate sales forecasting.
However, the study also highlights the strengths and weaknesses of each forecasting methodology, emphasizing the need for further research to refine existing techniques and explore novel modeling approaches. Overall, this study contributes to the understanding of time series analysis in retail sales forecasting and provides insights for improving future forecasting endeavors.
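The train/test evaluation this abstract describes is often done in walk-forward (rolling-origin) fashion for time series. A minimal sketch, using a naive last-value forecast as the baseline and hypothetical weekly sales figures:

```python
# Sketch of walk-forward evaluation: each held-out point is forecast with
# the previous observed value (a naive baseline), and RMSE/MAE summarize
# the errors. The weekly sales figures are hypothetical.

def walk_forward_naive(series, n_test):
    """Score a naive last-value forecast over the final n_test points."""
    errors = [series[i] - series[i - 1]
              for i in range(len(series) - n_test, len(series))]
    mae = sum(abs(e) for e in errors) / n_test
    rmse = (sum(e * e for e in errors) / n_test) ** 0.5
    return mae, rmse

weekly_sales = [100.0, 103.0, 101.0, 107.0, 110.0, 108.0]
mae, rmse = walk_forward_naive(weekly_sales, n_test=3)
print(round(mae, 3), round(rmse, 3))
```

A fitted model is worth keeping only if it beats such a naive baseline on the same held-out points.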
Article
Interest in using artificial neural networks (ANNs) for forecasting has led to a tremendous surge in research activities in the past decade. While ANNs provide a great deal of promise, they also embody much uncertainty. Researchers to date are still not certain about the effect of key factors on forecasting performance of ANNs. This paper presents a state-of-the-art survey of ANN applications in forecasting. Our purpose is to provide (1) a synthesis of published research in this area, (2) insights on ANN modeling issues, and (3) the future research directions.
Chapter
This chapter provides an overview of the use of probability and statistics in sports modeling applications. The chapter includes an overview of the important mathematics required for probability and statistics modeling and a review of essential probability distribution functions required for model construction and parameter estimation. The chapter also includes an introduction to different sampling techniques that can be used to test the accuracy of sports prediction models and to correct for the data limitation issues that are often present in sports modeling problems, arising primarily from limited observations and/or the absence of games across all pairs of teams.
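One of the sampling techniques alluded to above, bootstrap resampling, can be sketched with the stdlib alone. The game outcomes below are hypothetical; the point is resampling with replacement to estimate the uncertainty of a win-rate from few observations.

```python
import random

# Sketch of bootstrap resampling: repeatedly resample the observed outcomes
# with replacement and recompute the win-rate, yielding an empirical
# distribution for it. The outcomes are hypothetical (1 = win, 0 = loss).

def bootstrap_means(outcomes, n_boot=1000, seed=42):
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        sample = [rng.choice(outcomes) for _ in outcomes]  # with replacement
        means.append(sum(sample) / len(sample))
    return means

games = [1, 0, 1, 1, 0, 1, 0, 1]
means = sorted(bootstrap_means(games))
lo, hi = means[25], means[974]     # rough 95% interval from the percentiles
print(round(lo, 3), round(hi, 3))
```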
Article
This paper proposes a genetic algorithms (GAs) approach to feature discretization and the determination of connection weights for artificial neural networks (ANNs) to predict the stock price index. Previous research proposed many hybrid models of ANN and GA for the method of training the network, feature subset selection, and topology optimization. In most of these studies, however, GA is only used to improve the learning algorithm itself. In this study, GA is employed not only to improve the learning algorithm, but also to reduce the complexity in feature space. GA optimizes simultaneously the connection weights between layers and the thresholds for feature discretization. The genetically evolved weights mitigate the well-known limitations of the gradient descent algorithm. In addition, globally searched feature discretization reduces the dimensionality of the feature space and eliminates irrelevant factors. Experimental results show that the GA approach to the feature discretization model outperforms the other two conventional models.
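A toy sketch of the idea of GA-searched discretization thresholds, not the paper's actual method: each individual is a single cut-point for one feature, fitness is classification accuracy after binarizing, and the GA here is deliberately minimal (truncation selection and mutation only, no crossover). Data are hypothetical.

```python
import random

# Toy GA searching for a discretization threshold on one feature.
# Fitness = accuracy of the rule "predict 1 iff x >= threshold".

def fitness(threshold, xs, ys):
    preds = [1 if x >= threshold else 0 for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

def evolve(xs, ys, pop_size=20, generations=30, seed=1):
    rng = random.Random(seed)
    pop = [rng.uniform(min(xs), max(xs)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda t: fitness(t, xs, ys), reverse=True)
        parents = pop[: pop_size // 2]                        # keep best half
        children = [p + rng.gauss(0, 0.1) for p in parents]   # mutate them
        pop = parents + children
    return max(pop, key=lambda t: fitness(t, xs, ys))

xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0,   0,   0,   0,   1,   1,   1,   1]   # separable near 0.5
best = evolve(xs, ys)
print(round(best, 3), fitness(best, xs, ys))
```

The paper's contribution is richer than this: its GA evolves ANN connection weights and discretization thresholds jointly, rather than a single cut-point.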
Article
Revolutions often spawn counterrevolutions and the efficient market hypothesis in finance is no exception. The intellectual dominance of the efficient-market revolution has been challenged by economists who stress psychological and behavioral elements of stock-price determination and by econometricians who argue that stock returns are, to a considerable extent, predictable. This survey examines the attacks on the efficient market hypothesis and the relationship between predictability and efficiency. I conclude that our stock markets are more efficient and less predictable than many recent academic papers would have us believe.