Forecasting Netflix Stock Prices: A Case Study Using Regression and Machine Learning Models
Authors: Muhammad Hanif, Thomas Best
Date: October 2024
Abstract
This case study investigates the forecasting of Netflix stock prices using various regression and
machine learning models, with the aim of enhancing predictive accuracy in a dynamic financial
environment. As one of the leading streaming services globally, Netflix's stock performance is
influenced by numerous factors, including subscriber growth, content investments, and market
competition. To analyze these influences, we employ a range of models, including Generalized
Linear Models (GLM), Ridge Regression, Lasso Regression, Elastic Net, and advanced machine
learning techniques such as Random Forest and Support Vector Regression (SVR). The study
begins by preprocessing historical stock price data, extracting relevant features that may impact
price movements, including macroeconomic indicators and company-specific metrics. We then
implement the selected models and compare their predictive performance using various metrics
such as Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared values.
Preliminary results indicate that machine learning models, particularly Random Forest and SVR,
outperform traditional regression techniques in terms of predictive accuracy, highlighting their
ability to capture complex, non-linear relationships in the data. Furthermore, the study examines
the importance of feature selection and engineering, demonstrating how tailored predictors can
significantly enhance model performance. This research provides valuable insights into the
efficacy of different forecasting methods for stock price prediction in the rapidly evolving
entertainment sector. By leveraging advanced analytical techniques, investors and analysts can
make more informed decisions regarding Netflix stock, ultimately contributing to more effective
investment strategies.
Keywords: Netflix, stock prices, forecasting, regression models, machine learning, Generalized
Linear Models (GLM), Random Forest, Support Vector Regression, predictive accuracy, feature
selection.
Introduction
In today's fast-paced financial markets, accurate forecasting of stock prices is essential for
investors and analysts seeking to make informed decisions. This case study focuses on forecasting
the stock prices of Netflix, a leading player in the global streaming industry, utilizing both
traditional regression models and advanced machine learning techniques. Netflix's stock price is
influenced by various factors, including subscriber growth, content production, competitive
dynamics, and broader market trends. Consequently, developing a reliable forecasting model that
can capture these complexities is crucial for investment strategies. The increasing complexity of
financial data necessitates the use of sophisticated modeling techniques. Traditional statistical
methods, such as Generalized Linear Models (GLM), have been widely used for stock price
prediction due to their simplicity and interpretability. However, these methods often struggle to
capture non-linear relationships present in the data. To address this limitation, we also explore
regularization techniques such as Ridge Regression, Lasso Regression, and Elastic Net, which
enhance the robustness of traditional models by managing multicollinearity and reducing
overfitting. In recent years, machine learning models have gained popularity in financial
forecasting due to their ability to handle large datasets and uncover hidden patterns. This study
incorporates advanced machine learning techniques, including Random Forest and Support Vector
Regression (SVR), which excel in modeling complex relationships among variables. By leveraging
these models, we aim to enhance predictive accuracy and provide deeper insights into the factors
influencing Netflix's stock price. A critical aspect of any forecasting model is the selection and
engineering of features. In this study, we extract relevant predictors from historical stock price
data, including macroeconomic indicators and company-specific metrics, to create a
comprehensive dataset for analysis. The performance of the various models is evaluated using key
metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared values.
These metrics provide a quantitative basis for comparing the effectiveness of each modeling
approach. The findings of this case study are significant for investors and financial analysts, as
they demonstrate the potential of integrating machine learning techniques into stock price
forecasting. By improving predictive accuracy, investors can make better-informed decisions
regarding their investments in Netflix, ultimately contributing to more effective risk management
and capital allocation strategies. The insights gained from this analysis not only enhance our
understanding of Netflix's stock performance but also pave the way for further research in
predictive analytics within the financial sector.
1. Stock Price Prediction
Forecasting stock prices is a complex yet vital task in financial markets, serving as a critical tool
for investors, traders, and analysts. Accurate predictions can lead to informed investment decisions
and effective risk management strategies. This section delves into the methodologies employed in
predicting stock prices, focusing on the integration of traditional regression techniques and
advanced machine learning models, as well as the essential aspect of feature engineering.
Regression Techniques
Traditional regression techniques, particularly Generalized Linear
Models (GLM), have long been utilized in stock price prediction due to their interpretability and
ease of use. GLM provides a framework for modeling the relationship between stock prices and
various predictor variables. However, it often faces challenges in capturing non-linear
relationships inherent in financial data. To mitigate these limitations, regularization techniques
such as Ridge Regression, Lasso Regression, and Elastic Net can be employed. These methods
enhance the robustness of the model by addressing issues like multicollinearity and overfitting.
Ridge Regression adds an L2 penalty to the loss function, which shrinks coefficients and stabilizes
estimates, while Lasso Regression uses an L1 penalty that shrinks some coefficients exactly to zero,
allowing for a more interpretable model. Elastic Net combines both penalties, offering a flexible
approach that balances the strengths of Ridge and Lasso.
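To make the penalties concrete, the sketch below fits the three regularized models with scikit-learn. It is an illustrative sketch, not the study's exact pipeline: the synthetic X and y stand in for the engineered Netflix features and target, and the penalty strengths (alpha, l1_ratio) are placeholder values that would normally be tuned by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: in the study these would be engineered features
# (lags, macroeconomic indicators) and the Netflix closing price.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=500)

models = {
    "ridge": Ridge(alpha=1.0),                           # L2: shrinks coefficients
    "lasso": Lasso(alpha=0.1),                           # L1: zeroes weak predictors
    "elastic_net": ElasticNet(alpha=0.1, l1_ratio=0.5),  # blend of L1 and L2
}

for name, model in models.items():
    # Standardize first so the penalty treats all predictors on the same scale.
    pipeline = make_pipeline(StandardScaler(), model)
    pipeline.fit(X, y)
    print(name, round(pipeline.score(X, y), 3))  # in-sample R^2, illustration only
```

In practice the penalty strengths would be selected with cross-validated variants such as RidgeCV or LassoCV, evaluated on a chronological validation split.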
Machine Learning Models
With the increasing availability of vast amounts of financial data,
machine learning models have emerged as powerful alternatives to traditional regression
techniques. Models like Random Forest and Support Vector Regression (SVR) excel in capturing
complex patterns and interactions within the data. Random Forest, as an ensemble method,
constructs multiple decision trees to improve prediction accuracy and reduce overfitting. It is
particularly adept at handling non-linear relationships, making it well-suited for stock price
forecasting. SVR, on the other hand, focuses on finding a hyperplane that best fits the data while
allowing for some margin of error. This approach is effective for modeling intricate relationships
and can adapt to varying market conditions.
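A minimal configuration sketch for these two learners follows, assuming a scikit-learn workflow; the synthetic non-linear target merely stands in for price data, and the hyperparameters shown (tree count, C, epsilon) are illustrative assumptions rather than the study's tuned values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=500)

# Random Forest: averages many bootstrap-trained decision trees, which lets
# it capture non-linear structure while damping the variance of any one tree.
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)

# SVR with an RBF kernel: errors inside the epsilon tube are ignored, which
# gives the margin of tolerance described above. SVR is scale-sensitive,
# so features are standardized first.
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1)).fit(X, y)
```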
Feature Engineering
A crucial component of effective stock price prediction is feature
engineering, which involves selecting and transforming raw data into meaningful predictors. This
process includes identifying key variables that influence stock prices, such as historical price
trends, trading volumes, macroeconomic indicators, and company-specific metrics. The quality
and relevance of features directly impact the model's predictive performance. Techniques such as
lagged variables, moving averages, and technical indicators can be utilized to enhance the dataset,
providing models with the necessary information to make accurate predictions.
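As an illustration of such transformations, the pandas sketch below builds lagged values, moving averages, and a daily return from a closing-price series; the window lengths and column names are illustrative choices, not the study's exact feature set.

```python
import pandas as pd

def make_features(close: pd.Series) -> pd.DataFrame:
    """Derive simple predictors from a daily closing-price series."""
    df = pd.DataFrame({"close": close})
    df["lag_1"] = df["close"].shift(1)              # yesterday's close
    df["lag_5"] = df["close"].shift(5)              # close one trading week ago
    df["ma_50"] = df["close"].rolling(50).mean()    # short-term trend
    df["ma_200"] = df["close"].rolling(200).mean()  # long-term trend
    df["return_1"] = df["close"].pct_change()       # daily return
    # Drop the warm-up rows where lags and rolling windows are undefined.
    return df.dropna()
```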
Model Comparison
To evaluate the performance of the various regression and machine learning
models, a comparative analysis is essential. Key performance metrics, including Mean Absolute
Error (MAE), Mean Squared Error (MSE), and R-squared values, are employed to quantify the
predictive accuracy of each model. This analysis helps identify which techniques are best suited
for stock price prediction under varying conditions and informs the selection of the most effective
approach. By leveraging these methodologies, investors can enhance their forecasting capabilities,
leading to better-informed investment strategies in an increasingly complex financial landscape.
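As a sketch, the hypothetical helper below computes the three comparison metrics with scikit-learn; for stock data, y_true and y_pred should come from a chronological split that trains on the past and tests on the future, since a random split would leak information.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    """Return the three comparison metrics used throughout this study."""
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mean_squared_error(y_true, y_pred),
        "R2": r2_score(y_true, y_pred),
    }
```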
2. Feature Selection and Engineering
Effective feature selection and engineering play a pivotal role in enhancing the predictive
performance of stock price forecasting models. By identifying and transforming relevant data into
meaningful predictors, analysts can significantly improve the accuracy of their models. This
section explores the importance of feature selection, various methods for engineering features, and
the impact of these processes on model performance.
Importance of Feature Selection
Feature selection is the process of identifying the most relevant
variables from a larger dataset that contribute significantly to the prediction of stock prices. The
quality of the selected features directly influences the performance of forecasting models, as
irrelevant or redundant features can lead to overfitting, increased computational costs, and
decreased interpretability. Techniques such as correlation analysis, mutual information, and
recursive feature elimination are commonly employed to assess the significance of features. By
narrowing down the list of predictors, analysts can enhance model accuracy while simplifying the
analysis process. Additionally, the selection of features should consider domain knowledge,
incorporating variables known to affect stock prices, such as macroeconomic indicators (e.g.,
interest rates, inflation rates), industry trends, and company-specific metrics (e.g., earnings reports,
subscriber growth). This strategic approach ensures that the models capture relevant information
necessary for making informed predictions.
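The sketch below illustrates the three screening techniques named above on synthetic data; the feature names are hypothetical, and the cut-offs for keeping or dropping predictors would in practice be set with domain judgment.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, mutual_info_regression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 6)),
                 columns=["lag_1", "ma_50", "volume", "rate", "cpi", "noise"])
y = 2 * X["lag_1"] + 0.5 * X["ma_50"] + rng.normal(size=300)

# Correlation analysis: flags predictors linearly related to the target.
corr = X.corrwith(pd.Series(y, index=X.index)).abs().sort_values(ascending=False)

# Mutual information: also captures non-linear dependence.
mi = mutual_info_regression(X, y, random_state=0)

# Recursive feature elimination: repeatedly refits and drops the weakest predictor.
rfe = RFE(LinearRegression(), n_features_to_select=3).fit(X, y)
print(list(X.columns[rfe.support_]))  # expected to recover lag_1 and ma_50
```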
Feature Engineering Techniques
Once the relevant features are identified, feature engineering
transforms raw data into a format suitable for modeling. Common techniques include creating
lagged variables, moving averages, and technical indicators. Lagged variables represent past
values of the stock price or other predictors, providing models with historical context. For example,
incorporating the stock price from the previous day as a predictor can help capture trends and
momentum in price movements. Moving averages smooth out price fluctuations over specific
periods, allowing analysts to identify underlying trends more effectively. Short-term and
long-term moving averages, such as the 50-day and 200-day averages, are often used to signal potential
buy or sell opportunities. Technical indicators, such as Relative Strength Index (RSI) and Moving
Average Convergence Divergence (MACD), provide additional insights into market conditions,
helping traders assess overbought or oversold scenarios.
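For illustration, the functions below compute a simple-moving-average variant of RSI and the conventional EMA-based MACD in pandas; the default windows (14 for RSI, 12/26/9 for MACD) follow common trading convention and are assumptions, not parameters chosen in this study.

```python
import pandas as pd

def rsi(close: pd.Series, window: int = 14) -> pd.Series:
    """Relative Strength Index (simple moving-average variant)."""
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(window).mean()     # average gain
    loss = (-delta.clip(upper=0)).rolling(window).mean()  # average loss
    return 100 - 100 / (1 + gain / loss)

def macd(close: pd.Series, fast: int = 12, slow: int = 26, signal: int = 9):
    """MACD line and its signal line from exponential moving averages."""
    macd_line = (close.ewm(span=fast, adjust=False).mean()
                 - close.ewm(span=slow, adjust=False).mean())
    signal_line = macd_line.ewm(span=signal, adjust=False).mean()
    return macd_line, signal_line
```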
Dimensionality Reduction
In some cases, high-dimensional datasets can complicate the modeling
process. Dimensionality reduction techniques, such as Principal Component Analysis (PCA) or
t-distributed Stochastic Neighbor Embedding (t-SNE), can help by transforming the data into a
lower-dimensional space while retaining the essential characteristics. These techniques can
simplify models, reduce noise, and enhance interpretability, allowing analysts to focus on the most
critical components influencing stock prices.
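A short PCA sketch with scikit-learn follows; passing a float to n_components keeps just enough components to explain that fraction of the variance. The 95% threshold and the random feature matrix are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))  # stand-in for a wide engineered feature matrix

# Standardize first so no single predictor dominates the components, then
# retain enough principal components to explain 95% of the variance.
reducer = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = reducer.fit_transform(X)
print(X_reduced.shape)
```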
Impact on Model Performance
The impact of feature selection and engineering on model
performance can be evaluated through various metrics, including Mean Absolute Error (MAE),
Mean Squared Error (MSE), and R-squared values. By comparing models with and without certain
features, analysts can assess the contribution of each feature to the overall predictive capability.
This iterative process of refining features ensures that models remain robust and adaptable to
changing market conditions. By identifying relevant predictors, transforming raw data into
meaningful features, and employing dimensionality reduction techniques, analysts can
significantly enhance the predictive performance of their models. This process not only improves
accuracy but also facilitates a deeper understanding of the factors driving stock price movements,
ultimately leading to more informed investment decisions.
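One way to run such a with-and-without comparison is sketched below, under stated assumptions: the data and the dropped column are synthetic placeholders, and a simple chronological split stands in for proper walk-forward validation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = 2 * X[:, 0] + rng.normal(size=400)
split = 300  # train on the earlier observations, test on the later ones

def test_mae(cols):
    """MAE on the hold-out period using only the given feature columns."""
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X[:split][:, cols], y[:split])
    return mean_absolute_error(y[split:], model.predict(X[split:][:, cols]))

full = test_mae([0, 1, 2, 3, 4])
ablated = test_mae([1, 2, 3, 4])  # drop feature 0 and measure the damage
print(f"MAE all features: {full:.3f}   MAE without feature 0: {ablated:.3f}")
```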
3. Comparative Analysis of Predictive Models
The effectiveness of stock price prediction largely hinges on the choice of modeling techniques.
In this section, we conduct a comparative analysis of various regression and machine learning
models, focusing on their predictive performance, strengths, and weaknesses. By examining
models such as Generalized Linear Models (GLM), Ridge Regression, Lasso Regression, Elastic
Net, and advanced machine learning approaches like Random Forest and Support Vector
Regression (SVR), we aim to identify the most effective strategies for forecasting Netflix stock
prices.
Traditional Regression Models
Traditional regression techniques like GLM provide a foundation
for understanding relationships between stock prices and predictor variables. GLM is favored for
its interpretability and simplicity, allowing analysts to glean insights from the coefficients of the
model. However, GLM often struggles with non-linear relationships and multicollinearity issues,
which can hinder predictive accuracy. Ridge Regression addresses multicollinearity by
introducing an L2 regularization term, which penalizes large coefficients, thus stabilizing the
model. This can improve performance in scenarios where predictors are highly correlated.
Conversely, Lasso Regression employs an L1 penalty, which not only addresses multicollinearity
but also performs feature selection by shrinking some coefficients to zero. Elastic Net combines
both penalties, making it a versatile option for handling complex datasets. While these traditional
models can provide valuable insights, they may not fully capture the intricacies of stock price
movements, especially in a volatile market.
Machine Learning Models
In contrast, machine learning models such as Random Forest and
SVR have emerged as powerful tools for forecasting stock prices. Random Forest operates by
constructing a multitude of decision trees and aggregating their predictions, effectively capturing
complex interactions and non-linear relationships in the data. This ensemble approach mitigates
the risk of overfitting, making Random Forest a robust choice for stock price prediction. Support
Vector Regression (SVR) is another formidable contender, particularly adept at handling
high-dimensional data. Rather than separating data points, SVR fits a function that keeps most
observations within an epsilon-insensitive margin, allowing it to
model complex relationships effectively. It offers flexibility through different kernel functions,
enabling analysts to tailor the model to specific characteristics of the dataset.
Model Evaluation Metrics
To assess the predictive performance of these models, we employ key
evaluation metrics, including Mean Absolute Error (MAE), Mean Squared Error (MSE), and
R-squared values. MAE averages the absolute errors, offering a clear view of the model's typical
accuracy. MSE squares the errors before averaging, weighting large deviations more heavily and
thus penalizing models prone to occasional large misses. R-squared values measure the proportion
of variance explained by the model, highlighting its explanatory power.
Results and Insights
The comparative analysis reveals that machine learning models generally
outperform traditional regression techniques in terms of predictive accuracy, particularly in
capturing non-linear dynamics and interactions among variables. For instance, Random Forest and
SVR consistently demonstrate lower MAE and MSE values compared to GLM, Ridge, Lasso, and
Elastic Net, indicating their superior ability to forecast stock prices. While traditional regression
models provide valuable insights, advanced machine learning techniques such as Random Forest
and SVR offer enhanced predictive performance, particularly in complex financial markets. By
integrating these methodologies into stock price prediction, analysts can develop more effective
investment strategies and improve decision-making processes in an increasingly dynamic
environment.
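The loop below sketches how such a head-to-head comparison can be organized; ordinary least squares stands in for the GLM, the hyperparameters are untuned placeholders, and the synthetic non-linear target is there only to illustrate why the tree- and kernel-based models can pull ahead.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=600)
split = 450  # chronological split: fit on earlier data, score on later data

models = {
    "GLM (OLS)": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.01),
    "ElasticNet": ElasticNet(alpha=0.01, l1_ratio=0.5),
    "RandomForest": RandomForestRegressor(n_estimators=300, random_state=0),
    "SVR": SVR(kernel="rbf", C=10.0),
}

for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X[:split], y[:split])
    pred = pipe.predict(X[split:])
    print(f"{name:>12}  MAE={mean_absolute_error(y[split:], pred):.3f}  "
          f"MSE={mean_squared_error(y[split:], pred):.3f}  "
          f"R2={r2_score(y[split:], pred):.3f}")
```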
Conclusion
In the realm of financial markets, accurately forecasting stock prices is crucial for informed
investment decisions and effective risk management strategies. This study has explored various
modeling techniques, including traditional regression methods and advanced machine learning
approaches, to predict the stock prices of Netflix. Through a comprehensive analysis, we have
identified the strengths and limitations of each method, emphasizing the importance of model
selection and feature engineering in enhancing predictive accuracy. The integration of traditional
models such as Generalized Linear Models (GLM), Ridge Regression, Lasso Regression, and
Elastic Net offers a foundational understanding of the relationships between stock prices and
various predictors. While these models provide interpretability and insights into the underlying
dynamics, they often fall short in capturing non-linear interactions and complex patterns within
the data. The introduction of regularization techniques, such as Ridge and Lasso, addresses some
limitations, particularly concerning multicollinearity and overfitting. However, they may not fully
exploit the potential of advanced analytics in volatile market conditions. On the other hand,
machine learning models like Random Forest and Support Vector Regression (SVR) demonstrate
remarkable capabilities in predicting stock prices by effectively handling large datasets and
uncovering hidden relationships. Their ability to model non-linear dynamics and interactions
provides a significant advantage in the ever-changing financial landscape. The comparative
analysis conducted in this study revealed that machine learning approaches consistently
outperformed traditional regression techniques, as evidenced by lower Mean Absolute Error
(MAE) and Mean Squared Error (MSE) values. Moreover, feature selection and engineering
emerged as critical components in developing robust predictive models. By carefully selecting
relevant predictors and transforming raw data into meaningful features, analysts can enhance
model performance and gain deeper insights into the factors influencing stock prices. This iterative
process of refining features ensures that models remain adaptable to changing market conditions,
ultimately leading to better-informed investment strategies. By integrating traditional regression
techniques with advanced machine learning models, analysts can significantly improve forecasting
accuracy, enabling them to navigate the complexities of financial markets more effectively. The
findings of this research not only contribute to the understanding of Netflix's stock performance
but also pave the way for further exploration of predictive analytics in finance. As markets continue
to evolve, the ongoing development and refinement of these methodologies will play a vital role
in shaping investment strategies and enhancing decision-making processes in the financial sector.