Time Series Analysis and Forecasting of COVID-19 Trends in Coffee
County, Tennessee
Authors: Fatima Asad, Alexander Noah
Date: November 2024
Abstract
This study conducts a comprehensive time series analysis and forecasting of COVID-19 trends in
Coffee County, Tennessee, aiming to understand the pandemic's progression and its implications
for public health policy and resource allocation. Utilizing daily reported cases and deaths from
official health sources, we apply various time series forecasting techniques, including ARIMA
(AutoRegressive Integrated Moving Average), Seasonal-Trend decomposition using Loess (STL), and
Exponential Smoothing State Space Models (ETS), to model the dynamics of COVID-19
infections in the region. We begin by exploring the historical data to identify trends, seasonality,
and potential outliers, employing visualizations and statistical tests to assess data characteristics.
Subsequently, we implement the ARIMA model, optimizing parameters through auto-correlation
and partial auto-correlation functions, alongside evaluating the model's residuals to ensure
adequacy. Additionally, the STL decomposition method is used to extract seasonal and trend
components, facilitating a clearer understanding of underlying patterns. To enhance forecasting
accuracy, we also leverage ETS models, which adaptively smooth the data, capturing changes in
trends and seasonal effects effectively. Our results highlight significant fluctuations in case
numbers, influenced by various socio-economic factors and public health interventions throughout
the pandemic. The forecasting outcomes provide valuable insights into potential future trends,
aiding local health authorities in decision-making processes regarding resource allocation and
public health measures. This study underscores the importance of continuous monitoring and
adaptive strategies in response to evolving COVID-19 dynamics, contributing to the broader
discourse on pandemic preparedness and response at the community level.
Keywords: COVID-19, time series analysis, forecasting, Coffee County, Tennessee, ARIMA,
STL, Exponential Smoothing, public health, pandemic trends.
Introduction
The COVID-19 pandemic has had unprecedented impacts worldwide, affecting public health
systems, economies, and daily life. Understanding the dynamics of the virus's spread is essential
for effective management and response strategies. This study focuses on Coffee County,
Tennessee, a region that, like many others, has experienced significant challenges due to the
pandemic. By employing time series analysis and forecasting techniques, we aim to provide
insights into COVID-19 trends in this locality, aiding public health officials in decision-making
processes.

Time series analysis is a powerful statistical tool used to analyze data points collected
or recorded at specific time intervals. It allows researchers to identify patterns, trends, and seasonal
variations within the data. In the context of COVID-19, time series analysis can help in
understanding the progression of cases and deaths over time, facilitating the identification of
significant trends that may inform public health interventions.

This study utilizes various
forecasting methods, including AutoRegressive Integrated Moving Average (ARIMA), Seasonal-Trend
decomposition using Loess (STL), and Exponential Smoothing State Space Models (ETS). The
ARIMA model is particularly useful for non-stationary time series data: its integrated (differencing)
component removes trends, while the autoregressive and moving average components capture the
serial dependence and intricate patterns often observed in infectious disease data. Meanwhile, the STL method allows for the
decomposition of time series data into trend, seasonal, and residual components, providing a
clearer understanding of underlying behaviors. In addition to ARIMA and STL, we also employ
ETS models, which adaptively smooth the data and can capture shifts in trends and seasonal effects
more effectively. By applying these techniques to the daily reported cases and deaths of COVID-
19 in Coffee County, we aim to forecast future trends and potential outbreaks, thereby equipping
local health authorities with the necessary information for proactive planning and resource
allocation.

Furthermore, this research emphasizes the importance of localized data analysis in
understanding the pandemic's impact on specific communities. The findings can help guide
targeted public health interventions, ensuring that resources are allocated effectively to mitigate
the effects of COVID-19.
COVID-19 Trends in Coffee County
Understanding the trends of COVID-19 in specific regions is critical for effective public health
responses. In Coffee County, Tennessee, analyzing the trajectory of the virus's spread allows local
health authorities to implement timely interventions. The examination of COVID-19 trends
involves identifying fluctuations in case numbers and mortality rates over time, highlighting peak
periods of infection and potential correlations with public health measures.
Time Series Analysis
Time series analysis is instrumental in studying COVID-19 trends. This
method involves analyzing data collected at regular time intervals to uncover patterns and make
future predictions. For Coffee County, we collect daily reports of COVID-19 cases and deaths to
construct a time series dataset. By employing statistical techniques, we can discern seasonal
variations, long-term trends, and irregularities, offering a comprehensive view of the pandemic's
progression in the region. This analysis is crucial for understanding how the virus has spread,
identifying periods of increased transmission, and recognizing the impact of interventions such as
mask mandates or vaccination campaigns.
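As a concrete illustration, the sketch below assembles such a daily series from a cumulative-case export. It is a minimal example only: the file name coffee_county_covid.csv and the column names DATE and TOTAL_CASES are hypothetical placeholders for the actual Tennessee Department of Health export, and later sketches reuse the resulting series object.

```python
# Minimal sketch: build a daily new-case series from cumulative counts.
# File name and column names ("DATE", "TOTAL_CASES") are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("coffee_county_covid.csv", parse_dates=["DATE"])
series = (
    df.sort_values("DATE")
      .set_index("DATE")["TOTAL_CASES"]
      .diff()                # cumulative totals -> daily new cases
      .clip(lower=0)         # guard against downward data corrections
      .asfreq("D")
      .fillna(0.0)
)

ax = series.plot(alpha=0.4, label="Daily new cases")
series.rolling(7).mean().plot(ax=ax, label="7-day average")
ax.set_title("COVID-19 daily new cases, Coffee County, TN")
ax.legend()
plt.show()
```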
ARIMA Model
One of the primary tools utilized in this analysis is the AutoRegressive Integrated
Moving Average (ARIMA) model. ARIMA is well-suited for forecasting non-stationary time
series data, like that of COVID-19, as it accounts for both autoregressive (AR) components and
moving averages (MA). The integration aspect of ARIMA allows it to handle trends by
differencing the data, making it stationary. By optimizing the parameters of the ARIMA model,
we can effectively capture the underlying patterns of COVID-19 cases in Coffee County, providing
reliable forecasts of future trends.
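A minimal sketch of such a fit using statsmodels is shown below; the (2, 1, 2) order is purely illustrative rather than the order selected for the Coffee County data, and series is the daily case series built in the earlier sketch.

```python
# Illustrative ARIMA fit on the daily case series; order (2, 1, 2) is an
# assumption for the sketch, not the order identified in the study.
from statsmodels.tsa.arima.model import ARIMA

arima_fit = ARIMA(series, order=(2, 1, 2)).fit()   # p=2, d=1, q=2
print(arima_fit.summary())

# 14-day-ahead point forecast
print(arima_fit.forecast(steps=14))
```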
Seasonal Decomposition
In conjunction with the ARIMA model, Seasonal-Trend decomposition using Loess
(STL) is employed to dissect the COVID-19 data into its fundamental components:
trend, seasonality, and residuals. This decomposition enables us to visualize the underlying
patterns more clearly and understand the seasonal fluctuations that may influence the spread of the
virus. For instance, certain periods may exhibit higher transmission rates due to seasonal behaviors
or holiday gatherings, which can be crucial for public health planning.
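The sketch below applies statsmodels' STL implementation to the same daily series; the 7-day period reflects an assumed weekly reporting cycle rather than a setting confirmed by the study.

```python
# Illustrative STL decomposition of the daily case series.
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import STL

stl_result = STL(series, period=7, robust=True).fit()  # robust loess downweights outliers
fig = stl_result.plot()        # panels: observed, trend, seasonal, residual
fig.set_size_inches(8, 6)
plt.show()
```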
Forecasting Techniques
Forecasting COVID-19 trends is essential for anticipating future
outbreaks and preparing health resources accordingly. By leveraging both ARIMA and STL, we
can generate forecasts that help inform local public health officials about potential surges in cases.
This proactive approach is vital for managing hospital capacities, implementing timely
interventions, and protecting vulnerable populations within Coffee County. By utilizing ARIMA
and STL methods, we can provide valuable insights into the progression of the pandemic in Coffee
County, guiding authorities in their efforts to mitigate the impact of COVID-19 on the community.
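One way to couple the two methods, sketched below under the same illustrative settings, is statsmodels' STLForecast, which fits ARIMA to the seasonally adjusted series and adds the STL seasonal component back onto the forecast.

```python
# Illustrative STL + ARIMA forecast: ARIMA models the seasonally adjusted
# series and STL's seasonal component is re-added to the forecast.
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.forecasting.stl import STLForecast

stlf_fit = STLForecast(series, ARIMA,
                       model_kwargs={"order": (2, 1, 2)},  # illustrative order
                       period=7).fit()
print(stlf_fit.forecast(14))   # 14-day-ahead point forecast
```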
Forecasting COVID-19 Cases
Accurate forecasting of COVID-19 cases is essential for effective public health response and
resource management. In Coffee County, Tennessee, forecasting models provide crucial insights
into potential future trends, helping health authorities prepare for possible surges in cases. This
section explores the methodologies used for forecasting COVID-19 cases, including the ARIMA
model and Exponential Smoothing State Space Models (ETS).
Utilizing the ARIMA Model
The ARIMA model serves as a foundational tool for forecasting
COVID-19 cases in Coffee County. By analyzing historical case data, the ARIMA model captures
underlying trends and seasonality, enabling accurate predictions of future case numbers. The first
step in using the ARIMA model involves assessing the stationarity of the time series data.
Stationarity is a critical assumption for ARIMA, as the model requires that the statistical properties
of the series remain constant over time. If the data is non-stationary, we employ differencing
techniques to transform it into a stationary series. Once we establish stationarity, we identify
appropriate parameters for the ARIMA model, denoted as (p, d, q), where "p" represents the
autoregressive order, "d" is the degree of differencing, and "q" is the moving average order.
Utilizing the Auto-Correlation Function (ACF) and Partial Auto-Correlation Function (PACF)
plots, we can optimize these parameters for the best fit to the historical data. Following this, we
validate the model by analyzing its residuals to ensure they are white noise, confirming the
adequacy of our forecasts.
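The diagnostic workflow described above can be sketched as follows; the Augmented Dickey-Fuller test, ACF/PACF plots, and Ljung-Box residual test are standard tools, but the lags and the ARIMA order shown are illustrative assumptions rather than the study's fitted values.

```python
# Illustrative diagnostics: ADF stationarity test, ACF/PACF plots for (p, q),
# and a Ljung-Box check that the fitted model's residuals resemble white noise.
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.arima.model import ARIMA

adf_stat, p_value, *_ = adfuller(series)
print(f"ADF p-value: {p_value:.3f}")      # a large p-value suggests differencing

diffed = series.diff().dropna()           # first difference (d = 1)
plot_acf(diffed, lags=30)                 # cut-off / decay pattern guides q
plot_pacf(diffed, lags=30)                # cut-off / decay pattern guides p
plt.show()

arima_fit = ARIMA(series, order=(2, 1, 2)).fit()     # illustrative order
lb = acorr_ljungbox(arima_fit.resid, lags=[7, 14], return_df=True)
print(lb)                                 # large p-values support model adequacy
```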
Exponential Smoothing State Space Models (ETS)
In addition to ARIMA, we also apply
Exponential Smoothing State Space Models (ETS) for forecasting COVID-19 cases. The ETS
methodology is particularly effective in situations where data exhibits trend and seasonality. This
approach assigns exponentially decreasing weights to past observations, allowing recent data
points to have a more significant influence on forecasts. The ETS model adapts to changes in the
underlying data, making it responsive to shifts in case dynamics. The key advantage of ETS models
lies in their simplicity and interpretability. By decomposing the time series into error, trend, and
seasonal components, health officials can better understand the factors driving case fluctuations.
This understanding is critical for implementing targeted public health measures based on predicted
trends.
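A sketch using statsmodels' ETS implementation follows; the additive error, damped additive trend, and weekly additive seasonality are illustrative configuration choices, not the specification fitted in the study.

```python
# Illustrative ETS (error-trend-seasonal state space) fit on the daily series.
from statsmodels.tsa.exponential_smoothing.ets import ETSModel

ets_fit = ETSModel(series.astype(float),
                   error="add", trend="add", damped_trend=True,
                   seasonal="add", seasonal_periods=7).fit(disp=False)
print(ets_fit.summary())
print(ets_fit.forecast(14))    # 14-day-ahead point forecast
```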
Combining Forecasting Approaches
Combining forecasts from both ARIMA and ETS models
can enhance overall accuracy and reliability. By comparing the results from different models, we
can leverage the strengths of each approach to arrive at a more informed prediction of COVID-19
cases in Coffee County. Comparing the models' forecasts against one another also serves as a
cross-check, reducing reliance on any single specification while providing a more comprehensive
view of potential future case trajectories.
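As a simple illustration of such a combination, the sketch below averages the ARIMA and ETS point forecasts from the earlier sketches with equal weights; the equal weighting is an assumption, and the study may instead compare or weight the models differently.

```python
# Illustrative equal-weight ensemble of the ARIMA and ETS point forecasts
# produced in the earlier sketches (arima_fit and ets_fit).
import pandas as pd

horizon = 14
combined = pd.concat(
    {"ARIMA": arima_fit.forecast(steps=horizon),
     "ETS": ets_fit.forecast(horizon)},
    axis=1,
).mean(axis=1)
print(combined)
```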
Impact of Forecasting on Public Health Decisions
The ability to accurately forecast COVID-19 trends is crucial for informed public health decision-
making. In Coffee County, Tennessee, the insights gained from time series analysis and forecasting
models directly influence the strategies implemented by health authorities to manage the pandemic
effectively. This section examines the significant impacts of forecasting on public health decisions,
focusing on resource allocation, intervention timing, and community engagement.
Resource Allocation
Effective resource allocation is paramount during a health crisis. By
leveraging forecasting models, public health officials can anticipate future case surges and allocate
resources accordingly. For instance, if models predict an increase in COVID-19 cases, authorities
can ensure that hospitals are prepared with adequate staffing, medical supplies, and equipment.
Forecasting helps identify potential hotspots within Coffee County, allowing officials to deploy
resources to areas most likely to experience increased transmission. This proactive approach is
essential in preventing healthcare systems from becoming overwhelmed, ultimately ensuring that
all patients receive timely and appropriate care.
Timing of Interventions
The timing of public health interventions is another critical aspect
influenced by forecasting. By understanding potential future trends, health authorities can
implement measures such as mask mandates, social distancing guidelines, or vaccination
campaigns at the most effective moments. For example, if the forecasting indicates a potential rise
in cases during specific seasons or following major public gatherings, officials can act
preemptively to mitigate the spread of the virus. This foresight enables a more strategic approach
to public health interventions, maximizing their effectiveness and minimizing disruptions to daily
life in Coffee County.
Community Engagement and Communication
Forecasting also plays a vital role in community
engagement and communication. Transparent communication of forecasted trends helps build
public trust and encourages compliance with health guidelines. By sharing insights derived from
forecasting models, health authorities can inform the community about potential risks and the
rationale behind certain interventions. This open dialogue fosters a sense of collective
responsibility, as community members are more likely to adhere to public health measures when
they understand the reasoning behind them. Furthermore, engaging the community in discussions
about forecasts can enhance awareness and preparedness, empowering individuals to take
proactive steps to protect their health.
Tailored Public Health Strategies
Finally, the insights gained from forecasting allow for the
development of tailored public health strategies that address the unique needs of Coffee County.
Different communities may experience COVID-19 trends differently based on factors such as
demographics, socioeconomic conditions, and local behaviors. By utilizing localized forecasting
data, health authorities can design interventions that resonate with the specific characteristics of
their populations, enhancing the likelihood of compliance and effectiveness. By enabling informed
resource allocation, timely interventions, and effective community engagement, forecasting
models serve as essential tools for managing the ongoing pandemic. As we continue to navigate
the complexities of COVID-19, the importance of data-driven decision-making cannot be
overstated, emphasizing the need for robust forecasting practices to safeguard public health.
Conclusion
In summary, the time series analysis and forecasting of COVID-19 trends in Coffee County,
Tennessee, highlight the critical role that data-driven approaches play in managing public health
crises. By employing sophisticated statistical models like ARIMA and Exponential Smoothing
State Space Models (ETS), we can gain valuable insights into the progression of the pandemic,
enabling health authorities to anticipate future trends and make informed decisions. The findings
of this study underscore the necessity of localized data analysis, which is essential for
understanding the unique challenges faced by specific communities. Effective forecasting allows
for proactive resource allocation, ensuring that healthcare facilities are equipped to handle
potential surges in cases. By predicting increases in COVID-19 cases, public health officials can
prepare hospitals with the necessary staffing, equipment, and supplies, thereby preventing the
healthcare system from becoming overwhelmed.

Additionally, the timing of public health interventions is
significantly enhanced through accurate forecasting. By understanding when to implement
measures such as mask mandates or vaccination drives, authorities can optimize their strategies,
maximizing their impact on controlling the virus's spread.

Moreover, forecasting serves as a
powerful tool for community engagement. Transparent communication of predicted trends fosters
public trust and encourages adherence to health guidelines. By sharing insights from forecasting
models, health officials can inform the community about the risks associated with COVID-19,
leading to increased compliance and cooperation in public health measures. This collaborative
effort is crucial for effectively combating the pandemic. The insights derived from these models
not only guide resource allocation and intervention timing but also empower communities through
informed decision-making and enhanced engagement. As we continue to navigate the complexities
of the pandemic, the importance of robust forecasting practices will remain essential in
safeguarding public health and ensuring the well-being of communities like Coffee County.
Ultimately, embracing data-driven methodologies will enable us to respond more effectively to
current and future public health challenges, underscoring the necessity for ongoing research and
adaptation in our strategies to combat infectious diseases.