Forecasting Significant Stock Market Price Changes Using
Machine Learning: Extra Trees Classifier Leads
Antonio Pagliaro 1,2,3
1 INAF IASF Palermo, Via Ugo La Malfa, 153, 90146 Palermo, Italy; antonio.pagliaro@inaf.it
2 Istituto Nazionale di Fisica Nucleare, Sezione di Catania, Via Santa Sofia, 64, 95123 Catania, Italy
3 ICSC—Centro Nazionale di Ricerca in HPC, Big Data e Quantum Computing, 40121 Bologna, Italy
Abstract:
Predicting stock market fluctuations is a difficult task due to its intricate and ever-changing
nature. To address this challenge, we propose an approach to minimize forecasting errors by utilizing
a classification-based technique, which is a widely used set of algorithms in the field of machine
learning. Our study focuses on the potential effectiveness of this approach in improving stock market
predictions. Specifically, we introduce a new method to predict stock returns using an Extra Trees
Classifier. Technical indicators are used as inputs to train our model while the target is the percentage
difference between the closing price and the closing price after 10 trading days for 120 companies from
various industries. The 10-day time frame strikes a good balance between accuracy and practicality
for traders, avoiding the low accuracy of short time frames and the impracticality of longer ones. The
Extra Trees Classifier algorithm is ideal for stock market predictions because of its ability to handle
large data sets with a high number of input features and improve model robustness by reducing
overfitting. Our results show that our Extra Trees Classifier model outperforms the more traditional
Random Forest method, achieving an accuracy of 86.1%. These findings suggest that our model can
effectively predict significant price changes in the stock market with high precision. Overall, our
study provides valuable insights into the potential of classification-based techniques in enhancing
stock market predictions.
Keywords: Random Forest; Extra Trees; machine learning; stock price forecasting
1. Introduction
The role of the stock market in today’s economy is invaluable. Investors are constantly
seeking ways to anticipate the prices of their preferred stocks. With precise predictions,
investors stand to gain substantial profits in the stock exchange. However, this task is not a
simple one. The stock market is highly volatile and susceptible to various factors, including
political events, economic conditions, trader attitudes, and behaviors, which can change
abruptly. Additionally, due to its nonlinear and intricate nature, predicting stock prices
remains a significant challenge.
The widely embraced Random Walk hypothesis [1] and the Efficient Market Hypothesis [2] have established the notion that forecasting the stock market is an impractical task.
The Wisdom of Crowds hypothesis proposes that collective opinions may offer accurate
predictions, but it has demonstrated limited efficacy in forecasting stock market returns.
Additionally, the intricate nature of stock market prediction is further complicated by multi-
ple variables and uncertainties that impact stock prices. However, despite these challenges,
certain individuals and institutional investors have managed to surpass the market and
achieve profitable outcomes [3].
The existence of predictable patterns in stock prices has been acknowledged by financial economists and statisticians, leading to controversial claims that investors can earn excess risk-adjusted returns [4]. However, the statistical dependencies that give rise to momentum are exceedingly small, making it difficult for investors to realize excess returns. While
certain anomalies in stock returns exist, they are typically modeled within specific contexts
and cannot be generalized. Trying to guess where stock markets are headed has been a
highly sought-after and tough-to-solve problem for both investors and researchers. Experts
in the field look at stock market trends with the help of sophisticated mathematics, com-
puter science, economics, and other areas of knowledge. Notably, Technical Analysis, Time
Series Forecasting, Machine Learning and Data Mining [5], and modeling and predicting the volatility of stocks using differential equations [6] are methodologies used to predict stock price behavior.
In summary, stock market prediction is an important yet challenging task due to the
complex and dynamic nature of financial markets. While various techniques have been
applied to forecast stock prices and returns, there remains room for improvement in terms
of predictive accuracy. This manuscript explores the potential of using a machine learning
approach, which utilizes data mining techniques, to enhance stock market predictions. This
aligns with the position advocated by Widom [7].
There has been a surge in researching and utilizing machine learning methods to predict stock prices in recent years, sparked by the influential work of Hellstrom and Holmstrom [5]. For a literature review, see [8]. These include artificial neural network models, support vector machines (SVMs) [9], the autoregressive integrated moving average (ARIMA) [10], and adaptive exponential smoothing with multiple regression [11]. More recently, Basak et al. [12] showed that Random Forests and gradient boosted decision trees are an improvement over previous prediction methods.
The main aim of this study is to utilize advanced machine learning models, some of which we have already used in other fields of research [13,14], to develop an accurate prediction
model for short-term stock market returns using an Extra Trees Classifier algorithm. Specif-
ically, we predict if the return after 10 trading days (2 calendar weeks) will be significant
by analyzing previous periods. The key objectives are: (1) to construct an Extra Trees
Classifier model optimized for stock return predictions; (2) to evaluate its performance
against benchmark models such as Random Forests; and (3) to assess its viability as an
improved technique for stock market forecasting compared to traditional methods.
The key hypothesis is that an Extra Trees Classifier model can predict short-term
stock returns more accurately compared to benchmark methods like Random Forests and
traditional regression models.
Prior studies utilizing machine learning for stock prediction have focused predom-
inantly on regression-based models. Classification models have been relatively under-
explored despite their strengths in handling high-dimensional data and reducing over-
fitting. This presents a gap in evaluating classification algorithms like Extra Trees for
stock market prediction. Extra Trees specifically overcomes drawbacks of Random Forests
through faster training and robustness against noise, making it well-suited for complex
financial data.
The paper is structured as follows: Section 2 covers the methodology: training data preprocessing for classification and a description of all technical indicators evaluated. Section 3 provides an in-depth description of the dataset. Section 4 describes the sell/hold/buy strategy. Section 5 describes the models evaluated and the estimation of feature importances. The results are presented in Section 6 and include evaluation metrics, experimental results of simulated sells/buys following the signals from our model, and trading recommendation figures. A discussion follows. Additional information is in Appendix A.
Our contributions include: (1) demonstrating the viability of a classification-based
approach for stock forecasting compared to traditional regression models; (2) introducing a
robust Extra Trees model optimized for stock market data; and (3) providing comparative
evidence on the superior performance of Extra Trees over Random Forest algorithms for
this task.
2. Methodology
The Extra Trees model is constructed using technical indicators derived from historical
price data as input features and actual percentage returns over a 10-day period as targets.
The model hyperparameters are tuned using grid search to optimize predictive performance.
The dataset encompasses stocks from diverse sectors to improve generalizability.
Our method involves a comprehensive approach to data retrieval and preprocessing.
To retrieve the data, we used a script that iterated through each symbol in a list of tickers,
setting the start and end dates and downloading the data. We then calculated all the
relevant technical indicators for each company before writing the data to a file for further
analysis. The data were then preprocessed to remove any non-numeric values.
These indicators serve as a means of predicting future stock behavior and are used as
features for training classifiers. This rigorous approach ensured that our data set was both
accurate and reliable, providing a solid foundation for our subsequent analysis. The follow-
ing section provides a detailed account of the techniques and technical indicators employed.
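As an illustration, the retrieval loop described above might look like the following sketch. The paper does not name its download library, so yfinance is assumed here, the ticker list is a placeholder, and Pandas TA is the indicator library cited in Section 2.2; API details vary across yfinance versions.

```python
import pandas as pd
import pandas_ta as ta  # Pandas TA, the technical analysis library used in this work
import yfinance as yf   # assumed data source; the paper does not specify one

tickers = ["AAPL", "MSFT", "JNJ"]  # placeholder symbols; the study used 120 companies

frames = []
for symbol in tickers:
    # Download daily OHLCV data over the study period
    df = yf.download(symbol, start="1995-01-01", end="2022-12-31")
    df.columns = [c[0] if isinstance(c, tuple) else c for c in df.columns]  # flatten if MultiIndex
    # Compute example indicators; the full feature set is described in Section 2.2
    df.ta.rsi(length=14, append=True)          # appends an RSI(14) column
    df.ta.macd(fast=12, slow=26, append=True)  # appends MACD(12, 26) columns
    # Clean the data of any non-numeric values, as in the preprocessing step
    df = df.select_dtypes("number").dropna()
    df["ticker"] = symbol
    frames.append(df)

# Write the combined data to a file for further analysis
pd.concat(frames).to_csv("training_data.csv")
```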
2.1. Training Data Processing
Data preparation is an important step in a machine learning method. It involves
cleaning, transforming, and selecting the relevant data for the analysis. In this case, the
features are technical parameters and the target is the market direction. These features are
selected based on their relevance to the problem and their ability to provide meaningful
insights. The target is used to train the machine learning model, so it must be defined
clearly. When preparing the data for our model, the focus is on the target variable which
represents the percentage difference in the stock price after a period of 10 trading days, or
equivalently, 2 calendar weeks. In the realm of statistical analysis and machine learning, it
is however necessary to categorize continuous variables into discrete groups to facilitate
analysis. In this particular scenario, the target variable has been sorted into three classes
utilizing pre-established bin boundaries, as explained in Section 4. Section 3 provides an in-depth description of the data set used.
2.2. Technical Indicators
In tree classifiers, feature selection refers to the process of determining which features,
or input variables, are most important for making predictions. Technical indicators are
mathematical calculations that are used to analyze and predict the behavior of financial
markets, including stock markets. They are based on the historical price and/or volume
data of a security and are used to identify trends, market conditions, and potential buy or
sell opportunities.
The technical indicators used as features in our method are the following. It is worth noting that technical indicators do not take into account any fundamental factors that might influence the price of an asset. Where not otherwise specified, the period n is 14. For the computations, we used Pandas TA, a Technical Analysis Library in Python 3 [15].
2.2.1. Momentum Indicators
1.
MACD. The Moving Average Convergence Divergence (MACD) is a technical indica-
tor developed by Gerald Appel [
16
] that is used to measure the strength and direction
of a trend.
The formula to calculate the Moving Average Convergence Divergence (MACD) is:
$$\mathrm{MACD}(x, n, m) = \mathrm{EMA}(x, n) - \mathrm{EMA}(x, m)$$
where $\mathrm{EMA}(x, n)$ is the Exponential Moving Average of $x$ with period $n$; $n$ is the number of periods for the fast moving average; $m$ is the number of periods for the slow moving average. In our case, $m = 26$ and $n = 12$.
2.
RSI. The Relative Strength Index (RSI) is a momentum oscillator that compares the
magnitude of a stock’s recent gains to the magnitude of its recent losses in an attempt
to determine overbought and oversold conditions of an asset.
The RSI is calculated by taking the average gains of an asset over a certain number of
periods, typically 14, and dividing it by the average losses over the same period. This
results in a ratio, which is then converted into a number between 0 and 100. A reading
above 70 is typically considered to indicate an overbought condition, while a reading
below 30 is considered to indicate an oversold condition. The RSI being above 70 is
termed an “overbought condition” because it signals the uptrend is exhausted and
due for a reversal. While the asset may seem profitable in the short-term when RSI
exceeds 70, the indicator is warning that upside momentum is unsustainable after
such a sharp rise without pullbacks. The terminology “overbought” reflects the idea
that the asset’s price has been bought up too heavily, making a trend reversal likely
ahead despite short-term profits. The formula to calculate the Relative Strength Index (RSI) is:
$$\mathrm{RSI} = 100 - \frac{100}{1 + \mathrm{RS}}$$
where
$$\mathrm{RS} = \frac{\text{Average gain of up periods}}{\text{Average loss of down periods}}, \qquad \text{Average gain of up periods} = \frac{\sum_{i=1}^{n} U_i}{n}, \qquad \text{Average loss of down periods} = \frac{\sum_{i=1}^{n} D_i}{n}$$
$U_i$ is the gain of up periods, $D_i$ is the loss of down periods, and $n$ is the number of periods. (A worked pandas sketch of the MACD and RSI formulas appears at the end of this subsection.)
3.
TSI. The True Strength Index (TSI) is used to measure the strength of a trend in a
financial market. TSI is similar to the Relative Strength Index (RSI) but it attempts to
reduce the lag of the RSI by double-smoothing the data.
The TSI is calculated by first taking the rate of change of a security’s closing price
and then applying a double-smoothing process to the resulting value. The double-
smoothing process involves calculating a simple moving average of the rate of change,
and then calculating a second simple moving average of the first moving average.
Traders often use the TSI to identify potential overbought and oversold conditions in
the market, by looking for bullish or bearish divergences between the TSI and the price
action of the financial instrument. Bullish divergences occur when the TSI is making
higher lows while the price is making lower lows, indicating that the price is likely to
rise. Bearish divergences occur when the TSI is making lower highs while the price is
making higher highs, indicating that the price is likely to fall. The TSI is calculated by
taking the difference between the current closing price and the exponential moving
average of the closing price, divided by the exponential moving average of the closing
price, multiplied by 100. This results in an indicator that oscillates around the zero
line, with values above zero indicating bullish momentum and values below zero
indicating bearish momentum. The formula to calculate the TSI is:
$$\mathrm{TSI}_i = 100 \times \frac{\mathrm{Close}_i - \mathrm{EMA}_i}{\mathrm{EMA}_i}, \quad \text{where } \mathrm{EMA}_i = \frac{\mathrm{Close}_i + (n - 1)\,\mathrm{EMA}_{i-1}}{n}$$
where $\mathrm{TSI}_i$ is the True Strength Index value for period $i$; $\mathrm{Close}_i$ is the closing price for period $i$; $\mathrm{EMA}_i$ is the exponential moving average of the closing price for period $i$; $n$ is the number of periods.
4.
SLOPE TSI. The slope of the True Strength Index is computed as the numerical derivative of the True Strength Index. The slope of the TSI is useful to identify whether the TSI is increasing or decreasing.
5.
RVGI. The Relative Vigor Index (RVGI) is used to measure the strength of a financial
instrument’s price action. The RVGI is calculated by first taking the difference between
the current closing price and the previous closing price, and then calculating a moving
average of these differences. The resulting value is then divided by the sum of the
absolute differences between the current and previous closing prices, multiplied
by 100.
The formula to calculate the Relative Vigor Index (RVGI) is:
$$\mathrm{RVGI}_i = \frac{U_i}{U_i + D_i} - \frac{U_{i-1}}{U_{i-1} + D_{i-1}}$$
where $U_i$ is the up-period close; $D_i$ is the down-period close; $i$ is the current period; $i - 1$ is the previous period.
The RVGI indicator is a volatility-adjusted momentum oscillator that oscillates between $-1$ and $1$, with values above 0 indicating bullish momentum and values below 0 indicating bearish momentum.
6.
STC. The Schaff Trend Cycle (STC) is a combination of three different indicators: the
cyclical component of the moving average (MACD), the rate of change (ROC), and a
double-smoothed stochastic oscillator.
The STC indicator uses the cyclical component of the MACD to identify the underlying
trend in the market and the ROC and double-smoothed stochastic oscillator to identify
short-term price movements. The indicator generates a signal when the short-term
price movements diverge from the underlying trend, which can indicate a potential
trend change.
The formula for the STC is not publicly available as it is a proprietary indicator created by Doug Schaff. However, we used the adaptation from the Pandas TA library [15,17].
7.
SLOPE STC. This feature is the slope of the exponential weighted moving average of
the STC value.
8.
Williams %R. The Williams %R [18], also known as the Williams Overbought/Oversold Index, is a momentum oscillator that measures overbought and oversold levels in the stock market. The formula for the Williams %R is given by:
$$\text{Williams }\%R = -100 \times \frac{\text{Highest High} - \text{Close}}{\text{Highest High} - \text{Lowest Low}} \tag{1}$$
where Highest High is the highest high for the period being analyzed, Close is the closing price, and Lowest Low is the lowest low for the period being analyzed. The Williams %R oscillates between 0 and $-100$, with values close to $-100$ indicating oversold conditions and values close to 0 indicating overbought conditions. In the following, Williams %R will be referred to as WILLR.
9.
CFO. The Chande Forecast Oscillator (CFO) is a momentum oscillator that measures
the strength of price trends in the financial markets. It is calculated as the difference
between the sum of recent gains and the sum of recent losses divided by the sum
of all price changes over a given period. The CFO oscillates between positive and
negative values, with positive values indicating a bullish market and negative values
indicating a bearish market.
The formula for the Chande Forecast Oscillator is as follows:
$$\mathrm{CFO} = \frac{\sum_{i=1}^{n}(U_i - D_i)}{\sum_{i=1}^{n}(U_i + D_i)}$$
where $U_i$ is the gain of up periods; $D_i$ is the loss of down periods; $n$ is the number of periods.
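As referenced in the RSI item above, here is a minimal pandas sketch of two of these momentum indicators, implemented directly from the formulas in this subsection; `close` is assumed to be a pandas Series of closing prices. Note that the RSI variant below uses the simple n-period averages given above rather than Wilder's smoothing.

```python
import pandas as pd

def macd(close: pd.Series, n: int = 12, m: int = 26) -> pd.Series:
    """MACD(x, n, m) = EMA(x, n) - EMA(x, m), with fast n = 12 and slow m = 26."""
    fast = close.ewm(span=n, adjust=False).mean()
    slow = close.ewm(span=m, adjust=False).mean()
    return fast - slow

def rsi(close: pd.Series, n: int = 14) -> pd.Series:
    """RSI = 100 - 100 / (1 + RS), with RS = average gain / average loss over n periods."""
    delta = close.diff()
    avg_gain = delta.clip(lower=0).rolling(n).mean()     # average gain of up periods
    avg_loss = (-delta.clip(upper=0)).rolling(n).mean()  # average loss of down periods
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)
```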
2.2.2. Overlap Indicators
1.
SLOPE VWMA. The Volume-Weighted Moving Average (VWMA) is a technical
indicator that combines the traditional moving average with volume data to give
more emphasis to periods of high trading activity. It is similar to a simple moving
average but instead of equal weight to all prices, it gives more weight to periods with
higher trading volumes.
The VWMA is calculated by taking the sum of the product of the closing price and the
volume for each period, divided by the sum of the volume over the same period. This
results in an average price that is more representative of the prices that were actually
traded during the period.
The formula to calculate the Volume Weighted Moving Average (VWMA) is:
$$\mathrm{VWMA} = \frac{\sum_{i=1}^{n} \mathrm{Close}_i \cdot \mathrm{Volume}_i}{\sum_{i=1}^{n} \mathrm{Volume}_i}$$
where $\mathrm{Close}_i$ is the closing price for period $i$; $\mathrm{Volume}_i$ is the trading volume for period $i$; $n$ is the number of periods.
Our feature is the gradient of the VWMA.
2.
FWMA. The Fractal Weighted Moving Average (FWMA) is a variation of the simple
moving average but it gives more weight to fractal patterns in the data. The FWMA
is calculated by taking the sum of the product of the closing price and the weight of
each fractal for each period, divided by the sum of the weights over the same period.
The formula to calculate the Fractal Weighted Moving Average (FWMA) is:
$$\mathrm{FWMA} = \frac{\sum_{i=1}^{n} \mathrm{Close}_i \cdot W_i}{\sum_{i=1}^{n} W_i}$$
where $\mathrm{Close}_i$ is the closing price for period $i$; $W_i$ is the weight of the fractal for period $i$; $n$ is the number of periods.
2.2.3. Trend Indicators
1.
ADX. The Average Directional Index (ADX) is calculated using a combination of three other indicators: the Plus Directional Indicator (+DI), the Minus Directional Indicator (−DI), and the Average True Range (ATR). The +DI and −DI indicators are used to measure the strength of bullish and bearish movements, respectively, while the ATR is used to measure volatility.
They are calculated as follows. First, calculate the True Range (TR) for each period; the True Range is the greatest of the following three values:
$$\mathrm{TR} = \max(\text{high} - \text{low},\ |\text{high} - \text{previous close}|,\ |\text{low} - \text{previous close}|)$$
The Plus Directional Movement measures the upward movement in price. It is calculated as follows:
$$+\mathrm{DM} = \begin{cases} \text{ch} - \text{ph}, & \text{if } (\text{ch} - \text{ph}) > (\text{pl} - \text{cl}) \\ 0, & \text{otherwise} \end{cases}$$
where ch is the current high, ph the previous high, cl the current low, and pl the previous low.
The Minus Directional Movement measures the downward movement in price. It is calculated as follows:
$$-\mathrm{DM} = \begin{cases} \text{pl} - \text{cl}, & \text{if } (\text{pl} - \text{cl}) > (\text{ch} - \text{ph}) \\ 0, & \text{otherwise} \end{cases}$$
The Average True Range (ATR) measures market volatility and is calculated as an
average of the TR values over a specified period.
Now we can calculate the Plus Directional Indicator (+DI) and Minus Directional Indicator (−DI) as follows:
$$+\mathrm{DI} = \frac{\text{14-period EMA of } +\mathrm{DM}}{\mathrm{ATR}}, \qquad -\mathrm{DI} = \frac{\text{14-period EMA of } -\mathrm{DM}}{\mathrm{ATR}}$$
Here, EMA stands for Exponential Moving Average, and "14-period" indicates that a 14-period EMA is typically used for these calculations. The EMA gives more weight to recent data points, which smooths the values and makes the indicators more responsive to recent price movements.
The ADX is then calculated by taking the absolute difference between +DI and −DI, dividing it by their sum, and multiplying by the ATR:
$$\mathrm{ADX} = \frac{|{+\mathrm{DI}} - ({-\mathrm{DI}})|}{{+\mathrm{DI}} + ({-\mathrm{DI}})} \times \mathrm{ATR}$$
where +DI is the Plus Directional Indicator, −DI is the Minus Directional Indicator, and ATR is the Average True Range.
The ADX is a measure of the overall trend strength, combining the strength of both
bullish and bearish movements while taking market volatility into account using the
ATR. It helps traders and analysts identify the strength of a trend and whether it is
worth trading.
2.
AROON. The Aroon is used to measure the strength of a trend and the likelihood of
its continuation, and consists of two lines, the Aroon Up line and the Aroon Down
line. The Aroon Up line measures the strength of the uptrend, while the Aroon Down
line measures the strength of the downtrend.
The Aroon Up line is calculated by taking the number of periods since the highest
high divided by the total number of periods. The Aroon Down line is calculated
by taking the number of periods since the lowest low divided by the total number
of periods.
Values for Aroon Up and Aroon Down oscillate between 0 and 100, with readings
near 100 indicating a strong trend in the direction of the oscillator and readings near 0
indicating a weak trend or no trend at all.
Aroon Up and Aroon Down indicators can be used in combination to identify potential
trend changes. An increasing Aroon Up and a decreasing Aroon Down indicate a
strong uptrend, while a decreasing Aroon Up and an increasing Aroon Down suggest
a strong downtrend. A weak trend or no trend at all is indicated when both Aroon Up
and Aroon Down decrease.
Additionally, a buy signal is generated when Aroon Up crosses Aroon Down from
below, and a sell signal is generated when Aroon Down crosses Aroon Up from above.
The formula for calculating the Aroon Up line is:
$$\text{Aroon Up} = \frac{\text{Days Since Highest High}}{N} \times 100$$
where $N$ is the number of periods; Days Since Highest High $= N -$ Days Since $N$-Period High, and Days Since $N$-Period High is the number of periods that have passed since the highest high within the last $N$ periods, counted from the current period back to the period when the highest high occurred. In this case, $N$ is set to 14.
The formula for calculating the Aroon Down line is:
$$\text{Aroon Down} = \frac{\text{Days Since Lowest Low}}{N} \times 100$$
where $N$ is the number of periods; Days Since Lowest Low $= N -$ Days Since $N$-Period Low, and Days Since $N$-Period Low is the number of periods since the lowest low. Our feature is computed as:
$$\text{Aroon} = \text{Aroon Up} - \text{Aroon Down}$$
2.2.4. Volatility Indicators
1.
Bollinger Bands. Bollinger Bands are used to measure the volatility of a financial
instrument. The indicator consists of three lines: the simple moving average line,
which is the middle band; and an upper and lower band. The upper band is plotted
two standard deviations above the simple moving average, while the lower band is
plotted two standard deviations below the simple moving average.
Bollinger Bands are often used to identify potential overbought and oversold condi-
tions in the market. When the price of a financial instrument moves above the upper
band, it is considered overbought, and when it moves below the lower band, it is con-
sidered oversold. Traders can also use Bollinger Bands to identify potential breakouts
by looking for price action to break above or below the upper or lower bands.
Traders also use Bollinger Bands as a volatility indicator, by using the width between
the upper and lower bands. The wider the bands, the higher the volatility, and the
narrower the bands, the lower the volatility.
The formulas to calculate the three Bollinger Bands are:
$$\text{Middle Band} = \mathrm{SMA}(n)$$
$$\text{Upper Band} = \mathrm{SMA}(n) + k \cdot \sigma(n)$$
$$\text{Lower Band} = \mathrm{SMA}(n) - k \cdot \sigma(n)$$
where $\mathrm{SMA}(n)$ is the Simple Moving Average with period $n$; $\sigma(n)$ is the standard deviation of the price with period $n$; $n$ is the number of periods for the moving average (in our case $n = 14$); and $k$ is the number of standard deviations from the moving average at which the upper and lower bands are plotted (in our case $k = 2$).
In the following, the middle, upper, and lower Bollinger Bands will be referred to as BOLL M, BOLL U, and BOLL L.
2.
RVI. The Relative Volatility Index (RVI) is used to measure the volatility of an asset
relative to its own recent price history. The RVI formula is defined as follows:
$$\mathrm{RVI} = \mathrm{EMA}(|\mathrm{Close}_t - \mathrm{Close}_{t-1}|)$$
where EMA is the Exponential Moving Average, $\mathrm{Close}_t$ is the closing price at time $t$, and $\mathrm{Close}_{t-1}$ is the closing price at time $t - 1$. The RVI ranges from 0 to 100 and is typically used to identify overbought and oversold conditions in the market.
3.
Price from Donchian. The Donchian channel is a moving average of the highest
high and the lowest low prices over a certain period of time and is typically plotted
on a chart as three lines: the upper line represents the highest high price over the
specified time period, the middle line represents the average of the highest high over
the specified time period, and the lower line represents the lowest low price over the
specified time period.
The price of the asset oscillates between these two bands, and when the price breaks
above the upper band, it is considered to be in an uptrend, and when it breaks below
the lower band, it is considered to be in a downtrend.
The two bands are calculated as follows:
$$\text{Upper band} = \max_{i=1}^{n} \mathrm{High}_i, \qquad \text{Lower band} = \min_{i=1}^{n} \mathrm{Low}_i$$
where $\mathrm{High}_i$ is the highest price for period $i$; $\mathrm{Low}_i$ is the lowest price for period $i$; $n$ is the number of periods. In our case, $n = 28$.
Our feature is the distance of the closing price from the mean of the Donchian channel and is computed as follows (see the sketch at the end of this subsection):
$$\mathrm{DC}_t = \frac{\text{Closing Price}_t - \operatorname{Mean}(\text{Upper Band}_t, \text{Lower Band}_t)}{\operatorname{Mean}(\text{Upper Band}_t, \text{Lower Band}_t)}$$
where $\text{Closing Price}_t$ is the closing price at time $t$; $\text{Upper Band}_t$ is the upper band at time $t$; $\text{Lower Band}_t$ is the lower band at time $t$; $\operatorname{Mean}(x, y)$ is the mean of $x$ and $y$.
In the following, this feature will be referred to as PF DONCHIAN.
4.
Slope of Price from Donchian. This feature is the slope of the previous indicator. In
the following, this feature will be referred to as SLOPE PFD.
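As referenced above, here is a minimal sketch of the PF DONCHIAN feature and its slope, following the formulas in this subsection; `high`, `low`, and `close` are assumed to be pandas Series of daily prices.

```python
import numpy as np
import pandas as pd

def pf_donchian(high: pd.Series, low: pd.Series, close: pd.Series, n: int = 28) -> pd.Series:
    """Distance of the close from the mean of the Donchian channel (DC_t)."""
    upper = high.rolling(n).max()  # highest high over the last n periods
    lower = low.rolling(n).min()   # lowest low over the last n periods
    mid = (upper + lower) / 2      # mean of the upper and lower bands
    return (close - mid) / mid

def slope(feature: pd.Series) -> pd.Series:
    """Numerical derivative of a feature series (used for SLOPE PFD)."""
    return pd.Series(np.gradient(feature.to_numpy()), index=feature.index)
```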
2.2.5. Volume Indicators
1.
A/D. Accumulation/Distribution (A/D) measures the buying and selling pressure
of a financial instrument. The A/D indicator is calculated by taking the difference
between the current close and the previous close, multiplied by the volume. If the
current close is above the previous close, the value is positive, indicating that money
is flowing into the stock.
The formula to calculate the Accumulation/Distribution is:
$$\mathrm{AD}_i = \mathrm{AD}_{i-1} + (\mathrm{Close}_i - \mathrm{Close}_{i-1}) \times \mathrm{Volume}_i$$
where $\mathrm{AD}_i$ is the Accumulation/Distribution value for period $i$; $\mathrm{Close}_i$ is the closing price for period $i$; $\mathrm{Close}_{i-1}$ is the closing price for period $i - 1$; $\mathrm{Volume}_i$ is the trading volume for period $i$.
This formula calculates the A/D by taking the difference between the current close and
the previous close, multiplied by the volume, and adding the result to the previous
A/D value. If the current close is above the previous close, the value is positive,
indicating that money is flowing into the stock, and therefore it is considered a
buying pressure. If the current close is below the previous close, the value is negative,
indicating that money is flowing out of the stock, and therefore it is considered a
selling pressure.
2. SLOPE AD. This feature is the slope of the previous indicator.
3.
CMF. The Chaikin Money Flow (CMF) is based on the Accumulation/Distribution
line and is calculated by taking the difference between the high and low prices for
each period, multiplied by the volume, and dividing the sum of these values by the
sum of the volume over the same period. This results in an indicator that oscillates
around the zero line, with values above zero indicating buying pressure and values
below zero indicating selling pressure.
The formula to calculate the Chaikin Money Flow (CMF) is:
$$\mathrm{CMF} = \frac{\sum_{i=1}^{n} [(\mathrm{Close}_i - \mathrm{Low}_i) - (\mathrm{High}_i - \mathrm{Close}_i)] \cdot \mathrm{Volume}_i}{\sum_{i=1}^{n} \mathrm{Volume}_i}$$
where $\mathrm{Close}_i$ is the closing price for period $i$; $\mathrm{Low}_i$ is the lowest price for period $i$; $\mathrm{High}_i$ is the highest price for period $i$; $\mathrm{Volume}_i$ is the trading volume for period $i$; $n$ is the number of periods.
2.2.6. More Indicators
1.
SLOPE A50. The feature SLOPE A50 is computed by applying the Exponentially Weighted Moving Average to the closing prices using a window of 50 days and calculating the gradient.
2. SLOPE A23. Same as the previous feature, with a window of 23 days.
2.3. Exponential Smoothing
For the computation of the features STC, SLOPE STC, SLOPE A50, and SLOPE A23, the
time series historical stock data was first exponentially smoothed. Exponential smoothing
is a time series method that uses a weighted average of past observations. The weights
decrease exponentially as the observations get older, hence the name “exponential smoothing”.
There are several different types of exponential smoothing, including:
Simple exponential smoothing: This method uses a single smoothing factor to give
more weight to recent observations and less weight to older observations.
Holt’s linear exponential smoothing: This method adds a trend component to simple
exponential smoothing, allowing for the prediction of both level and trend in the data.
Holt–Winters exponential smoothing: This method adds a seasonal component to
Holt’s linear exponential smoothing, allowing for the prediction of level, trend, and
seasonality in the data.
The choice of which type of exponential smoothing to use depends on the character-
istics of the time series data. Simple exponential smoothing is best for data with no clear
trend or seasonality. This is our case. Holt’s linear exponential smoothing is best for data
with a clear trend but no seasonality, and Holt–Winters exponential smoothing is best for
data with both a clear trend and seasonality.
The determination of the smoothing factor (alpha) for exponential smoothing involves
minimizing the sum of squared errors between predicted and actual values.
For our scenario, the simple exponential smoothing of a series Y can be calculated
recursively using the following formula:
$$S_t = \alpha Y_t + (1 - \alpha) S_{t-1} \quad \text{for } t \geq 1$$
Here, the smoothing factor $\alpha$, with $0 < \alpha < 1$, controls the level of smoothing. A larger value of $\alpha$ results in less smoothing, and at $\alpha = 1$ the smoothed statistic becomes equal to the observed value. Smoothing eliminates random fluctuations from historical data, thereby making it easier to identify the long-term trend of a stock price. We use $\alpha = 0.9$.
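In pandas, this recursion is available directly: `ewm(alpha=..., adjust=False)` implements exactly $S_t = \alpha Y_t + (1 - \alpha) S_{t-1}$. A minimal sketch, assuming `close` is a Series of closing prices:

```python
import pandas as pd

def smooth(y: pd.Series, alpha: float = 0.9) -> pd.Series:
    """Simple exponential smoothing: S_t = alpha * Y_t + (1 - alpha) * S_{t-1}."""
    return y.ewm(alpha=alpha, adjust=False).mean()

# Example: smooth the closing prices before computing STC or the slope features
# smoothed = smooth(close, alpha=0.9)
```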
3. Data Set
The analysis presented in this paper aims to predict the percentage difference between
the closing price and the closing price after 10 trading days (2 calendar weeks). The data
used for the analysis includes 120 companies (listed in the Appendix A), covering the time
period from 1 January 1995 to the end of 2022. All available data for these companies were
used, from day 1 (or from their IPO, if they were listed after 1 January 1995) until the end of
the aforementioned time period, or until the date in which they were delisted. The sample
of companies was chosen randomly, without any strict criteria regarding their background
or economic impact. The companies in the sample represent a diverse range of industries,
including software, electronics, and pharmaceutical companies, among others.
The raw data used in the analysis includes the date of entry, closing price, and volume,
among other variables. The target variable for the analysis is the percentage difference
between the closing price and the closing price after 10 trading days (2 calendar weeks).
The selection of a suitable time frame for evaluating trading strategies is a critical
component of technical analysis. Short time frames, such as 3 or 5 days, may not allow for
a reliable assessment of a strategy’s accuracy due to the high unpredictability of short-term
price movements (in Basak et al. [12], Random Forest models have been evaluated for
various trading windows, including 3, 5, 10, 15, 30, 60, and 90 days, and have shown an
increase in accuracy with longer windows. Interestingly, accuracies for windows less than
10 days were only slightly greater than 50%). This is because short-term price movements
can be erratic and unpredictable, resulting in low accuracy levels when using these time
frames. On the other hand, longer time frames can provide more significant trends, but
they may not be practical for traders who want to make multiple trades in a month.
Thus, the choice of a 10-day time frame can be considered a good compromise between
accuracy and practicality. This time frame enables traders to evaluate the performance
of their strategy within a reasonable horizon that allows for more predictable price movements while still being practical for traders who wish to make frequent
trades. Additionally, this time frame aligns with the two-week cycle of many economic
indicators and news events that can impact market movements.
It is worth noting that the choice of a time frame for technical analysis can vary
depending on the specific asset being analyzed and the trading strategy being employed.
The code used to retrieve the data involved iterating through each symbol in the list
of tickers, setting the start and end dates, and downloading the data. The downloaded
data were then preprocessed and cleaned of any non-numeric values. Next, all the tech-
nical indicators listed before for each company were computed and written to a file for
further analysis.
All feature values were continuous, with no categorical or ordinal variables included.
Non-linear trends were observed among most features, making tree-based classifiers an
attractive option for analysis.
4. Strategy
In statistical analysis and machine learning, it is common to bin continuous variables
into discrete categories for analysis. In this case, the target variable has been binned into
three classes using predetermined bin edges.
In order to ensure a balanced distribution of data among the bins, the bin edges were
initially chosen based on a balance of preferred closing price percentage differences and a
similar number of data in each bin. This approach helps to avoid bias in the analysis by
ensuring that each bin contains a representative sample of the data. Once the bins were
established, they were pruned so that each bin contained an equal number of samples. This
further improves the fairness of the analysis by ensuring that each bin carries the same
weight in the analysis. The process is known as stratified sampling and is particularly
useful when dealing with imbalanced datasets where the number of samples in each class
is not equal. By using stratified sampling, we can ensure that the resulting model is not
biased towards one particular class and is better able to accurately predict outcomes for
each class.
The first bin includes all values less than −0.03, which corresponds to a loss in closing price after 10 trading days of more than 3%. In this bin, the resulting strategy would be to sell.
The second bin includes all values between −0.03 and 0.04, which corresponds to a price change within the range of −3% to +4%. In this bin, the resulting strategy would be to hold.
The third bin includes all values greater than 0.04, which corresponds to a gain in closing price after 10 trading days of more than +4%. In this bin, the resulting strategy would be to buy.
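A minimal sketch of this binning, assuming the 10-day percentage return for each sample is stored in a pandas Series named `ret10`:

```python
import numpy as np
import pandas as pd

# ret10 is assumed to hold the 10-day percentage return,
# e.g. ret10 = close.shift(-10) / close - 1
bins = [-np.inf, -0.03, 0.04, np.inf]  # predetermined bin edges
labels = ["sell", "hold", "buy"]
target = pd.cut(ret10, bins=bins, labels=labels)

# Prune so that each bin contains an equal number of samples
n_min = target.value_counts().min()
balanced_index = pd.concat(
    [target[target == c].sample(n_min, random_state=0) for c in labels]
).index
```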
5. Prediction Models
5.1. Random Forest
A Random Forest is an ensemble machine learning method for classification and
regression. It consists of a collection of decision trees, where each tree is trained on a random
subset of the data. During prediction, the Random Forest aggregates the predictions of
each individual tree to arrive at the final output. The idea behind this is that by training
multiple trees on different subsets of the data, the overall performance of the model can
be improved by reducing overfitting and increasing the robustness of the predictions.
Additionally, Random Forests also provide a measure of feature importance which can be
used for feature selection.
The concept of a Random Forest was first introduced in the early 2000s by Leo Breiman.
In his paper published in 2001 [19], he described the method as a combination of bagging
(bootstrap aggregating) and random subspace method. In bagging, multiple models are
trained on different subsets of the training data, while in the random subspace method,
only a random subset of the features are considered for each split in the decision tree.
The idea behind Random Forests was to improve the performance of decision trees,
which often suffer from overfitting when trained on large datasets. By averaging the
predictions of multiple decision trees, Random Forests are able to reduce overfitting and
improve the robustness of the predictions.
5.2. Extra Trees
Extra Trees Classifier [20] is an ensemble machine learning algorithm used for classification tasks. It belongs to the Random Forest family of algorithms, which builds multiple
cation tasks. It belongs to the Random Forest family of algorithms, which builds multiple
decision trees and combines their predictions to form a final output. The Extra Trees Clas-
sifier differs from traditional Random Forest in that it uses random thresholds for each
feature, instead of selecting the best split, to form each decision tree. This randomization
process results in a more diverse set of trees and therefore a more robust final prediction.
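A minimal scikit-learn sketch of the classifier, with a hypothetical feature matrix `X` and binned target `y`; the tuned hyperparameters are reported in Section 5.8.

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# X: technical-indicator features; y: sell/hold/buy target (assumed already built)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)

clf = ExtraTreesClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```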
5.3. A Comparison of Models
We built, tuned, and compared thirty machine learning classifier models in order to
identify the best performing ones for our specific cases.
It is worth noting that training and testing a set of models using the full dataset or even larger subsets can significantly increase computational costs and time requirements. We therefore opted for a smaller subset of data to facilitate more efficient experimentation and analysis: although our dataset comprises around 500,000 samples, for the purposes of comparison we used a training set of 80,000 samples and a testing set of 20,000 samples. This reduced set includes all 24 features.
More relevant models considered were:
Bagging Classifier (Bootstrapped Aggregating) is an ensemble machine learning tech-
nique that combines the predictions of multiple models to improve the stability and
accuracy of the prediction. The method involves training multiple models on ran-
domly sampled subsets of the training data, and then averaging the predictions of
each model.
Decision Tree Classifier is a supervised learning algorithm used for classification
problems. It models decisions and decision making by learning from the data to
construct a tree-like model of decisions and their possible consequences. The algorithm
works by recursively splitting the data into subsets based on the feature that results in
the largest information gain and assigning a class label to each leaf node in the tree.
The final class label of a sample is determined by traversing the tree from the root to a
leaf node. Decision trees are simple to understand, interpret, and visualize and can
handle both categorical and numerical data.
NuSVC is a Support Vector Machine (SVM) classifier which uses a nu-parameter to
control the number of support vectors. It works by finding the optimal hyperplane
that separates the data points of different classes with the maximum margin. The nu-
parameter controls the trade-off between margin size and number of support vectors.
NuSVC is useful when dealing with large datasets as it uses a subset of training points,
known as support vectors, in the decision function.
XGB Classifier is an implementation of gradient boosting algorithm for classification
problems. It builds a sequence of decision trees to make the final prediction. The
algorithm works by iteratively adding new trees to the model, with each tree trained to
correct the errors of the previous ones. The XGB Classifier also includes regularization
techniques, such as L1 and L2 regularization, to prevent overfitting.
KNeighbors Classifier is a type of instance-based learning algorithm, which uses a
non-parametric method to classify new data points based on the similarity of their
features to those of the training data. It works by assigning a class label to a new data
point based on the class labels of its k-nearest neighbors in the training set. The value
of k is chosen by the user and determines the number of neighbors to consider when
making a prediction. KNeighbors Classifier is simple to implement and works well
for small datasets with few dimensions.
LGBM Classifier is a gradient boosting framework that uses tree-based learning algo-
rithms. It stands for Light Gradient Boosting Machine, and it is a scalable and efficient
implementation of gradient boosting specifically designed to handle large datasets.
Quadratic Discriminant Analysis (QDA) is a classification method that assumes a Gaussian distribution of the features within each class and estimates the class covariance matrices. QDA uses these covariance matrices to calculate the discriminant function that separates the classes. Unlike Linear Discriminant Analysis (LDA), QDA does not assume equal covariance between the classes and therefore provides a more flexible, quadratic boundary between the classes. It is effective in cases where the class covariance matrices are different and the classes are well-separated.
Results are shown in Table 1 and indicate that the best performing model for our data is the Extra Trees Classifier.
Table 1. A comparison of machine learning methods.
Model Accuracy
Extra Trees Classifier 0.75
Random Forest Classifier 0.73
Bagging Classifier 0.67
Decision Tree Classifier 0.63
NuSVC 0.57
XGB Classifier 0.55
KNeighbors Classifier 0.54
LGBM Classifier 0.52
Quadratic Discriminant Analysis 0.44
5.4. Random Forest vs. Extra Trees
Table 1 provides a comprehensive overview of various machine learning methods.
Random Forest and Extra Trees emerge as the best-performing candidates. Their high
accuracy scores make them standout models, and therefore, they warrant special attention
and further scrutiny.
Extra Trees and Random Forest are both ensemble learning methods that combine
multiple decision trees to improve predictive performance. However, there are some key
differences between the two algorithms.
Extra Trees is computationally faster. Extra Trees randomly selects features and
thresholds for each split, whereas Random Forest selects the best feature and threshold.
This means that Extra Trees requires less computation and can train more quickly.
Another advantage of Extra Trees is that it can reduce overfitting. In Random Forest,
each tree is built using a bootstrap sample of the training data, which can result in correlated
trees that overfit to the training data. Extra Trees, on the other hand, uses a random
subset of the training data and random splits, which can reduce overfitting and improve
generalization performance.
Extra Trees is therefore the better choice for a large dataset with many features, where reduced computation time and lower overfitting are needed.
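A sketch of this comparison, timing both ensembles on the same split (`X_train`, `y_train`, `X_test`, `y_test` as in the sketch above):

```python
import time
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

for Model in (RandomForestClassifier, ExtraTreesClassifier):
    model = Model(n_estimators=200, random_state=0)
    t0 = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - t0
    print(f"{Model.__name__}: accuracy={model.score(X_test, y_test):.3f}, "
          f"fit time={elapsed:.1f}s")
```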
5.5. Feature Importances
The mean decrease in impurity is used to calculate feature importance. This method
assesses the reduction in the impurity criterion achieved from all the splits made by the
trees based on a particular feature, and determines which features are more relevant. Tree-
based algorithms have an in-built feature importance mechanism. However, tree-based
models tend to overestimate the significance of continuous numerical features, as these
features provide more opportunities for the models to split the data in half. To overcome
this, we used the permutation feature importance method. We trained the model on the
training set and obtained the model score on the test set, which was used as the baseline.
Subsequently, we shuffled one feature at a time on the test set and fed it to the model to
obtain a new score. If the feature that was shuffled is significant, the model’s performance
should degrade significantly and the score should drop drastically. On the other hand, if
the feature is not important, the model’s performance should remain unaffected.
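A minimal sketch of this procedure using scikit-learn's built-in routine, with a random control feature appended as in Table 2; `X_train` and `X_test` are assumed to be DataFrames.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.inspection import permutation_importance

# Append a random control feature; informative features should rank above it
rng = np.random.default_rng(0)
X_train["RANDOM"] = rng.random(len(X_train))
X_test["RANDOM"] = rng.random(len(X_test))

clf = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Shuffle one feature at a time on the test set and measure the score drop
result = permutation_importance(clf, X_test, y_test, n_repeats=5, random_state=0)
for name, imp in sorted(zip(X_test.columns, result.importances_mean),
                        key=lambda t: -t[1]):
    print(f"{name:15s} {imp:.4f}")
```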
Our results are summarized in Table 2. A random feature is added for control.
Table 2. Extra Trees feature importances.
Features Importance
PF DONCHIAN 0.0523
BOLL L 0.0489
BOLL M 0.0464
A/D 0.0459
FWMA 0.0457
BOLL U 0.0454
ADX 0.0451
MACD 0.0434
CFO 0.0415
SLOPE A50 0.0413
RVGI 0.0409
CMF 0.0400
TSI 0.0397
AROON 0.0397
SLOPE A23 0.0388
SLOPE PFD 0.0383
STC 0.0379
RVI 0.0369
RSI 0.0364
SLOPE VWMA 0.0364
WILLR 0.0363
SLOPE STC 0.0349
SLOPE TSI 0.0342
SLOPE AD 0.0327
RANDOM 0.0028
5.6. Another Feature Importance Estimation
We explored another method to investigate the accuracy of Extra Trees Classifier in
predicting the target variable using a subset of features, after pruning the variables based
on binning and downsampling. The study used a range of feature combinations with at
least 10 and at most 24 features, and evaluated each model using accuracy scores.
It is important to note that this method carries a high computational cost as it required
iterations on approximately 14 million models. As a result, we reduced the data sample size
to 10,000. We analyzed the occurrence of the number of features and found that 15 features
were the most frequently occurring, indicating that this number may provide an optimal
balance between model complexity and performance. Please refer to the results of Table 3
for more details.
Table 3.
This table summarizes the occurrences of the number of features in the top 100 performing
models.
Number of Features Occurrences
15 30
16 11
23 9
20 8
17 7
Our analysis revealed that the most frequently occurring number of features was
15, suggesting that this number may provide a good balance between model complexity
and performance.
Furthermore, we identified the most commonly occurring features in the top performing models, which are good candidates as the best features for our model. These features are CFO, ADX, PF DONCHIAN, TSI, MACD, BOLL L, BOLL M, BOLL U, A/D, FWMA, SLOPE PFD, SLOPE STC, SLOPE AD, SLOPE A50, and SLOPE A23.
5.7. Selected Features
A total of 24 features were used for the evaluation and comparison based on the
two feature importance methods mentioned before. The first method used permutation
feature importance. The second method used Extra Trees Classifier and identified a set of
15 features that were commonly occurring in the top performing models.
Therefore, it is reasonable to consider the intersection of the important features from
both methods and use a number of 15 features, as suggested by the second method. After
more tests on intersections, we have selected the following 15 features for our model: CFO,
ADX, SLOPE STC, SLOPE PFD, PF DONCHIAN, TSI, SLOPE A50, SLOPE A23, MACD,
BOLL L, BOLL M, BOLL U, A/D, FWMA, and SLOPE AD.
We chose these features because they consistently appeared among the most important
features in both methods of feature selection. Moreover, these features have been shown to
have a high degree of predictive power for our target variable based on prior research and
domain knowledge. Additionally, we believe that these features provide a good balance
between model complexity and performance, and should therefore allow us to build a
robust and accurate model.
5.8. Hyperparameter Tuning
Hyperparameter tuning was conducted through a randomized grid search, which
leverages both the “fit” and “score” methods, focusing on the following hyperparameters:
n_estimators: number of decision trees in the ensemble;
max_features: maximum number of features considered at each split in a decision tree;
max_depth: maximum depth of the decision trees;
min_samples_split: minimum number of samples required to split an internal node;
min_samples_leaf: minimum number of samples required to be at a leaf node;
bootstrap: whether bootstrapping is applied during tree construction. Bootstrapping creates multiple subsets of the training data by randomly sampling with replacement from the original dataset, so that each decision tree is trained on a different subset. When set to True, this yields diverse training subsets for each tree, aiding in reducing overfitting and enhancing generalization; when set to False, each decision tree is trained on the entire dataset.
Our best results are as follows:
n_estimators = 200;
max_features = auto (the algorithm automatically chooses the number of features to consider, based on the square root of the total number of features available in the dataset);
max_depth = 80;
min_samples_split = 2;
min_samples_leaf = 1;
bootstrap = False.
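A sketch of this randomized search with scikit-learn; the candidate distributions below are illustrative, not the grid actually used.

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": [100, 200, 400],
    "max_features": ["sqrt", "log2", None],  # "auto" meant sqrt in older scikit-learn
    "max_depth": [20, 40, 80, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    ExtraTreesClassifier(random_state=0),
    param_distributions,
    n_iter=50,          # number of sampled configurations
    scoring="accuracy",
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)  # leverages the estimator's fit and score methods
print(search.best_params_)
```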
6. Results
6.1. Evaluation Metrics
Accuracy, precision, recall, specificity, balanced accuracy, and F1 score are all evalua-
tion metrics used to assess the performance of a classifier, such as a Random Forest or an Extra Trees Classifier.
Accuracy is the proportion of correctly classified instances out of the total number
of instances. It is defined as (True Positives + True Negatives)/Total. Precision is the
proportion of true positive predictions out of all positive predictions. It is defined as True
Positives/(True Positives + False Positives). Recall (also known as sensitivity) is the propor-
tion of true positive predictions out of all actual positive instances. It is defined as True
Positives/(True Positives + False Negatives). Specificity is the proportion of true negatives
out of all actual negatives instances. It is defined as True negatives/(True negatives +
False positives). Balanced accuracy is the arithmetic mean of sensitivity and specificity,
which is useful when the classes are imbalanced. It is defined as (True Positives/(True
Positives + False Negatives) + True negatives/(True negatives + False positives))/2. The F1
score is the harmonic mean of precision and recall, which is useful when the classes are imbalanced or when both precision and recall are important. It is defined as 2 × (Precision × Recall)/(Precision + Recall).
Here are the formulas:
$$\text{Accuracy} = \frac{TP + TN}{\text{Total}}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall (Sensitivity)} = \frac{TP}{TP + FN}$$
$$\text{Specificity} = \frac{TN}{TN + FP}$$
$$\text{Balanced accuracy} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$$
$$\text{F1 score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
where TP is True Positives, TN is True Negatives, FP is False Positives, and FN is False Negatives.
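These metrics can be computed directly from the model's predictions; here is a sketch with scikit-learn, using the fitted classifier `clf` from the sketches above (for the three-class target, per-class precision, recall, and F1 must be averaged; weighted averaging is assumed here).

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, precision_score, recall_score)

y_pred = clf.predict(X_test)
print("accuracy:         ", accuracy_score(y_test, y_pred))
print("balanced accuracy:", balanced_accuracy_score(y_test, y_pred))
print("precision:        ", precision_score(y_test, y_pred, average="weighted"))
print("recall:           ", recall_score(y_test, y_pred, average="weighted"))
print("F1 score:         ", f1_score(y_test, y_pred, average="weighted"))
```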
The dimensions of the training and testing datasets are as follows: the training set comprises 409,025 samples with 15 features each, and the testing set comprises 72,181 samples with the same 15 features. Our results for the Extra Trees Classifier model are summarized in Table 4. Results for the selected feature importances are summarized in Table 5.
The model achieved a training score of 0.9989 and a testing score of 0.8608. The accuracy, precision, recall, balanced accuracy, and F1 score of the model were all around 0.86. These results indicate that the model was highly accurate in predicting the direction of stock market prices. It is worth noting that Basak [12] reported results for a Random Forest model, which showed an accuracy of around 80% or less (depending on the stock) for a 10-day trading window. In addition, it should be mentioned that while the Random Forest model in Basak [12] has only two values for the target (buy and sell), the Extra Trees Classifier model has three (buy, hold, and sell) and allows for a more nuanced prediction of stock market prices. Therefore, the Extra Trees Classifier model appears to outperform the Random Forest model.
Table 4. Performance Metrics of the Extra Trees Classifier Model.
Results
Training score 0.9989
Testing score 0.8608
Accuracy (test data set) 0.8608
Precision (test data set) 0.8614
Recall (test data set) 0.8608
Balanced Accuracy (test data set) 0.8608
F1 score (test data set) 0.8610
Specificity (test data set) 0.8403
Table 5.
Feature importances. Please note that the ranking of a feature can be strongly influenced
by how it interacts with other features. In the full feature set of Table 2, “PF DONCHIAN” has
favorable interactions that boost its importance or performance. In this table only a subset of features
is considered and the absence of specific interacting features diminishes its ranking.
Features Importances
BOLL L 0.0748
BOLL M 0.0718
ADX 0.0717
BOLL U 0.0716
FWMA 0.0712
TSI 0.0710
MACD 0.0708
PF DONCHIAN 0.0682
A/D 0.0668
SLOPE A50 0.0651
CFO 0.0624
SLOPE A23 0.0624
SLOPE STC 0.0611
SLOPE PFD 0.0574
SLOPE AD 0.0536
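As a hedged sketch, the snippet below shows how indicators like those in Table 5 could be produced with the Pandas TA library acknowledged at the end of this paper, applied here to synthetic OHLCV data. Indicator parameters are library defaults, not necessarily those used in this study, and the closing comment shows the standard scikit-learn attribute from which importances such as those in Table 5 can be read.

```python
import numpy as np
import pandas as pd
import pandas_ta as ta

# Synthetic OHLCV data as a stand-in for real price history (illustrative).
rng = np.random.default_rng(0)
close = 100 + np.cumsum(rng.normal(0, 1, 300))
df = pd.DataFrame({
    "open": close + rng.normal(0, 0.5, 300),
    "high": close + np.abs(rng.normal(0, 1, 300)),
    "low": close - np.abs(rng.normal(0, 1, 300)),
    "close": close,
    "volume": rng.integers(1_000, 10_000, 300).astype(float),
})

# A few of the Table 5 indicators, with library-default parameters.
df.ta.bbands(append=True)  # Bollinger Bands: lower/middle/upper (BOLL L/M/U)
df.ta.adx(append=True)     # Average Directional Index (ADX)
df.ta.fwma(append=True)    # FWMA (weighted moving average)
df.ta.tsi(append=True)     # True Strength Index (TSI)
df.ta.macd(append=True)    # MACD
df.ta.cfo(append=True)     # Chande Forecast Oscillator (CFO)
print(df.columns.tolist())

# After fitting an ExtraTreesClassifier `model` on such features, importances
# like those in Table 5 come from the standard scikit-learn attribute:
#   pd.Series(model.feature_importances_, index=feature_names).sort_values()
```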
6.2. Experimental Results
Simulations were conducted to evaluate the performance of our stock trading model.
The simulations involved investing in each of 15 stocks, namely: ALB, CAT, CSCO, DHI,
DHR, DPZ, ENPH, IDXX, LRCX, MCD, MU, PG, TGT, TWTR, and XOM. These stocks were
not used during the training and testing of the model, and therefore were independent of it.
At the beginning of the simulations, on 18 January 2022, $10,000 was invested in each stock. The simulations ran until 21 February 2023. During this period, the trading signals generated by the model were used to buy and sell the stocks. The value of the portfolio at the end of the simulations was calculated in two ways: the first method calculated the portfolio value based on the trading signals generated by the model, while the second calculated the portfolio value assuming that the stocks were held without any trading activity.
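A simplified sketch of the two valuation methods is shown below. The signal-following logic (go fully long on a buy signal, move to cash on a sell signal) is a plausible reading of the procedure described above, not the exact backtesting code used in the study, and the price and signal arrays are hypothetical.

```python
import numpy as np

def backtest(prices, signals, capital=10_000.0):
    """Value a single-stock portfolio in the two ways described above:
    following the model's signals, and buying and holding. Illustrative only."""
    cash, shares = capital, 0.0
    for price, signal in zip(prices, signals):
        if signal == 1 and cash > 0:       # buy signal: go fully long
            shares, cash = cash / price, 0.0
        elif signal == -1 and shares > 0:  # sell signal: move to cash
            cash, shares = shares * price, 0.0
    trading_value = cash + shares * prices[-1]
    hold_value = capital / prices[0] * prices[-1]
    return trading_value, hold_value

# Hypothetical daily prices and model signals (+1 buy, -1 sell, 0 hold).
prices = np.array([100.0, 104.0, 99.0, 103.0, 108.0])
signals = np.array([1, 0, -1, 1, 0])
trading, hold = backtest(prices, signals)
print(f"Trading strategy: ${trading:,.2f}   Hold strategy: ${hold:,.2f}")
```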
The results of the simulations (see Table 6) showed that the model generated positive returns for twelve stocks and negative returns for the remaining three. Holding gives positive returns in only six cases out of fifteen. Moreover, for every stock considered, the returns from the model were better than those from the hold strategy: the percentage difference between the portfolio value based on the trading signals generated by the model and the portfolio value assuming the stocks were held without any trading activity was positive for all stocks.
The total investment was $150,000. At the end of the simulations, the total portfolio
value based on the trading signals generated by the model was $178,638.89, while the
total portfolio value assuming that the stocks were held without any trading activity was
$157,116.46.
It is worth noting that during this period, the market was turbulent, with the Federal
Reserve increasing interest rates. As a result, the S&P 500 index experienced a decline of
approximately 12.7%, falling from 4577 to 3997.
The results of the simulations suggest that the trading model has the potential to
generate positive returns for some stocks, but not for all stocks. The percentage difference
between the portfolio value based on the trading signals generated by the model and the
portfolio value assuming that the stocks were held without any trading activity provides
an indication of the effectiveness of the trading signals.
Table 6. Stock returns and comparison of the trading strategy with the holding strategy for the period 18 January 2022 to 21 February 2023.

Stock     Invested    Asset (Trading)   Asset (Hold)   Trading Return   Hold Return   % Difference
ALB       $10,000     $10,846.27        $10,692.00     +8.46%           +6.92%        +1.54%
CAT       $10,000     $10,681.18        $10,471.57     +6.81%           +4.72%        +2.09%
CSCO      $10,000     $9102.07          $8319.10       -8.98%           -16.81%       +7.83%
DHI       $10,000     $10,244.50        $9661.74       +2.45%           -3.38%        +5.83%
DHR       $10,000     $11,303.50        $8695.91       +13.04%          -13.04%       +26.08%
DPZ       $10,000     $10,399.20        $7508.25       +3.99%           -24.92%       +28.91%
ENPH      $10,000     $14,904.29        $14,835.93     +49.04%          +48.36%       +0.68%
IDXX      $10,000     $9806.33          $9228.05       -1.94%           -7.72%        +5.02%
LRCX      $10,000     $12,559.28        $7043.93       +25.59%          -29.56%       +55.15%
MCD       $10,000     $10,959.96        $10,470.60     +9.60%           +4.71%        +4.89%
MU        $10,000     $8490.96          $6202.22       -15.09%          -37.98%       +22.89%
PG        $10,000     $11,616.68        $8926.82       +16.17%          -10.73%       +26.90%
TGT       $10,000     $10,158.78        $7555.11       +1.59%           -24.45%       +26.04%
TWTR ¹    $10,000     $16,841.23        $14,396.78     +68.41%          +43.97%       +24.44%
XOM       $10,000     $15,445.01        $15,212.10     +54.45%          +52.12%       +2.15%
Total     $150,000    $178,638.89       $157,116.46    +19.09%          +4.74%        +14.35%

¹ TWTR stock was delisted from the NYSE on 8 November 2022 after Elon Musk bought all the company's outstanding shares for $54.20 per share. Therefore, both strategies were forced to sell on that date at that price.
6.3. Trading Recommendations
Figures 1 and 2 display trading recommendations generated by our model for a selection of equities not used during the training and testing of the model, and therefore independent of it. The model's suggested trading decisions are indicated by colored dots: red dots denote a forecasted drop in prices, while green dots denote a forecasted rise in prices, both within a 10-day interval. The model's predictions serve as a guide for purchase or sell recommendations: the model advocates purchasing when an increase in prices is predicted after 10 days, and advises selling when a decrease in prices is predicted after 10 days.
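A minimal sketch of how such recommendation charts could be drawn with matplotlib follows; the price series and signals are hypothetical placeholders, and only the red/green dot convention is taken from the figures.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical price series and model signals; placeholders, not real output.
rng = np.random.default_rng(1)
prices = 100 + np.cumsum(rng.normal(0, 1, 250))
signals = rng.choice([-1, 0, 1], size=250, p=[0.1, 0.8, 0.1])

days = np.arange(len(prices))
plt.plot(days, prices, color="gray", linewidth=1)
plt.scatter(days[signals == 1], prices[signals == 1],
            color="green", label="forecasted rise (buy)")
plt.scatter(days[signals == -1], prices[signals == -1],
            color="red", label="forecasted drop (sell)")
plt.xlabel("Trading day")
plt.ylabel("Price")
plt.legend()
plt.show()
```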
The trading recommendations produced by our model for a subset of equities utilized
in both the training and testing phases are depicted in Figure 3. It is important to note that
since these equities were part of the model’s training data, the recommendations pertain to
a time period not previously encountered by the model (specifically, the year 2023).
In Figure 4, we present two notable cases concerning two pharmaceutical-sector stocks exhibiting opposite and presumably unpredictable behavior, which our model appears to have predicted. On 13 February 2023, Frequency Therapeutics Inc. shares plummeted by over 75% to reach an all-time low in early trading. This setback followed the regenerative medicine company's announcement of the termination of its primary program after its failure in a Phase 2b study. Notably, our model predicted this drop.
Conversely, on 13 March 2023, Provention Bio’s shares soared following its acquisition
agreement with French pharmaceutical company Sanofi for $2.9 billion, or $25 a share. The
biopharmaceutical company, specializing in autoimmune diseases, experienced a 258%
surge in its stock to $24, up from Friday’s closing price of $6.70. Our model also predicted
this surge.
Figure 1. Trading recommendations for the period March 2022–March 2023 generated by the model for a subset of stocks (ALB, AMGN, CAT, CSCO, DHI, DHR). These stocks were not used during the training and testing of the model. As usual, red dots denote a forecasted drop in prices, while green dots denote a forecasted rise in prices.
Figure 2. Trading recommendations for the period March 2022–March 2023 generated by the model for another subset of stocks (IDXX, LRCX, PG, TGT). These stocks were not used during the training and testing of the model. As usual, red dots denote a forecasted drop in prices, while green dots denote a forecasted rise in prices.
Figure 3. Trading recommendations for the period January to March 2023 generated by the model for a subset of stocks (ADBE, AMD, AMZN, ASML, META, MSFT) used during the training and testing of the model. The recommendations pertain to a time period not previously encountered by the model. As usual, red dots denote a forecasted drop in prices, while green dots denote a forecasted rise in prices.
Figure 4. Left: in early trading on 13 February 2023, the shares of Frequency Therapeutics Inc. experienced a decline of over 75% and reached an all-time low. The setback occurred after the regenerative medicine company announced the termination of its primary program, subsequent to its failure in a Phase 2b study. Interestingly, according to our model this drop was predictable. Right: Provention Bio's shares surged on 13 March 2023 following its agreement to be acquired by French pharmaceutical company Sanofi for $2.9 billion, or $25 a share. The biopharmaceutical company, which specializes in autoimmune diseases, saw its stock rise by 258% to $24, up from Friday's closing price of $6.70. Again, according to our model, this surge was predictable.
7. Discussion
Predicting the direction of stock market prices is a challenging task that has been the subject of much research in finance and economics. One of the most popular approaches is machine learning, and in particular tree-based algorithms.
In this paper, we utilized machine learning models to predict significant fluctuations in asset prices in the stock market. We evaluated various technical indicators to train a set of classifier models, and the performance of the models was evaluated using several metrics. The results showed that our best model is an Extra Trees Classifier that achieved an accuracy of 86.1%, indicating that it outperforms the more classical Random Forest model and can predict significant fluctuations in asset prices.
An Extra Trees Classifier is an ensemble machine learning method that consists of a collection of decision trees. Each tree is grown with randomized split thresholds on randomly selected features, and during prediction the Extra Trees Classifier aggregates the predictions of the individual trees to arrive at the final output. One of the advantages of using an Extra Trees Classifier for stock market predictions is its ability to handle large data sets with a large number of input features. The Extra Trees algorithm was used to predict the direction of stock market prices by training the model on preprocessed historical stock market data. The input features for the model include historical stock closing prices, trading volumes, and various technical indicators. The results showed that the most important feature was BOLL L, with an importance score of 0.0748, followed closely by BOLL M, ADX, BOLL U, FWMA, TSI, and MACD. The least important features were SLOPE AD and SLOPE PFD, with importance scores of 0.0536 and 0.0574, respectively. These results suggest that BOLL L and BOLL M may be the most informative features for predicting the price direction. The output of the model is a prediction of whether the stock market will significantly rise or fall within a 10-trading-day window.
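To make the target construction concrete, the sketch below shows one plausible way to derive the three-class label from the 10-day percentage price change; the ±2% significance threshold is an illustrative assumption, not necessarily the value used in this study.

```python
import numpy as np
import pandas as pd

def make_labels(close: pd.Series, horizon: int = 10, threshold: float = 0.02):
    """Three-class target from the percentage difference between the closing
    price and the closing price `horizon` trading days later. The +/- 2%
    threshold is an illustrative assumption, not the paper's value."""
    future_return = close.shift(-horizon) / close - 1.0
    labels = pd.Series(1, index=close.index)   # 1 = hold
    labels[future_return >= threshold] = 2     # 2 = buy  (significant rise)
    labels[future_return <= -threshold] = 0    # 0 = sell (significant drop)
    # The last `horizon` days have no defined future return; drop them.
    return labels.iloc[:-horizon]

close = pd.Series(100 + np.cumsum(np.random.default_rng(2).normal(0, 1, 60)))
print(make_labels(close).value_counts())
```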
While Extra Trees Classifiers can be useful for making stock market predictions, it is
important to acknowledge that they are not always entirely accurate. The stock market
is a complex and dynamic system that is influenced by various factors, some of which
are difficult to predict. Furthermore, stock market predictions are inherently subject to
volatility and uncertainty, and should be considered as a tool for guidance rather than a
definitive answer. In summary, the Extra Trees Classifier algorithm can be a powerful tool
for predicting stock market directions by training on historical stock market data and using
various input features such as technical indicators and historical stock prices. However, it
is crucial to recognize the limitations of these predictions and use them as a guide.
8. Conclusions
This work introduced an Extra Trees Classifier model optimized for short-term stock market return forecasting. The results demonstrated its effectiveness in achieving a high accuracy of 86.1%, outperforming classical methods such as Random Forests. This highlights the promise of classification-based machine learning techniques for stock prediction as an improvement over prevailing regression approaches. However, while these models can provide useful guidance, their probabilistic forecasts have limitations. The intricate dynamics of financial markets imply inherent uncertainty in price predictions. Therefore, model outputs should be considered predictive rather than definitive. This study contributes an initial exploration into classification algorithms for stock forecasting, opening up avenues for further research into hybrid models and ensemble techniques to enhance market insights.
Funding: This research received no external funding.
Data Availability Statement:
The data that support the findings of this study are available from the
corresponding author upon reasonable request.
Acknowledgments: For the computation of the technical indicators of Section 2, we used Pandas TA—A Technical Analysis Library in Python 3 [15]. The author acknowledges supercomputing resources and support from ICSC—Centro Nazionale di Ricerca in High Performance Computing, Big Data and Quantum Computing and hosting entity, funded by European Union—NextGenerationEU. The author wishes to thank Pierluca Sangiorgi for suggestions, discussions, and for testing the model in the field, the friends from the Shhtonks and B & Mercati groups for enjoying bull and bear markets together, and Matteo "il Don" for daily suggestions.
Conflicts of Interest: During the preparation of this work, the author used ChatGPT (OpenAI) and Claude 2 (Anthropic) to improve readability and language. After using these tools, the author reviewed and edited the content as needed and takes full responsibility for the content of the publication.
Appendix A. List of Stocks Used for Training the Model
AAPL, ABT, ADBE, AKTX, AMD, AMTI, AMZN, ANET, APAM, APPN, ASML, ATHX,
ATOM, ATVI, AVLR, AXTI, BA, BKNG, BRK-B, BTWNW, BYND, CCI, CDNS, CLVS, COUP,
CRWD, CTXR, DBX, DIS, DOCU, EA, EOLS, EQX, ETSY, FB/META, FICO, FLGT, FREQ,
FSLY, FVRR, GOOGL, GSV, HMI, HUIZ, ICD, ILMN, INTC, INTU, ISRG, IZEA, JD, JMIA,
JMP, JNJ, KEYS, KO, KOPN, LODE, LTRN, LYL, MA, MELI, MGNI, MSFT, MTSL, NBIX,
NCTY, NET, NFLX, NVDA, NVTA, OBSV, OKTA, OLED, OPEN, PACB, PHVS, PINS,
PLTR, PODD, PYPL, QCOM, RENN, ROKU, SABR, SE, SEDG, SHOP, SINO, SNAP, SNE,
SNOW, SPCE, SQ, SQM, TAOP, TDOC, TMO, TRIP, TSLA, TSM, TTD, TWLO, TWST, U,
UAMY, UBX, UPST, UUUU, V, VTGN, WATT, WCC, WMT, WRAP, XNET, YNDX, ZBRA,
ZM, ZNGA.
References
1. Malkiel, B.G.; Fama, E.F. Efficient capital markets: A review of theory and empirical work. J. Financ. 1970, 25, 383–417. [CrossRef]
2. Jensen, M.C. Some anomalous evidence regarding market efficiency. J. Financ. Econ. 1978, 6, 95–101. [CrossRef]
3. Avery, C.N.; Chevalier, J.A.; Zeckhauser, R.J. The CAPS prediction system and stock market returns. Rev. Financ. 2016, 20, 1363–1381. [CrossRef]
4. Christoffersen, P.F.; Diebold, F.X. Financial asset returns, direction-of-change forecasting, and volatility dynamics. Manag. Sci. 2006, 52, 1273–1287. [CrossRef]
5. Hellstrom, T.; Holmstromm, K. Predictable Patterns in Stock Returns; Technical Report Series IMa-TOM, 1997-09; 1998. Available online: https://api.semanticscholar.org/CorpusID:150923793 (accessed on 3 November 2023).
6. Saha, S.; Routh, S.; Goswami, B. Modeling Vanilla Option prices: A simulation study by an implicit method. J. Adv. Math. 2014, 6, 834–848.
7. Widom, J. Research problems in data warehousing. In Proceedings of the Fourth International Conference on Information and Knowledge Management, CIKM '95, Baltimore, MD, USA, 29 November–2 December 1995; ACM: New York, NY, USA, 1995; pp. 25–30.
8. Kumbure, M.M.; Lohrmann, C.; Luukka, P.; Porras, J. Machine learning techniques and data for stock market forecasting: A literature review. Expert Syst. Appl. 2022, 197, 116659. [CrossRef]
9. Kara, Y.; Boyacioglu, M.A.; Baykan, C. Predicting direction of stock price index movement using artificial neural networks and support vector machines: The sample of the Istanbul stock exchange. Expert Syst. Appl. 2011, 38, 5311–5319. [CrossRef]
10. Adebiyi, A.A.; Adewumi, A.O.; Ayo, C. Comparison of ARIMA and artificial neural networks models for stock price prediction. J. Appl. Math. 2014, 2014, 614342. [CrossRef]
11. de Faria, E.L.; Albuquerque, M.P.; Gonzalez, J.L.; Cavalcante, J. Predicting the Brazilian stock market through neural networks and adaptive exponential smoothing methods. Expert Syst. Appl. 2009, 36, 12506–12509. [CrossRef]
12. Basak, S.; Kar, S.; Saha, S.; Khaidem, L.; Dey, S.R. Predicting the direction of stock market prices using tree-based classifiers. N. Am. J. Econ. Financ. 2019, 47, 552–567. [CrossRef]
13. Bruno, A.; Pagliaro, A.; La Parola, V. Application of Machine and Deep Learning Methods to the Analysis of IACTs Data. In Intelligent Astrophysics; Zelinka, I., Brescia, M., Baron, D., Eds.; Emergence, Complexity and Computation, Volume 39; Springer: Berlin/Heidelberg, Germany, 2021; pp. 115–136.
14. Pagliaro, A.; Cusumano, G.; La Barbera, A.; La Parola, V.; Lombardi, S. Application of Machine Learning Ensemble Methods to ASTRI Mini-Array Cherenkov Event Reconstruction. Appl. Sci. 2023, 13, 8172. [CrossRef]
15. Twopirllc. Pandas-TA: Technical Analysis Indicators for Pandas. Available online: https://twopirllc.github.io/pandas-ta/ (accessed on 3 November 2023).
16. Appel, G. The MACD Momentum Indicator. Tech. Anal. Stock. Commod. 1985, 3, 84–88.
17. ProRealCode. Schaff Trend Cycle (STC). Available online: https://www.prorealcode.com/prorealtime-indicators/schaff-trend-cycle2/ (accessed on 3 November 2023).
18. Williams, L. How I Made One Million Dollars Last Year Trading Commodities; FutureBooks: Singapore, 1973.
19. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [CrossRef]
20. Geurts, P.; Ernst, D.; Wehenkel, L. Extremely randomized trees. Mach. Learn. 2006, 63, 3–42. [CrossRef]
Disclaimer/Publisher’s Note:
The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.