International Journal of Computer Applications (0975 - 8887)
Volume * - No.*, ——– 2018
Effectiveness of Artificial Intelligence in Stock Market
Prediction Based on Machine Learning
Sohrab Mokhtari
Electrical and Computer Engineering
Florida International University
Miami, USA.
somokhta@fiu.edu
Kang K Yen
Electrical and Computer Engineering
Florida International University
Miami, USA.
yenk@fiu.edu
Jin Liu
Electrical and Computer Engineering
Florida International University
Miami, USA.
jiliu@fiu.edu
ABSTRACT
This paper tries to address the problem of stock market prediction
leveraging artificial intelligence (AI) strategies. The stock market
prediction can be modeled based on two principal analyses called
technical and fundamental. In the technical analysis approach,
the regression machine learning (ML) algorithms are employed
to predict the stock price trend at the end of a business day
based on the historical price data. In contrast, in the fundamental
analysis, the classification ML algorithms are applied to classify the
public sentiment based on news and social media. In the technical
analysis, the historical price data is exploited from Yahoo Finance,
and in fundamental analysis, public tweets on Twitter associated
with the stock market are investigated to assess the impact of
sentiments on the stock market’s forecast. The results show only
moderate performance, implying that with current AI technology
it is too soon to claim that AI can beat the stock markets.
Keywords
Machine learning, time series prediction, technical analysis,
sentiment embedding, financial market.
1. INTRODUCTION
Stock markets have long been an attractive way to grow
capital. With the development of communication technology,
stock markets have become more popular among individual investors
in recent decades. As the number of shareholders
and listed companies grows year by year, many try to
predict a stock market’s future trend. This
is a challenging problem because a multitude of complex factors
affect price changes. Prediction algorithms
such as the Kalman filter [1] and optimization methods such as Nash
equilibrium [2] can be helpful here, but for this specific problem AI can
play a significant role. Many research papers have therefore developed
ML methods to evaluate the prediction power of AI in the stock
markets. The ML algorithms implemented for this purpose
mostly try to find patterns in the data, measure the investment
risk, or predict the future of an investment.
This field’s efforts have led to two central theoretical hypotheses:
the Efficient Market Hypothesis (EMH) and the Adaptive Market
Hypothesis (AMH). The EMH [3] claims that the spot market price
is ultimately a reaction to the aggregate of recently published news.
Since predicting news is impractical, market
prices always follow an unpredictable trend. This hypothesis
implies that there is no possible way to ‘beat the market’.
On the other hand, the AMH [4] tries to reconcile
the evidence-based EMH with the principles of behavioral finance.
Behavioral finance tries to describe the market trend
by psychology-based theories. According to the AMH, investors can
exploit weaknesses in market efficiency to profit from share
trading.
Relying on the AMH statement, there should be possible solutions
to predict the future of market behavior. Considering this fact,
along with the Dow theory [5], leads to the creation of two
basic stock market analysis principles: fundamental and technical.
Fundamental analysis tries to investigate a stock’s intrinsic
value by evaluating related factors such as the balance sheet,
micro-economic indicators, and consumer behavior. Whenever
the stock value computed by this strategy is higher/lower than
the market price, investors are attracted to buy/sell it. On the
other hand, the technical analysis only examines the stock’s
price history and makes the trading decisions based on the
mathematical indicators exploited from the stock price. These
indicators include relative strength index (RSI), moving average
convergence/divergence (MACD), and money flow index (MFI)
[6].
Decades ago, the proposed market analysis was performed by
financial analysts; but through the development of computing
power and artificial intelligence, this process could also be done
by data scientists. Nowadays, the power of ML strategies in
addressing the stock market prediction problem is strengthening
rapidly in both fundamental and technical analyses. In an early
study leveraging ML for stock market prediction, Piotroski et al.
[7] introduced an ML model called F-Score to evaluate companies’
actual share values. Their method was based on nine factors
drawn from a company’s financial reports and divided into three main
categories: profitability, liquidity, and operating efficiency. They
applied the F-Score algorithm to the historical financial reports
of companies in the U.S. stock market over the twenty years
from 1976 to 1996 and presented remarkable outcomes. Some
years later, Mohanram et al. [8] proposed a developed ML
arXiv:2107.01031v1 [q-fin.ST] 30 Jun 2021
algorithm named G-Score to decide on trading stocks. Their
approach was based on fundamental analysis applying financial
reports to evaluate three criteria: profitability, naive extrapolation,
and accounting conservatism. They also demonstrated the adequacy
of their algorithm by back-testing it on the U.S. stock market between
1978 and 2001.
In fundamental analysis, unlike technical
analysis, the data is unstructured and hard to process for
training an ML model. Nevertheless, many studies leveraging
this type of analysis have shown that it can lead to a rational prediction
of the market price. In contrast, analyzing the market with
a technical method requires only historical price data. This
data is structured and directly available to the
public, which has resulted in a far higher volume of research papers
studying stock market prediction based on technical
analysis. In one of the early studies in this field, at the
beginning of the 90s, Kimoto et al. [9] worked on a feed-forward
neural network (NN) algorithm [10] to predict the stock market
using historical financial indicators such as the interest rate
and the foreign exchange rate. Their model was a decision-making
tool that generated signals for buying/selling shares. Although
their model was successful with the buy-and-hold strategy, it
could not predict the selling signal adequately. Therefore, other
ML algorithms were examined to assess the prediction power of
ML using technical data. In [11], Patel et al. implemented four distinct ML
algorithms on this problem: artificial neural network
(ANN), random forest (RF), support vector machine (SVM),
and naive Bayes. Results on a ten-year testbed showed that
the random forest algorithm was the most effective of these
algorithms, especially when the input data is discretized. Moreover,
in a very recent study, Zhong et al. [12] studied a comprehensive
big data analytic procedure applying ML to predict the daily return
in the stock market. They employed both a deep neural network
(DNN) and an ANN to fit and predict from sixty financial input features.
They concluded that the ANN performs better than
the DNN, and that applying principal component analysis (PCA) in the
pre-processing step can improve prediction accuracy.
This paper attempts to investigate the effectiveness of AI,
particularly ML, in addressing stock market prediction. In this
research, both technical and fundamental stock market analyses are
applied to measure ML algorithms’ accuracy in predicting market
trends. Moreover, a multitude of ML algorithms such as logistic
regression, k-nearest neighbor, random forest, decision tree, and
ANN are employed to find the most accurate algorithms to be
selected as a solution to this problem. The data used to generate ML
models are acquired in real-time, and the purpose of this research
is to evaluate the accuracy of a selling/buying/holding signal of a
specific share for investors.
The remainder of the paper is organized as follows: Section 2 is
devoted to the problem description and motivation; Section 3
contains the methodology; Section 4 measures the performance of
ML algorithms on a stock market testbed; Section 5 presents the
conclusion and future work.
2. PROBLEM DESCRIPTION AND MOTIVATION
Since the 90s, early studies attempted to predict the stock markets
leveraging AI strategies. Many research studies are published to
evaluate the performance of AI approaches in the stock market
prediction. Researchers’ enthusiasm in studying the stock market
prediction problem is due to the tremendous daily volume of traded
money in the stock markets.
Generally, analyzing the stock markets is based on two primary
strategies: technical analysis and fundamental analysis. In the first
one, stockholders try to evaluate the stock markets regarding the
historical price data and investigating the generated indicators
exploited from this data, such as the RSI and the MACD. An ML
model can do the same. It can be trained to find a logical pattern
between the financial indicators and the stock’s closing price. This
can lead to a prediction model that estimates the stock price at
the end of a business day. On the other hand, in the fundamental
analysis, stockholders attempt to calculate an actual stock value
based on its owner company’s financial reports, such as the market
cap or the dividends. If the estimated value is higher than
the stock price, the stockholders receive a holding/buying signal, while if
the estimated value is lower than the stock price, they receive a
selling signal. Any change in a company’s
financial report can quickly affect the public sentiment in
the news and on social media. An ML model can investigate news
and social media through the Internet to predict the positive/negative
impact of the fundamental indicators on the stock price, and then provide
the action signal for the stockholders based on the public sentiment.
But the question is how effective these approaches can be
in predicting the market. In other words, “Can AI beat the
stock market?”. This study employs ML algorithms
and evaluates their performance in predicting the stock market
to answer this question. Regression models are employed to
predict the stock closing prices, and classification models
are used to predict the action signal for stockholders. The
following section explains the methodology used for this evaluation.
3. METHODOLOGY
In this study, the stock market’s prediction, leveraging ML tools,
includes four main steps: dataset building, data engineering, model
training, and prediction. This section is devoted to explaining each
of these steps in detail.
3.1 Dataset
The first step of building an ML model is having access to a dataset.
This dataset includes some features that train the ML model. The
training procedure can be done with or without a set of labeled
data called target values. If the training is based on a set of labeled
data, the training procedure is called supervised learning; while,
unsupervised learning does not need any target values and tries to
find the hidden patterns in the training dataset.
In the problem of predicting the stock market, most datasets are
labeled. For instance, the dataset includes some financial indicators
such as RSI and MACD as features and the stock’s closing price
as the target value in the technical analysis approach. The
data associated with technical analysis consists of continuous
numbers in time-series format. On the other
hand, in the fundamental analysis strategy, the features are
statements such as financial reports or investors’ sentiments, and
the target value is the decision-making signal for buying/selling
the stock. In this type of analysis, the data typically consists of
textual inputs such as reports and sentiments.
Fortunately, most of the essential data required for this problem,
such as historical stock prices or public sentiments
in the news, is available online. The data employed in this study was acquired from two
sources: technical analysis data available on Yahoo Finance¹
and sentiment analysis data available on Twitter². The Yahoo
Finance data includes the open, close, mid, high, and low prices and
volume values without missing samples, while the Twitter dataset
contains tweets from the public, comprising news agencies and
individuals, and can have missing samples.
3.2 Data Engineering
The data obtained from these datasets must be
pre-processed before being used in model training. Both
technical analysis and fundamental analysis have several indicators
that are applied in the model training step; the most significant ones are
explained in the following.
3.2.1 Technical Analysis. The historical stock prices are used to
calculate appropriate financial indicators such as simple moving
average (SMA), exponential moving average (EMA), RSI, MACD,
and on-balance-volume (OBV) to build the input features of an ML
training model. These indicators are explained in the following.
SMA. This indicator is the average of the most recent closing prices
of a stock in a particular period. The mathematical calculation of
the SMA is shown as below:
$$\mathrm{SMA}(t, N) = \sum_{k=1}^{N} \frac{CP(t-k)}{N} \tag{1}$$
where $CP$ is the closing price, $N$ indicates the number of days
over which the $CP$ is evaluated, and $k$ indexes the days associated with a
particular $CP$.
EMA. This indicator tracks a stock price the same as the SMA,
but it pays more attention to the recent closing prices by weighting
them. Equation 2 indicates the weighting process of this indicator.
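$$\mathrm{EMA}(t, \Delta) = \left(CP(t) - \mathrm{EMA}(t-1)\right)\Gamma + \mathrm{EMA}(t-1), \qquad \Gamma = \frac{2}{\Delta + 1}, \quad \Delta = \text{time period of the EMA} \tag{2}$$
where $t$ is the present day, $\Delta$ is the number of days, and $\Gamma$ is the
smoothing factor.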
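As a minimal sketch of Equations 1 and 2, the two moving averages can be computed in plain Python as follows. The function names, the sample prices, and the choice to seed the EMA recursion with the first closing price are illustrative assumptions, not details taken from this study.

```python
def sma(closes, n):
    """Simple moving average of the n most recent closing prices (Equation 1)."""
    if n <= 0 or len(closes) < n:
        raise ValueError("need at least n closing prices")
    return sum(closes[-n:]) / n


def ema(closes, period):
    """Exponential moving average series following the recursion in Equation 2.

    Assumption: the recursion is seeded with the first closing price.
    """
    gamma = 2 / (period + 1)  # smoothing factor
    values = [closes[0]]
    for price in closes[1:]:
        previous = values[-1]
        values.append((price - previous) * gamma + previous)
    return values


closes = [10.0, 11.0, 12.0, 13.0, 14.0]  # hypothetical closing prices
print(sma(closes, 3))      # mean of 12, 13, 14
print(ema(closes, 3)[-1])  # latest EMA with gamma = 0.5
```

With a rising series such as this one, the EMA sits closer to the latest price than a long-window SMA would, which reflects the extra weight it gives to recent closes.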
MACD. This indicator compares the short-term and the
long-term trends of a stock price. Equation 3 describes this
indicator as follows:
$$\mathrm{MACD} = \mathrm{EMA}(t, k) - \mathrm{EMA}(t, d) \tag{3}$$
where $k$ and $d$ are the periods of the short-term and long-term trends.
Normally, these values are taken as $k = 12$ and $d = 26$ days.
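The MACD definition can be sketched as the difference of two EMA series. In the snippet below, the helper names and the synthetic price series are hypothetical, the EMA is seeded with the first closing price (an assumption), and the periods are passed explicitly; the call uses the conventional 12- and 26-day windows purely as an example.

```python
def ema_series(closes, period):
    """EMA recursion; seeding with the first close is an assumption."""
    gamma = 2 / (period + 1)
    values = [closes[0]]
    for price in closes[1:]:
        values.append((price - values[-1]) * gamma + values[-1])
    return values


def macd(closes, short_period, long_period):
    """Short-term EMA minus long-term EMA for every day in the series."""
    short_ema = ema_series(closes, short_period)
    long_ema = ema_series(closes, long_period)
    return [s - l for s, l in zip(short_ema, long_ema)]


rising = [100.0 + day for day in range(60)]  # synthetic, steadily rising prices
line = macd(rising, 12, 26)
print(line[-1] > 0)  # the faster EMA stays above the slower one in an uptrend
```

A positive value means the short-term trend is above the long-term trend, which is commonly read as upward momentum.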
OBV. This indicator uses a stock volume flow to show the price
trend and indicates whether this volume is flowing in or out. The
following equation explains the OBV concept:
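$$\mathrm{OBV} = \mathrm{OBV}_{pr} + \begin{cases} \text{volume}, & \text{if } CP > CP_{pr} \\ 0, & \text{if } CP = CP_{pr} \\ -\text{volume}, & \text{if } CP < CP_{pr} \end{cases} \tag{4}$$
where $\mathrm{OBV}_{pr}$ is the previous OBV, volume is the latest trading
volume amount, and $CP_{pr}$ is the previous closing price.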
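The piecewise OBV update can be illustrated with a short running-total loop. This is only a sketch: the function name and the sample data are hypothetical, and starting the running total at zero is an assumption.

```python
def obv(closes, volumes):
    """On-balance volume: add volume on up days, subtract it on down days."""
    values = [0]  # assumption: the running total starts at zero
    for i in range(1, len(closes)):
        if closes[i] > closes[i - 1]:
            step = volumes[i]
        elif closes[i] < closes[i - 1]:
            step = -volumes[i]
        else:
            step = 0
        values.append(values[-1] + step)
    return values


closes = [10, 11, 11, 9]          # hypothetical closing prices
volumes = [100, 200, 150, 300]    # hypothetical daily volumes
print(obv(closes, volumes))       # [0, 200, 200, -100]
```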
RSI. This indicator is measuring the oversold or the overbought
characteristic of a stock. Indeed, it shows the trend of
buying/selling a stock. The RSI is described as:
$$\mathrm{RSI} = 100 - \frac{100}{1 + RS(t)}, \qquad RS(t) = \frac{AvgGain(t)}{AvgLoss(t)} \tag{5}$$
where $RS(t)$ shows the rate of profitability of the stock, $AvgGain(t)$
is the average gained profit of the stock at time $t$, and $AvgLoss(t)$
indicates the average loss on that price.
¹ https://finance.yahoo.com/
² https://twitter.com/
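A minimal sketch of the RSI calculation is given below, using the standard form RSI = 100 - 100 / (1 + RS). The function name, the 14-day default window, the use of simple (unsmoothed) averages, and the convention of returning 100 when there are no losses are all assumptions made for illustration.

```python
def rsi(closes, period=14):
    """RSI over the last `period` price changes, using simple averages."""
    if len(closes) < period + 1:
        raise ValueError("need period + 1 closing prices")
    start = len(closes) - period
    changes = [closes[i] - closes[i - 1] for i in range(start, len(closes))]
    avg_gain = sum(c for c in changes if c > 0) / period
    avg_loss = sum(-c for c in changes if c < 0) / period
    if avg_loss == 0:
        return 100.0  # assumption: no losses in the window maps to the maximum
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)


print(rsi([10, 11, 10, 11, 10], period=4))  # equal gains and losses give 50.0
```

Values near 100 indicate an overbought stock and values near 0 an oversold one.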
3.2.2 Fundamental Analysis. Due to the unstructured nature of
fundamental indicators, extracting data for fundamental analysis
is not easy. However, the development of AI makes it possible
to exploit data from the Internet for this purpose, leading to
a more accurate stock market prediction. This data can be
information related to the financial report of a company or the
sentiment of investors. Indeed, companies’ financial reports
quickly affect public sentiment, which shows up on social
media, particularly Twitter. Thus, one way of evaluating the impact
of fundamental data on market trends is by looking at public tweets.
This strategy is called sentiment analysis of the stock market.
In sentiment analysis, the input data for training a model is
unstructured and is imported into the model as text. The
target of fundamental datasets is a binary value indicating the text’s
positive/negative impact on a specific stock.
The pre-processing step differs with the type of data.
In the technical analysis, because the data is numeric, it is
essential to normalize the data before using it for model
training. Normalization matters when the ML
model must find a pattern across input features: if the features are
not on the same scale, the prediction will not be
accurate. Several functions can be applied to normalize the data,
such as MinMaxScaler, StandardScaler, and RobustScaler. In this
paper, MinMaxScaler is used to scale the data and is described as
below:
$$a^{m}_{\mathrm{scaled}} = \frac{a^{m}_{i} - a_{\min}}{a_{\max} - a_{\min}} \tag{6}$$
where $a^{m}_{i}$ is the $i$th feature (indicator) from the $m$th experiment (time
sample), and $a_{\min}$ and $a_{\max}$ are the minimum and the maximum
values of the feature among the experiments, respectively. Also,
$a^{m}_{\mathrm{scaled}}$ indicates the scaled value for the $i$th feature of the $m$th
experiment.
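The min-max scaling rule can be sketched for a single feature column as follows. This is a plain-Python illustration rather than the library routine used in the study; the function name, the sample values, and the handling of a constant column are assumptions.

```python
def min_max_scale(column):
    """Rescale one feature column to the range [0, 1]."""
    lowest, highest = min(column), max(column)
    if highest == lowest:
        # Assumption: a constant feature carries no information, map it to 0.
        return [0.0 for _ in column]
    return [(value - lowest) / (highest - lowest) for value in column]


print(min_max_scale([2.0, 4.0, 6.0, 10.0]))  # [0.0, 0.25, 0.5, 1.0]
```

In practice the minimum and maximum would be taken from the training data only and reused for the validation and test data, so that no information leaks from the test period.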
On the other hand, in the fundamental analysis, the data is not
numeric. The goal is to investigate the impact of a sentence –that
can be a tweet on Twitter– on public sentiment. Whenever using
non-numerical data in training an ML model, the input data should
be translated into numeric data. Thus, one way to do so is data
labeling.
Feature selection means finding the most valuable features that
lead to a more accurate ML model in less computation time.
These techniques can be classified into filter, wrapper, embedded, and
hybrid methods [13]. In the filter method, the correlation criterion plays
a significant role. Correlation is a measure of the linear relationship
between two or more parameters. In this method, the features showing
the highest correlation with the target are selected to build the model.
Furthermore, to avoid redundant computation, the selected features
should not be highly correlated with each other. For this purpose, the Pearson
correlation technique is one of the most useful methods, and it is
described as below:
$$\mathrm{Corr}(i) = \frac{\mathrm{cov}(a_i, b)}{\sqrt{\mathrm{var}(a_i)\,\mathrm{var}(b)}} \tag{7}$$
where $a_i$ is the $i$th feature, $b$ is the target label, and cov() and var()
represent the covariance and the variance functions, respectively.
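The filter approach described above can be sketched as a two-step rule: rank features by their absolute Pearson correlation with the target, then skip any feature that is highly correlated with one already kept. The thresholds, function names, and toy data below are hypothetical and are not values reported in this study.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation; the 1/n factors of cov and var cancel out."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def select_features(features, target, min_target_corr=0.5, max_pair_corr=0.9):
    """Keep features relevant to the target and not redundant with each other."""
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    selected = []
    for name in ranked:
        if abs(pearson(features[name], target)) < min_target_corr:
            continue  # weakly related to the target
        if all(abs(pearson(features[name], features[kept])) < max_pair_corr
               for kept in selected):
            selected.append(name)
    return selected


target = [1.0, 2.0, 3.0, 4.0, 5.0]              # toy closing prices
features = {
    "sma": [1.1, 2.0, 2.9, 4.2, 5.0],           # tracks the target closely
    "ema": [1.0, 1.9, 3.1, 4.0, 5.2],           # nearly duplicates "sma"
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],         # weakly related to the target
}
chosen = select_features(features, target)
print(chosen)
```

Only one of the two near-duplicate features survives, and the weakly correlated one is dropped.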
The processed data could be employed to train the ML model, as
shown in Fig. 1.
Fig. 1: The framework of model training to predict the stock market.
3.3 Machine Learning Model Training
Many ML algorithms have been employed to predict stock markets
in research studies. Basically, there are two main categories of
models to address this problem: classification models that try to
help the investors in the decision-making process of buying, selling,
or holding stock, and regression models that attempt to predict
stock price movements such as the closing price of a stock. In
research studies, over 90% of the algorithms leveraged in predicting
the stock market are classification models [14]. However, few
studies tried to predict the exact stock prices using the regression
models [15, 16, 17].
Among ML algorithms, the decision tree (DT), support vector
machine (SVM), and artificial neural networks (ANN) are the
most popular ones employed to predict stock markets [18]. In
this study, besides using the ANN, DT, and SVM models, logistic
regression (LR), Gaussian naive Bayes (GNB), Bernoulli Naive
Bayes (BNB), random forest (RF), k-nearest neighbor (KNN),
and XGboost (XGB) are employed for classification strategy;
moreover, linear regression and long short-term memory (LSTM)
are used in regression problems. In the following, these algorithms
are briefly explained.
ANN. The ANN originated from concepts in biology and
consists of processing elements called neurons. The main task of these
inter-connected neurons is to sum the values of the
input parameters, weighted by their specified weights, and then
add a bias. The number of neurons in the input layer equals the
number of input features, and the number of neurons in the output layer
equals the number of target values. In the end, the output values are
calculated after the transfer function is applied.
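The computation performed by a single neuron, a weighted sum plus a bias followed by a transfer function, can be sketched as below. The sigmoid is used as the transfer function purely as an assumption, and the inputs and weights are made-up numbers.

```python
import math


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-z))  # sigmoid transfer function (an assumption)


# 0.5 * 2.0 + (-1.0) * 1.0 + 0.0 = 0, and the sigmoid of 0 is 0.5
print(neuron([0.5, -1.0], [2.0, 1.0], 0.0))
```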
DT. A decision tree has a tree-like structure, where each
branch represents a test outcome, and each leaf indicates a class
label. The structure also includes internal nodes, which represent
a test on a particular attribute. The outcome is a final decision that
assigns the class that best fits the evaluated attributes.
SVM. In the SVM model, examples are mapped to points
in space so that the classes are separated by a margin that is as wide as possible. New
examples are then mapped to the same space and
categorized according to the side of the margin on which they fall.
LR. Logistic Regression algorithm is one of the most suitable
algorithms in regression analysis, especially when the dependent
variable is binary, where a logistic function is leveraged for
modeling.
GNB and BNB. Gaussian Naive Bayes and Bernoulli Naive Bayes
are supervised learning algorithms that are simple but
very functional. Both combine the prior probabilities of the dataset
classes with feature likelihoods to obtain posterior probabilities; Gaussian
Naive Bayes models continuous features with a Gaussian distribution, while Bernoulli Naive Bayes
applies only to data with binary-valued variables.
RF. Random forest algorithm includes a series of decision trees
whose objective is to generate an uncorrelated group of trees whose
prediction is more accurate than any single tree in the group.
KNN. The KNN is a well-known algorithm for classification
problems, in which the labeled training points nearest to an
unclassified point determine its class. Manhattan distance and
Euclidean distance are commonly used in this algorithm to
measure the distance from the unclassified point to its neighbors.
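A minimal KNN classifier with the two distance measures mentioned above might look like the following sketch. The function names, the value of k, and the toy points and labels are hypothetical.

```python
from collections import Counter


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def knn_predict(train_points, train_labels, query, k=3, distance=euclidean):
    """Majority vote among the k training points closest to the query."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: distance(pair[0], query),
    )
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy two-feature samples (for example, two scaled indicators) with labels.
points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (0.9, 0.8), (0.8, 0.9)]
labels = ["sell", "sell", "sell", "buy", "buy"]
print(knn_predict(points, labels, (0.2, 0.2)))
print(knn_predict(points, labels, (0.85, 0.85), k=1, distance=manhattan))
```

Because the vote depends on raw distances, features should be scaled to a common range before KNN is applied.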
XGB. A popular open-source implementation of the gradient boosted
trees algorithm, XGBoost is a supervised learning algorithm that
predicts a target variable by combining the estimates of a set of simpler,
weaker models.
Linear Regression. A subset of supervised learning, Linear
Regression, is basically a first-order prediction, e.g., a line or a
plane that best fits the dataset’s data points. Any new point as the
prediction will be located on that line or plane.
LSTM. Unlike standard feed-forward neural networks, the Long
Short-Term Memory algorithm has feedback connections and is
used in deep learning. This algorithm is widely used for classification
problems and for making predictions based on time-series data.
All the proposed algorithms are used to perform a stock market
prediction, and their performance is compared to evaluate the
sufficiency of ML in this problem. The following subsection
explains the metrics that are applied in the comparison procedure.
3.4 Model Evaluation Metrics
All prediction models require evaluation metrics to
assess their accuracy in the prediction procedure. For ML
algorithms, a multitude of metrics are available to measure the
models’ performance, including the confusion matrix and the receiver
operating characteristic (ROC) curve for classification models, and
R-squared, explained variation, mean absolute percentage error
(MAPE), root mean squared error (RMSE), and mean absolute
error (MAE) for regression [19]. The rest of this subsection is
devoted to explaining these metrics.
3.4.1 Confusion matrix. This measure evaluates the accuracy
of an ML model using a set of data with known targets. Several
other metrics, including sensitivity, specificity, precision, and
F1-score, are derived from this matrix. The sensitivity or recall
is the true positive rate, i.e., the fraction of actual positives that are
predicted as positive, while the specificity
is the true negative rate. The precision is the
fraction of predicted positives that are actually positive. The F1-score
is the harmonic mean of sensitivity and precision. Finally, the
accuracy of the model is the fraction of all samples whose class is predicted
correctly. Figure 2 shows the confusion matrix concept.
Fig. 2: Confusion matrix explanation.
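The metrics derived from the confusion matrix can be sketched for a binary problem as follows. The labels 1 and -1 mirror the sentiment encoding used later in the paper; the function names, the sample predictions, and the convention of returning 0 for an empty denominator are assumptions.

```python
def ratio(numerator, denominator):
    """Safe division; returning 0.0 for an empty denominator is an assumption."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(actual, predicted, positive=1):
    """Confusion-matrix counts and the metrics derived from them."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,                       # sensitivity
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
        "accuracy": ratio(tp + tn, len(pairs)),
    }


actual = [1, 1, 1, 1, -1, -1, -1, -1]       # hypothetical true sentiments
predicted = [1, 1, 1, -1, -1, -1, 1, 1]     # hypothetical model outputs
metrics = classification_metrics(actual, predicted)
print(metrics)
```

Here there are 3 true positives, 1 false negative, 2 true negatives, and 2 false positives, so the accuracy is 5/8.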
3.4.2 ROC, AUC. The receiver operating characteristic (ROC)
curve plots two values against each other: the true-positive and false-positive rates.
The ROC shows the classifiers’ performance over the
whole range of class distributions and error costs. ROC curves are
compared by the area under the curve (AUC) metric. A higher
AUC indicates more accurate predicted outputs [20].
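One way to compute the AUC without drawing the curve uses the fact that it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below follows that pairwise definition; the function name and the sample scores are hypothetical.

```python
def auc(labels, scores, positive=1):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for label, s in zip(labels, scores) if label == positive]
    neg = [s for label, s in zip(labels, scores) if label != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


labels = [1, 1, -1, -1]           # hypothetical true classes
scores = [0.9, 0.4, 0.5, 0.1]     # hypothetical classifier scores
print(auc(labels, scores))        # 3 of the 4 positive/negative pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.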
3.4.3 R-squared ($R^2$). The $R^2$ is a statistical measure indicating
the portion of the variance of a dependent variable that is explained by
an independent variable or variables in a regression model. It is
also known as the coefficient of determination, or the coefficient
of multiple determination for multiple regression. In regression
analysis, a higher $R^2$ means the model explains more of the changes in the
outcome variable. If the R-squared value is less than 0.3, it
is generally considered a very weak effect size; if the R-squared value
is between 0.3 and 0.5, it is generally considered a low
effect size; if the R-squared value is bigger than 0.7, it
is generally considered a strong effect size. The following equation
presents the formula for calculating the $R^2$ metric.
$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \tag{8}$$
where $y_i$ and $\hat{y}_i$ are the $i$th actual and predicted values, respectively,
and $\bar{y}$ is the mean of the actual values.
3.4.4 Explained Variation. The explained variation is used to
measure the discrepancy between a model and actual data. In
other words, it is the part of the total variance that is
explained by factors that are actually present and is not due to
error variance. The explained variation is the sum of the squared
differences between each predicted value and the mean of the actual
values, as shown in Equation (9):
$$EV = \sum_i (\hat{y}_i - \bar{y})^2 \tag{9}$$
where $EV$ is the explained variation, $\hat{y}_i$ is the predicted value,
and $\bar{y}$ indicates the mean of the actual values.
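The R-squared and explained-variation formulas can be checked on a small example with the sketch below. The function names and the sample values are hypothetical.

```python
def r_squared(actual, predicted):
    """1 minus the ratio of residual to total sum of squares."""
    mean = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean) ** 2 for y in actual)
    return 1 - ss_res / ss_tot


def explained_variation(actual, predicted):
    """Sum of squared differences between predictions and the actual mean."""
    mean = sum(actual) / len(actual)
    return sum((p - mean) ** 2 for p in predicted)


actual = [1.0, 2.0, 3.0, 4.0]        # hypothetical actual values
predicted = [1.5, 2.0, 3.0, 3.5]     # hypothetical predictions
print(r_squared(actual, predicted))            # residual SS 0.5, total SS 5.0
print(explained_variation(actual, predicted))  # 1 + 0.25 + 0.25 + 1
```

Note that the explained variation defined this way is not normalized, so its scale depends on the data, whereas R-squared is at most 1.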
3.4.5 MAPE. The MAPE is how far the model’s predictions are
off from their corresponding outputs on average. The MAPE is
asymmetric and reports higher errors if the prediction is more than
the actual value and lower errors when the prediction is less than the
actual value. Equation (10) explains the mathematical formulation
of this metric.
$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \tag{10}$$
where $n$ is the number of experiments, $\hat{y}_i$ is the predicted value,
and $y_i$ is the actual value for the $i$th experiment.
3.4.6 RMSE. The standard deviation of the prediction
errors in an ML model is called RMSE. The prediction error or
residual shows how far the data are from the regression line. Indeed,
RMSE is a measure of how spread out these residuals are [21]. In
other words, it shows how concentrated the data is around the line
of best fit, as shown in Equation (11). A smaller value of this
metric represents a better prediction of the model.
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}} \tag{11}$$
where $n$ is the number of experiments, $\hat{y}_i$ is the predicted value,
and $y_i$ is the actual value for the $i$th experiment.
3.4.7 MAE. The MAE is the mean of the absolute differences between
the target and the predicted variables. Thus, it evaluates the
average magnitude of errors in a set of predictions without
considering their directions. Smaller values of this metric mean
a better prediction model. The following equation presents the
mathematical MAE formula.
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \tag{12}$$
where $n$ is the number of experiments, $\hat{y}_i$ is the predicted value,
and $y_i$ is the actual value for the $i$th experiment.
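The three error metrics, MAPE, RMSE, and MAE, can be sketched together as below. The function names and the sample prices are hypothetical. The MAPE is returned as a fraction, as in the formula; multiplying by 100 gives a percentage.

```python
from math import sqrt


def mape(actual, predicted):
    """Mean absolute percentage error, as a fraction (actual values must be non-zero)."""
    return sum(abs((y - p) / y) for y, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)


actual = [100.0, 200.0, 400.0]       # hypothetical closing prices
predicted = [110.0, 180.0, 400.0]    # hypothetical model outputs
print(mape(actual, predicted))       # (0.1 + 0.1 + 0) / 3
print(rmse(actual, predicted))       # sqrt((100 + 400 + 0) / 3)
print(mae(actual, predicted))        # (10 + 20 + 0) / 3
```

RMSE penalizes the single 20-point miss more heavily than MAE does, which is why it is the larger of the two here.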
Regarding the proposed framework, the performance of ML
algorithms on the prediction of stock markets can be evaluated.
The following section implements ML algorithms on the real-life
problem of the U.S. stock market prediction.
4. RESULTS AND DISCUSSION
This section illustrates the performance of the proposed
methodology on the prediction of stock markets. For this, Python
is used to train the ML models and predict unseen
data. First, the market prediction based on the technical analysis is
evaluated, and then the fundamental analysis is investigated.
(a) Prediction of the linear regression model.
(b) Prediction of the LSTM model.
Fig. 3: AAPL price prediction with the technical analysis approach.
Table 1: Models performance comparison in the technical analysis approach.
Metric               Linear Regression   LSTM
R²                   1.0                 0.99
Explained Variation  1.0                 0.99
MAPE                 1.56                2.99
RMSE                 1.82                3.42
MAE                  1.18                2.3
4.1 Technical Analysis Performance
In this paper, the dataset for building a predictor model based on
technical analysis is obtained from the Yahoo Finance
website. It contains the historical data for a well-known
stock, AAPL, which represents Apple,
over a period of more than ten years from 2010 to 2021.
The dataset includes 60 features such as open, high, and low prices,
the moving average, MACD, and RSI. The target is the close price,
representing the final price of AAPL at the end of a business day.
The features most correlated with the target are selected, and
the redundant features that are highly correlated with each other are
merged. Finally, the data is scaled by the MinMaxScaler function
explained in Section 3.
The dataset is divided into three parts of the training data, validation
data, and testing data to build the ML model. A large portion of
the data is devoted to the training process, and the rest belongs to
validation and testing. In the training process, the algorithm uses
the training data to learn how to predict the target value accessible
to the algorithm. Then, the model evaluates the performance of the
prediction regarding the validation data. Finally, it can predict the
unforeseen target of the testing dataset to compare with the true
target values. In the end, by using the predicted and actual values
of the closing price, the evaluation metrics can be measured. Table
1 shows the comparison of the evaluation metrics. Moreover, Fig. 3
shows the prediction of stock price based on the LR and the LSTM
algorithms.
Table 2: Model performance comparison in the fundamental analysis.
Metric     LR     GNB    BNB    DT     RF     KNN    SVM    XGB    ANN
Precision  0.729  0.636  0.644  0.620  0.727  0.684  0.757  0.710  0.684
Recall     0.727  0.634  0.644  0.620  0.727  0.684  0.755  0.709  0.684
F1-score   0.726  0.632  0.644  0.620  0.727  0.684  0.755  0.709  0.684
Accuracy   0.727  0.634  0.644  0.620  0.727  0.684  0.755  0.709  0.684
AUC        0.73   0.63   0.64   0.62   0.73   0.68   0.76   0.71   0.68
Fig. 4: ROC curves for classification algorithms. (a) ROC curves.
(b) ROC curves from a closer view.
According to Table 1, the LR model is far better at predicting the
AAPL closing price than the LSTM model. Moreover, Fig. 3 plots the
predicted and actual values of the closing price since 2018. The
solid blue line shows the predicted value, and the dashed green line
is the actual one.
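The three-way split and the comparison of predicted and actual closing
prices described in this section can be illustrated with a short
sketch. Several points here are assumptions rather than details given
in the text: the split is taken in chronological order, the 70/15/15
proportions are hypothetical, and MAE and RMSE are used only as common
examples of regression error measures (the metrics actually reported
are those defined in Section 3).

```python
import math


def chronological_split(rows, train_frac=0.7, val_frac=0.15):
    """Split time-ordered rows into train, validation, and test parts.

    Rows are never shuffled, so every validation and test row is later
    in time than every training row. The fractions are assumptions.
    """
    n = len(rows)
    train_end = round(n * train_frac)
    val_end = train_end + round(n * val_frac)
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]


def mae(y_true, y_pred):
    """Mean absolute error between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))
```

Keeping the rows in time order matters for price data: a shuffled
split would let the model train on days that come after the ones it is
later tested on.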
4.2 Fundamental Analysis Performance
In this paper, a set of public tweets associated with Apple,
available at [22], is employed as the required dataset. In
this case, the features are the texts of tweets on Twitter, and the
target is a binary sentiment value. If the content of the tweet
has a positive impact on the stock market, the sentiment value
is 1, while a negative impact gives a value of -1 to
the sentiment. Then, the impacts of tweets on the specific stock
are evaluated. Finally, the performance of ML in the prediction of
buying/selling/holding signals is investigated.
The dataset includes nearly 6000 tweets. The pre-processing consists
of labeling the target values and applying principal
component analysis (PCA) [23] to reduce the dimension of features
that show a high correlation. Then, the algorithms proposed in Section
3 are used to classify each tweet as negative or
positive sentiment. Based on the evaluation metrics explained in the
previous section, the performance of the ML algorithms is compared
in Table 2. This table indicates that, in this paper, the
prediction of the public sentiment using ML algorithms does not
show promising results. The most accurate algorithm is the SVM,
with an accuracy of 76% (0.755 in Table 2). Moreover, the
performance of these algorithms is illustrated in Fig. 4, which
compares the ROC curves and also shows the AUC for each algorithm.
In this figure, the SVM algorithm has the best AUC score.
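For reference, the metrics named in Table 2 can be computed from
predicted labels and scores as in the sketch below. It is only an
illustration under stated assumptions: the function names are
hypothetical, the labels follow the 1 / -1 convention of this section,
and precision, recall, and F1 are computed for the positive class
only, whereas the values in Table 2 may have been averaged over both
classes. The AUC is obtained from its rank interpretation, the
probability that a randomly chosen positive tweet receives a higher
score than a randomly chosen negative one.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Precision, recall, F1, and accuracy for labels in {1, -1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = (tp + tn) / len(pairs)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


def roc_auc(y_true, scores, positive=1):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts the fraction of (positive, negative) pairs in which the
    positive example has the higher score; ties count as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute AUC")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect
ranking, which gives a scale for reading the AUC row of Table 2.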
5. CONCLUSION
This study tries to address the problem of stock market prediction
leveraging ML algorithms. To do so, two main categories of stock
market analysis (technical and fundamental) are considered. The
performance of ML algorithms on the forecast of the stock market
is investigated based on both of these categories. For this, labeled
datasets are used to train the supervised learning algorithms, and
evaluation metrics are employed to examine the accuracy of ML
algorithms in the prediction process. The results show that, in the
technical analysis, the linear regression model predicts the closing
price remarkably well, with a low error value. Moreover, in the
fundamental analysis, the SVM model can predict public sentiment
with an accuracy of 76%. These results imply that although AI
can predict the stock price trends or public sentiment about the
stock markets, its accuracy is not good enough. Furthermore, while
the linear regression can predict the closing price within a
reasonable range of error, it cannot precisely predict the same value
for the next business day. Thus, this model is not sufficient for
long-term investments. On the other hand, the accuracy of classification
algorithms in predicting whether to buy, sell, or hold a stock is not
satisfactory and can result in loss of capital.
Nevertheless, many research studies on this topic leverage
a hybrid model that employs both the technical analysis and the
fundamental analysis in one ML model to compensate for the
individual algorithms’ downsides. This could increase the accuracy
of the prediction process, which makes it an exciting topic for future
studies. Based on this study, it seems that AI is not yet close to
predicting the stock market with reliable accuracy. In the
future, with AI development and especially greater computation power,
a more precise model of stock market prediction may become available.
Still, so far, there is no reputable model that can beat the stock
market.
References
[1] Sohrab Mokhtari and Kang K Yen. “A Novel Bilateral
Fuzzy Adaptive Unscented Kalman Filter and its
Implementation to Nonlinear Systems with Additive
Noise”. In: 2020 IEEE Industry Applications Society
Annual Meeting. IEEE. 2020, pp. 1–6.
[2] Sohrab Mokhtari and Kang K Yen. “Impact of large-scale
wind power penetration on incentive of individual investors,
a supply function equilibrium approach”. In: Electric Power
Systems Research 194 (2021), p. 107014.
[3] Eugene F Fama. “Efficient capital markets: II”. In: The
journal of finance 46.5 (1991), pp. 1575–1617.
[4] Andrew W Lo. “The adaptive markets hypothesis”. In: The
Journal of Portfolio Management 30.5 (2004), pp. 15–29.
[5] From Charles D Kirkpatrick II and R Julie. “Dow Theory”.
In: CMT Level I 2019: An Introduction to Technical
Analysis (2019), p. 15.
[6] Robert D Edwards, WHC Bassetti, and John Magee.
Technical analysis of stock trends. CRC press, 2007.
[7] Joseph D Piotroski. “Value investing: The use of historical
financial statement information to separate winners from
losers”. In: Journal of Accounting Research (2000),
pp. 1–41.
[8] Partha S Mohanram. “Separating winners from losers
among low book-to-market stocks using financial statement
analysis”. In: Review of accounting studies 10.2-3 (2005),
pp. 133–170.
[9] Takashi Kimoto et al. “Stock market prediction system with
modular neural networks”. In: 1990 IJCNN international
joint conference on neural networks. IEEE. 1990, pp. 1–6.
[10] Alireza Abbaspour et al. “A Survey on Active
Fault-Tolerant Control Systems”. In: Electronics 9.9
(2020). ISSN: 2079-9292.
[11] Jigar Patel et al. “Predicting stock and stock price index
movement using trend deterministic data preparation and
machine learning techniques”. In: Expert systems with
applications 42.1 (2015), pp. 259–268.
[12] Xiao Zhong and David Enke. “Predicting the daily return
direction of the stock market using hybrid machine learning
algorithms”. In: Financial Innovation 5.1 (2019), p. 4.
[13] Girish Chandrashekar and Ferat Sahin. “A survey on
feature selection methods”. In: Computers & Electrical
Engineering 40.1 (2014), pp. 16–28.
[14] D Van Thanh, HN Minh, and DD Hieu. “Building
unconditional forecast model of Stock Market Indexes
using combined leading indicators and principal
components: application to Vietnamese Stock Market”. In:
Indian Journal of Science and Technology 11 (2018).
[15] Ahmad Kazem et al. “Support vector regression with
chaos-based firefly algorithm for stock market price
forecasting”. In: Applied soft computing 13.2 (2013),
pp. 947–958.
[16] Haiqin Yang, Laiwan Chan, and Irwin King. “Support
vector machine regression for volatile stock market
prediction”. In: International Conference on Intelligent
Data Engineering and Automated Learning. Springer.
2002, pp. 391–396.
[17] Riswan Efendi, Nureize Arbaiy, and Mustafa Mat Deris. “A
new procedure in stock market forecasting based on fuzzy
random auto-regression time series model”. In: Information
Sciences 441 (2018), pp. 113–132.
[18] Isaac Kofi Nti, Adebayo Felix Adekoya, and
Benjamin Asubam Weyori. “A systematic review of
fundamental and technical analysis of stock market
predictions”. In: Artificial Intelligence Review (2019),
pp. 1–51.
[19] Sohrab Mokhtari et al. “A Machine Learning Approach for
Anomaly Detection in Industrial Control Systems Based
on Measurement Data”. In: Electronics 10.4 (2021). ISSN:
2079-9292.
[20] Caren Marzban. “The ROC curve and the area under it as
performance measures”. In: Weather and Forecasting 19.6
(2004), pp. 1106–1114.
[21] Mohammad Abedin, Sohrab Mokhtari, and Armin B
Mehrabi. “Bridge damage detection using machine learning
algorithms”. In: Health Monitoring of Structural and
Biological Systems XV. Vol. 11593. International Society
for Optics and Photonics. 2021, 115932P.
[22] Yash Chaudhary. Stock-Market Sentiment Dataset. 2020.
DOI: 10.34740/KAGGLE/DSV/1217821.
[23] Tom Howley et al. “The effect of principal component
analysis on machine learning accuracy with high
dimensional spectral data”. In: International Conference
on Innovative Techniques and Applications of Artificial
Intelligence. Springer. 2005, pp. 209–222.