Available via license: CC BY 4.0

International Journal of Computer Applications (0975 - 8887)

Volume * - No.*, ——– 2018

Effectiveness of Artificial Intelligence in Stock Market Prediction Based on Machine Learning

Sohrab Mokhtari

Electrical and Computer Engineering

Florida International University

Miami, USA.

somokhta@fiu.edu

Kang K Yen

Electrical and Computer Engineering

Florida International University

Miami, USA.

yenk@fiu.edu

Jin Liu

Electrical and Computer Engineering

Florida International University

Miami, USA.

jiliu@fiu.edu

ABSTRACT

This paper addresses the problem of stock market prediction using artificial intelligence (AI) strategies. Stock market prediction can be modeled on two principal analyses: technical and fundamental. In the technical analysis approach, regression machine learning (ML) algorithms are employed to predict the stock price at the end of a business day from historical price data. In the fundamental analysis approach, by contrast, classification ML algorithms are applied to classify public sentiment based on news and social media. For the technical analysis, historical price data is obtained from Yahoo Finance; for the fundamental analysis, public tweets on Twitter associated with the stock market are investigated to assess the impact of sentiment on the market forecast. The results show a middling performance, implying that with current AI technology it is too soon to claim that AI can beat the stock markets.

Keywords

Machine learning, time series prediction, technical analysis,

sentiment embedding, ﬁnancial market.

1. INTRODUCTION

Stock markets have always been an attractive way to grow capital. With the development of communication technology, they have become increasingly popular among individual investors in recent decades. As the number of shareholders and listed companies grows year by year, many try to find a way to predict the market's future trend. This is a challenging problem, since a multitude of complex factors affect price changes. Prediction algorithms such as the Kalman filter [1] and optimization methods such as Nash equilibrium [2] can be helpful here, but AI can play a significant role in this specific problem. Accordingly, many research papers have developed ML methods to evaluate the prediction power of AI in the stock markets. The ML algorithms implemented for this purpose mostly try to identify patterns in the data, measure the investment risk, or predict the future of an investment.

This field's efforts have led to two central theoretical hypotheses: the Efficient Market Hypothesis (EMH) and the Adaptive Market Hypothesis (AMH). The EMH [3] claims that the spot market price is ultimately a reaction to the aggregate of recently published news. Since news cannot practically be predicted, market prices always follow an unpredictable trend. This hypothesis implies that there is no possible way to 'beat the market.' On the other hand, the AMH [4] tries to reconcile the evidence-based EMH with the principles of behavioral finance, which describes the market trend through psychology-based theories. According to the AMH, investors can exploit weaknesses in market efficiency to profit from share trading.

If the AMH holds, it should be possible to predict future market behavior. This consideration, along with the Dow theory [5], led to two basic principles of stock market analysis: fundamental and technical. Fundamental analysis investigates a stock's intrinsic value by evaluating related factors such as the balance sheet, micro-economic indicators, and consumer behavior. Whenever the stock value computed by this strategy is higher (lower) than the market price, investors are attracted to buy (sell) it. Technical analysis, on the other hand, examines only the stock's price history and bases trading decisions on mathematical indicators derived from the stock price. These indicators include the relative strength index (RSI), moving average convergence/divergence (MACD), and money flow index (MFI) [6].

Decades ago, such market analysis was performed by financial analysts; with the development of computing power and artificial intelligence, it can also be done by data scientists. Nowadays, the power of ML strategies in addressing the stock market prediction problem is growing rapidly in both fundamental and technical analyses. In an early

study leveraging ML for stock market prediction, Piotroski et al.

[7] introduced an ML model called F-Score to evaluate companies’

actual share values. Their method was based on nine factors extracted from a company's financial reports and divided into three main categories: profitability, liquidity, and operating efficiency. They applied the F-Score algorithm to the historical financial reports of companies in the U.S. stock market over the twenty years from 1976 to 1996 and presented remarkable outcomes. Some

years later, Mohanram et al. [8] proposed a developed ML

arXiv:2107.01031v1 [q-fin.ST] 30 Jun 2021

algorithm named G-Score to decide on trading stocks. Their approach was based on fundamental analysis, applying financial reports to evaluate three criteria: profitability, naive extrapolation, and accounting conservatism. They also demonstrated the adequacy of their algorithm by back-testing it on the U.S. stock market between 1978 and 2001.

In fundamental analysis, unlike technical analysis, the data is unstructured and hard to process for training an ML model. Nevertheless, many studies using this type of analysis have shown that it can lead to a reasonable prediction of the market price. In contrast, to analyze the market based on

a technical method, only historical price data is required. This

data is a structured type of data that is directly available to the

public. This resulted in a far higher volume of research papers

studying the prediction of the stock market based on the technical

analysis approaches. As one of the early studies in this ﬁeld, at the

beginning of the 90s, Kimoto et al. [9] worked on a feed-forward

neural network (NN) algorithm [10] to predict the stock market

exploiting the historical ﬁnancial indicators such as interest rate,

and foreign exchange rate. Their model was a decision-making

tool in generating the signal of buying/selling shares. Although

their model could be successful in the buy-and-hold strategy, it

could not predict the selling signal adequately. Therefore, other ML algorithms were examined to assess the prediction power of ML using technical data. In [11], Patel et al. applied four distinct ML algorithms to this problem: artificial neural network (ANN), random forest (RF), support vector machine (SVM), and naive Bayes. Results over a ten-year test period showed that the random forest algorithm can be more effective than the other algorithms, especially when the input data is discretized. Moreover,

in a very recent study, Zhong et al. [12] studied a comprehensive

big data analytic procedure applying ML to predict a daily return

in the stock market. They employed both deep neural network

(DNN) and ANN to ﬁt and predict sixty ﬁnancial input features

of the model. They concluded that the ANN performs better than the DNN, and that applying principal component analysis (PCA) in the pre-processing step can improve prediction accuracy.

This paper attempts to investigate the effectiveness of AI,

particularly ML, in addressing stock market prediction. In this

research, both technical and fundamental stock market analyses are

applied to measure ML algorithms’ accuracy in predicting market

trends. Moreover, a multitude of ML algorithms such as logistic

regression, k-nearest neighbor, random forest, decision tree, and

ANN are employed to ﬁnd the most accurate algorithms to be

selected as a solution to this problem. The data used to generate ML

models are acquired in real-time, and the purpose of this research

is to evaluate the accuracy of a selling/buying/holding signal of a

speciﬁc share for investors.

The remainder of the paper is organized as follows: Section 2 is devoted to the problem description and motivation; Section 3 presents the methodology; Section 4 measures the performance of ML algorithms on a stock market testbed; Section 5 gives the conclusion and future work.

2. PROBLEM DESCRIPTION AND MOTIVATION

Since the 1990s, studies have attempted to predict the stock markets using AI strategies, and many research studies have been published to evaluate the performance of AI approaches in stock market prediction. Researchers' enthusiasm for this problem is due to the tremendous volume of money traded daily in the stock markets.

Generally, stock market analysis is based on two primary strategies: technical analysis and fundamental analysis. In the first, stockholders evaluate the stock markets using historical price data and the indicators derived from this data, such as the RSI and the MACD. An ML model can do the same. It can be trained to find a logical pattern between the financial indicators and the stock's closing price. This can lead to a prediction model that estimates the stock price at the end of a business day. In fundamental analysis, on the other hand, stockholders attempt to calculate a stock's actual value from the financial reports of the company that issues it, such as the market cap or the dividends. If the estimated value is higher than the stock price, the stockholders receive a holding/buying signal, while if the estimated value is lower than the stock price, they receive a selling signal. Any change in a company's financial report can immediately affect public sentiment in the news and on social media. An ML model can therefore investigate news and social media on the Internet to predict the positive or negative impact of a stock's fundamental indicators on its price, and then provide an action signal for stockholders based on public sentiment.

The question, however, is how effective these approaches can be in predicting the market. In other words, "Can AI beat the stock market?" To answer this question, this study employs ML algorithms and evaluates their performance in predicting the stock market. Regression models are employed to predict the stock closing prices, and classification models are used to predict the action signal for stockholders. The following section explains the methodology for this evaluation.

3. METHODOLOGY

In this study, the stock market’s prediction, leveraging ML tools,

includes four main steps: dataset building, data engineering, model

training, and prediction. This section is devoted to explaining each

of these steps in detail.

3.1 Dataset

The first step in building an ML model is obtaining a dataset. This dataset includes features that are used to train the ML model. Training can be done with or without a set of labeled data called target values. If training is based on labeled data, the procedure is called supervised learning, whereas unsupervised learning does not need any target values and tries to find hidden patterns in the training dataset.

In the problem of predicting the stock market, most datasets are labeled. In the technical analysis approach, for instance, the dataset includes financial indicators such as RSI and MACD as features and the stock's closing price as the target value. The data associated with technical analysis consists of continuous numbers in time-series format. In the fundamental analysis strategy, on the other hand, the features are statements such as financial reports or investors' sentiments, and the target value is the decision signal for buying or selling the stock. In this type of analysis, the data typically consists of textual inputs such as reports and sentiments.

Fortunately, most of the essential data required for this problem, such as historical stock prices or public sentiment in the news, is available online. The data employed in this study was acquired from two

sources: technical analysis data available on Yahoo Finance¹, and sentiment analysis data available on Twitter². The Yahoo Finance data includes the open, close, mid, high, and low prices and the volume values, without missing samples, whereas the Twitter dataset contains tweets from the public, comprising news agencies and individuals, and can have missing samples.

3.2 Data Engineering

The data obtained from these sources must be pre-processed before being used in model training. Both technical analysis and fundamental analysis have several indicators that are applied in the model training step; the most significant ones are explained in the following.

3.2.1 Technical Analysis. The historical stock prices are used to

calculate appropriate ﬁnancial indicators such as simple moving

average (SMA), exponential moving average (EMA), RSI, MACD,

and on-balance-volume (OBV) to build the input features of an ML

training model. These indicators are explained in the following.

SMA. This indicator is the average of the most recent closing prices

of a stock in a particular period. The mathematical calculation of

the SMA is shown as below:

$$\mathrm{SMA}(t, N) = \sum_{k=1}^{N} \frac{CP(t-k)}{N} \tag{1}$$

where $CP$ is the closing price, $N$ indicates the number of days over which the $CP$ is evaluated, and $k$ shows the days associated with a particular $CP$.
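
As a minimal sketch of Equation (1), the function below averages the $N$ closing prices that precede day $t$. The function name `sma`, the list-based price series, and the sample prices are illustrative assumptions, not part of the original study.

```python
def sma(closes, t, n):
    """Simple moving average per Equation (1): mean of CP(t-1), ..., CP(t-N)."""
    if n < 1 or t - n < 0 or t > len(closes):
        raise ValueError("need N >= 1 and N closing prices before index t")
    return sum(closes[t - k] for k in range(1, n + 1)) / n


# Hypothetical closing prices, oldest first (illustrative values only).
closes = [10.0, 11.0, 12.0, 13.0, 14.0]
sma_3 = sma(closes, t=4, n=3)  # averages closes[3], closes[2], closes[1]
```

Here `sma_3` is the mean of the three prices before index 4; the price at index 4 itself is excluded because the sum in Equation (1) starts at $k = 1$.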

EMA. This indicator tracks a stock price the same as the SMA,

but it pays more attention to the recent closing prices by weighting

them. Equation 2 indicates the weighting process of this indicator.

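$$\mathrm{EMA}(t, \Delta) = \big(CP(t) - \mathrm{EMA}(t-1)\big)\,\Gamma + \mathrm{EMA}(t-1), \qquad \Gamma = \frac{2}{\Delta + 1}, \quad \Delta = \text{time period of the EMA} \tag{2}$$

where $t$ is the present day, $\Delta$ is the number of days, and $\Gamma$ is the smoothing factor.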
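
The recursion in Equation (2) can be sketched as follows. The equation does not say how the first EMA value is initialized, so seeding it with the first closing price is an assumption made here; the name `ema_series` and the sample prices are also illustrative.

```python
def ema_series(closes, period):
    """EMA per Equation (2), seeded with the first closing price (an assumption)."""
    if period < 1 or not closes:
        raise ValueError("need period >= 1 and at least one closing price")
    gamma = 2.0 / (period + 1)  # smoothing factor from Equation (2)
    ema = [float(closes[0])]
    for cp in closes[1:]:
        prev = ema[-1]
        # (CP(t) - EMA(t-1)) * gamma + EMA(t-1)
        ema.append((cp - prev) * gamma + prev)
    return ema


# Hypothetical prices; with period 3 the smoothing factor is 0.5.
ema_3 = ema_series([10.0, 12.0, 14.0], period=3)
```

With a smoothing factor of 0.5, each new value lies halfway between the previous EMA and the latest close, which shows how recent prices receive more weight than in the SMA.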

MACD. This indicator compares the short-term and the long-term trends of a stock price. Equation 3 describes this indicator as follows:

$$\mathrm{MACD} = \mathrm{EMA}(t, k) - \mathrm{EMA}(t, d) \tag{3}$$

where $k$ and $d$ are the periods of the short-term and long-term trends. Normally, these values are taken as $k = 12$ and $d = 26$ days.
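
A short sketch of Equation (3) is given below, with the two periods passed as parameters; the defaults follow the conventional 12- and 26-day setting. The helper names and the first-price seeding of the EMA are assumptions made for illustration.

```python
def _ema(closes, period):
    """EMA recursion of Equation (2), seeded with the first close (an assumption)."""
    gamma = 2.0 / (period + 1)
    out = [float(closes[0])]
    for cp in closes[1:]:
        out.append((cp - out[-1]) * gamma + out[-1])
    return out


def macd_series(closes, short=12, long=26):
    """MACD per Equation (3): short-period EMA minus long-period EMA, per day."""
    if not closes or short >= long:
        raise ValueError("need prices and short period < long period")
    fast = _ema(closes, short)
    slow = _ema(closes, long)
    return [f - s for f, s in zip(fast, slow)]


# Hypothetical steadily rising price series.
rising = [float(i) for i in range(1, 41)]
macd_rising = macd_series(rising)
```

On a steadily rising series the short-period EMA tracks the price more closely than the long-period EMA, so the MACD is positive after the first day; on a flat series it stays at zero.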

OBV. This indicator uses a stock volume ﬂow to show the price

trend and indicates whether this volume is ﬂowing in or out. The

following equation explains the OBV concept:

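$$\mathrm{OBV} = \mathrm{OBV}_{pr} + \begin{cases} \text{volume}, & \text{if } CP > CP_{pr} \\ 0, & \text{if } CP = CP_{pr} \\ -\text{volume}, & \text{if } CP < CP_{pr} \end{cases} \tag{4}$$

where $\mathrm{OBV}_{pr}$ is the previous OBV, volume is the latest trading volume, and $CP_{pr}$ is the previous closing price.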
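
The piecewise update in Equation (4) can be sketched as a running total. The starting value of zero, the function name `obv_series`, and the sample data are assumptions for illustration.

```python
def obv_series(closes, volumes, start=0):
    """On-balance volume per Equation (4); `start` is the assumed initial OBV."""
    if len(closes) != len(volumes):
        raise ValueError("closes and volumes must have the same length")
    obv = [start]
    for i in range(1, len(closes)):
        if closes[i] > closes[i - 1]:
            step = volumes[i]       # price rose: volume flows in
        elif closes[i] < closes[i - 1]:
            step = -volumes[i]      # price fell: volume flows out
        else:
            step = 0                # unchanged price: OBV unchanged
        obv.append(obv[-1] + step)
    return obv


# Hypothetical closing prices and daily volumes.
obv = obv_series([10, 11, 11, 9], [100, 200, 300, 400])
```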

RSI. This indicator measures whether a stock is oversold or overbought. Indeed, it shows the trend of buying/selling a stock. The RSI is described as:

$$\mathrm{RSI} = 100 - \frac{100}{1 + RS(t)}, \qquad RS(t) = \frac{\mathrm{AvgGain}(t)}{\mathrm{AvgLoss}(t)} \tag{5}$$

where $RS(t)$ shows the rate of profitability of the stock, $\mathrm{AvgGain}(t)$ is the average gain of the stock at time $t$, and $\mathrm{AvgLoss}(t)$ is the average loss at that time.

¹ https://finance.yahoo.com/
² https://twitter.com/
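
A minimal sketch of the RSI computation follows, using the conventional form RSI = 100 − 100 / (1 + RS). The text does not state how the averages are formed, so plain averages over the last `period` price changes are assumed here (Wilder's smoothed averages are a common alternative); the 14-day default and the function name are also assumptions.

```python
def rsi(closes, period=14):
    """RSI from simple average gain and loss over the last `period` changes."""
    if period < 1 or len(closes) < period + 1:
        raise ValueError("need at least period + 1 closing prices")
    first = len(closes) - period
    changes = [closes[i] - closes[i - 1] for i in range(first, len(closes))]
    avg_gain = sum(c for c in changes if c > 0) / period
    avg_loss = sum(-c for c in changes if c < 0) / period
    if avg_loss == 0:
        return 100.0  # no losses in the window: RS is unbounded
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)


# Hypothetical prices: two gains of 1 and one loss of 1 give RS = 2.
rsi_value = rsi([1.0, 2.0, 3.0, 2.0], period=3)
```

Values near 100 indicate a window dominated by gains (overbought), and values near 0 a window dominated by losses (oversold).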

3.2.2 Fundamental Analysis. Due to the unstructured nature of fundamental indicators, extracting data for fundamental analysis is not easy. However, the development of AI makes it possible to extract data from the Internet for this purpose, leading to a more accurate stock market prediction. This data can be information related to the financial report of a company or the sentiment of investors. Indeed, companies' financial reports instantly affect public sentiment, which shows up on social media, particularly Twitter. Thus, one way of evaluating the impact of fundamental data on market trends is to look at public tweets. This strategy is called sentiment analysis of the stock market. In sentiment analysis, the input data for training a model is unstructured and is imported into the model as text. The target of fundamental datasets is a binary value indicating the positive or negative impact of the text on a specific stock.

Moreover, the pre-processing step differs with the type of data. In technical analysis, because the data is numeric, it is essential to normalize the data before using it for model training. Normalization is significant when the ML model has to find a logical pattern in the input data: if the features are not on the same scale, the prediction process will not perform accurately. Many functions can be applied to normalize the data, such as MinMaxScaler, StandardScaler, and RobustScaler. In this paper, MinMaxScaler is used to scale the data and is described as below:

$$a^{m}_{scaled} = \frac{a^{m}_{i} - a_{min}}{a_{max} - a_{min}} \tag{6}$$

where $a^{m}_{i}$ is the $i$th feature (indicator) from the $m$th experiment (time sample), and $a_{min}$ and $a_{max}$ are the minimum and the maximum values of the feature among the experiments, respectively. Also, $a^{m}_{scaled}$ indicates the scaled value for the $i$th feature of the $m$th experiment.
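
Equation (6) can be sketched in plain Python as a column-wise rescaling. This is an illustrative stand-in for a library scaler, not the code used in the study; mapping a constant column to 0.0 is an assumption made to avoid division by zero.

```python
def min_max_scale(rows):
    """Scale each feature column of `rows` to [0, 1] following Equation (6)."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column has no range; map it to 0.0 (assumption).
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return scaled


# Hypothetical samples: rows are time samples, columns are features.
samples = [[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]]
scaled = min_max_scale(samples)
```

After scaling, both columns span the same range even though the second feature was originally an order of magnitude larger than the first.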

On the other hand, in fundamental analysis, the data is not numeric. The goal is to investigate the impact of a sentence (for example, a tweet on Twitter) on public sentiment. Whenever non-numerical data is used in training an ML model, the input data should be translated into numeric data. One way to do so is data labeling.

Feature selection means finding the most valuable features, which lead to a more accurate ML model in less computation time. Feature selection techniques can be classified into filter, wrapper, embedded, and hybrid methods [13]. In the filter method, the correlation criterion plays a significant role. Correlation is a measure of the linear relationship between two or more parameters. In this method, the features that are most correlated with the target are selected to build the model. Furthermore, to avoid redundant computation, the selected features should not be highly correlated with each other. The Pearson correlation technique is one of the most useful methods for this purpose and is described as below:

$$\mathrm{Corr}(i) = \frac{\mathrm{cov}(a_i, b)}{\sqrt{\mathrm{var}(a_i)\,\mathrm{var}(b)}} \tag{7}$$

where $a_i$ is the $i$th feature, $b$ is the target label, and $\mathrm{cov}()$ and $\mathrm{var}()$ represent the covariance and the variance functions, respectively.

The processed data could be employed to train the ML model, as

shown in Fig. 1.
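
The filter method described above can be sketched as follows: features are ranked by their Pearson correlation with the target (Equation (7)), weakly correlated features are dropped, and a feature is skipped if it is highly correlated with one already selected. The two thresholds, the function names, and the sample data are hypothetical; the paper does not state the values it used.

```python
import math


def pearson(x, y):
    """Pearson correlation per Equation (7); returns 0.0 for a constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def filter_select(features, target, min_target_corr=0.5, max_mutual_corr=0.9):
    """Keep features correlated with the target but not with each other."""
    ranked = sorted(features,
                    key=lambda name: abs(pearson(features[name], target)),
                    reverse=True)
    selected = []
    for name in ranked:
        if abs(pearson(features[name], target)) < min_target_corr:
            continue  # too weakly related to the target
        if all(abs(pearson(features[name], features[kept])) <= max_mutual_corr
               for kept in selected):
            selected.append(name)  # not redundant with a kept feature
    return selected


# Hypothetical features: "a_copy" nearly duplicates "a"; "noise" is weak.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "a": [2.0, 4.0, 6.0, 8.0, 10.0],
    "a_copy": [2.1, 3.9, 6.2, 8.0, 9.9],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
}
selected = filter_select(features, target)
```

In this toy example only one of the two near-duplicate features survives, and the weakly correlated feature is discarded.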

Fig. 1: The framework of model training to predict the stock market.

3.3 Machine Learning Model Training

Many ML algorithms have been employed to predict stock markets

in research studies. Basically, there are two main categories of

models to address this problem: classiﬁcation models that try to

help the investors in the decision-making process of buying, selling,

or holding stock, and regression models that attempt to predict

stock price movements such as the closing price of a stock. In

research studies, over 90% of the algorithms leveraged in predicting

the stock market are classiﬁcation models [14]. However, few

studies tried to predict the exact stock prices using the regression

models [15, 16, 17].

Among ML algorithms, the decision tree (DT), support vector machine (SVM), and artificial neural network (ANN) are the most popular ones employed to predict stock markets [18]. In this study, besides the ANN, DT, and SVM models, logistic regression (LR), Gaussian naive Bayes (GNB), Bernoulli naive Bayes (BNB), random forest (RF), k-nearest neighbor (KNN), and XGBoost (XGB) are employed for classification; moreover, linear regression and long short-term memory (LSTM) are used for regression. These algorithms are briefly explained in the following.

ANN. The ANN originates from concepts in biology and consists of various processing elements called neurons. The main task of these inter-connected neurons is to sum the values of the input parameters, weighted by their specified weights, and then add a bias. The number of neurons in the input layer equals the number of input features, and the number of neurons in the output layer equals the number of outputs. In the end, the output values are calculated after the transfer function is applied.
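
The computation performed by a single neuron, as described above, can be sketched as a weighted sum plus a bias passed through a transfer function. The sigmoid used here is one common choice and is an assumption, as are the function names and the sample weights.

```python
import math


def sigmoid(z):
    """Logistic transfer function (one common choice; an assumption here)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias, transfer=sigmoid):
    """Weighted sum of the inputs plus a bias, passed through `transfer`."""
    if len(inputs) != len(weights):
        raise ValueError("one weight is required per input")
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return transfer(z)


# Hypothetical inputs and weights whose weighted sum is exactly zero.
activation = neuron([1.0, 2.0], [0.5, -0.25], bias=0.0)
```

A full network stacks many such neurons in layers and learns the weights and biases during training.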

DT. A decision tree has a tree-like structure in which each internal node represents a test on a particular attribute, each branch represents an outcome of that test, and each leaf indicates a class label. The final decision is the class label of the leaf reached by following the test outcomes from the root.

SVM. In the SVM model, examples are mapped to points in space such that the examples of different classes are separated by as wide a gap as possible. New examples are then mapped into the same space and categorized according to the side of the gap on which they fall.

LR. Logistic regression is one of the most suitable algorithms in regression analysis when the dependent variable is binary; it uses a logistic function to model the probability of each class.

GNB and BNB. Gaussian naive Bayes and Bernoulli naive Bayes are supervised learning algorithms that are simple but very effective. Both combine the prior probabilities of the dataset classes with the class-conditional likelihoods of the features to obtain posterior class probabilities. Gaussian naive Bayes models continuous features with a normal distribution, while Bernoulli naive Bayes applies only to data with binary-valued variables.

RF. The random forest algorithm builds a series of decision trees with the objective of generating an uncorrelated group of trees whose combined prediction is more accurate than that of any single tree in the group.

KNN. The KNN is a well-known algorithm for classification problems, in which the labeled points nearest to an unclassified point are used to determine how that point should be classified. Manhattan distance and Euclidean distance are the measures commonly used in this algorithm to compute the distance from the unclassified point to its neighbors.
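
A minimal sketch of KNN classification with the two distance measures named above is shown below. The majority-vote rule, the value k = 3, and the toy points are assumptions for illustration; the labels 1 and -1 mirror the sentiment encoding used later in the paper.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(points, labels, query, k=3, distance=euclidean):
    """Label `query` by majority vote among its k nearest labeled points."""
    order = sorted(range(len(points)),
                   key=lambda i: distance(points[i], query))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled points forming two well-separated clusters.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [-1, -1, -1, 1, 1, 1]
near_origin = knn_predict(points, labels, (0.5, 0.5))
near_far = knn_predict(points, labels, (5.5, 5.5), distance=manhattan)
```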

XGB. XGBoost is a popular open-source implementation of the gradient-boosted trees algorithm. It is a supervised learning algorithm that predicts a target variable accurately by combining the estimates of a set of simpler, weaker models.

Linear Regression. Linear regression, a supervised learning method, is basically a first-order predictor, e.g., a line or a plane that best fits the data points of the dataset. The prediction for any new point is located on that line or plane.
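
For a single input feature, the best-fitting line can be obtained in closed form by ordinary least squares, as sketched below. The function name `fit_line` and the sample points are illustrative assumptions; the study itself uses many features, for which a library implementation would normally be used.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one input feature."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("x must contain at least two distinct values")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points lying exactly on the line y = 2x + 1.
slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
prediction = slope * 4.0 + intercept  # the new point falls on the fitted line
```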

LSTM. Unlike standard feed-forward neural networks, the long short-term memory (LSTM) algorithm has feedback connections and is utilized in deep learning. This algorithm is widely used for classification problems and for making predictions based on time-series data.

All the proposed algorithms are used to perform a stock market

prediction, and their performance is compared to evaluate the

sufﬁciency of ML in this problem. The following subsection

explains the metrics that are applied in the comparison procedure.

3.4 Model Evaluation Metrics

All prediction models require evaluation metrics to investigate their accuracy in the prediction procedure. For ML algorithms, a multitude of metrics are available to measure the

models' performance, including the confusion matrix and the receiver operating characteristic (ROC) curve for classification models, and R-squared, explained variation, mean absolute percentage error (MAPE), root mean squared error (RMSE), and mean absolute error (MAE) for regression models [19]. The rest of this subsection is devoted to explaining these metrics.

3.4.1 Confusion matrix. This measure evaluates the accuracy of an ML model using a set of data whose targets are known in advance. Several other metrics, including sensitivity, specificity, precision, and F1-score, are derived from this matrix. The sensitivity, or recall, is the true-positive rate, i.e., the likelihood that an actual positive is predicted as positive, while the specificity is the true-negative rate. The precision indicates the fraction of predicted positives that are truly positive. The F1-score computes the balance between sensitivity and precision. Finally, the accuracy of the model is the fraction of all predictions that are correct. Figure 2 shows the confusion matrix concept.
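
The metrics derived from the confusion matrix can be sketched for a binary problem as follows. The labels 1 and -1 mirror the sentiment encoding used later in the paper; the function names, the zero fallback for empty denominators, and the sample predictions are assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred, positive=1):
    """Sensitivity, specificity, precision, F1-score, and accuracy."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)

    def ratio(num, den):
        return num / den if den else 0.0  # empty denominator: assumption

    sensitivity = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Hypothetical labels: 3 TP, 1 FN, 3 TN, 1 FP.
y_true = [1, 1, 1, -1, -1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1, 1, -1, 1]
metrics = classification_metrics(y_true, y_pred)
```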

Fig. 2: Confusion matrix explanation.

3.4.2 ROC, AUC. The receiver operating characteristic (ROC) curve plots two values against each other: the true-positive rate and the false-positive rate. The ROC investigates the classifiers' performance over the whole range of class distributions and error costs. ROC curves are compared by the area under the curve (AUC) metric. A higher AUC indicates more accurate predicted outputs [20].
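
As an illustrative sketch, the ROC curve can be traced by sweeping a decision threshold over the classifier scores, and the AUC obtained with the trapezoidal rule. The 0/1 label encoding, the function names, and the sample scores are assumptions made for this example.

```python
def roc_points(labels, scores):
    """ROC points (FPR, TPR) from sweeping the threshold over the scores."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Process tied scores together so they produce a single point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels (1 = positive) and classifier scores.
curve = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
area = auc(curve)
```

A classifier that ranks every positive above every negative reaches an AUC of 1, while random scoring gives about 0.5.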

3.4.3 R-squared ($R^2$). The $R^2$ is a statistical measure indicating the portion of the variance of a dependent variable that is explained by an independent variable or variables in a regression model. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. In regression analysis, a higher $R^2$ means that the model explains more of the changes in the outcome variable. If the R-squared value is less than 0.3, it is generally considered a very weak effect size; between 0.3 and 0.5, a low effect size; between 0.5 and 0.7, a moderate effect size; and above 0.7, a strong effect size. The following equation presents the formula for calculating the $R^2$ metric.

$$R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2} \tag{8}$$

where $y_i$ and $\hat{y}_i$ are the $i$th actual and predicted values, respectively, and $\bar{y}$ shows the mean of the actual values.

3.4.4 Explained Variation. The explained variation is used to measure the discrepancy between a model and actual data. In other words, it is the part of the model's total variance that is explained by factors that are actually present and is not due to error variance. It is computed as the sum of the squared differences between each predicted value and the mean of the actual values, as shown in Equation (9):

$$EV = \sum (\hat{y}_i - \bar{y})^2 \tag{9}$$

where $EV$ is the explained variation, $\hat{y}_i$ is the predicted value, and $\bar{y}$ indicates the mean of the actual values.
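
Equations (8) and (9) can be sketched directly in Python. The function names and sample values are illustrative assumptions. Note that Equation (9) gives an unnormalized sum of squares, whereas Table 1 reports values on a 0 to 1 scale, so the reported figure was presumably normalized in a way the text does not specify.

```python
def r_squared(actual, predicted):
    """Coefficient of determination per Equation (8)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    if ss_tot == 0:
        raise ValueError("actual values must not all be equal")
    return 1.0 - ss_res / ss_tot


def explained_variation(actual, predicted):
    """Explained variation per Equation (9), as an unnormalized sum."""
    mean_actual = sum(actual) / len(actual)
    return sum((y_hat - mean_actual) ** 2 for y_hat in predicted)


# Hypothetical values: a perfect prediction and a mean-only prediction.
actual = [1.0, 2.0, 3.0, 4.0]
r2_perfect = r_squared(actual, actual)
r2_mean_only = r_squared(actual, [2.5, 2.5, 2.5, 2.5])
ev_perfect = explained_variation(actual, actual)
```

Predicting the mean for every sample yields an $R^2$ of zero, which is the baseline against which a regression model is judged.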

3.4.5 MAPE. The MAPE is how far the model’s predictions are

off from their corresponding outputs on average. The MAPE is

asymmetric and reports higher errors if the prediction is more than

the actual value and lower errors when the prediction is less than the

actual value. Equation (10) explains the mathematical formulation

of this metric.

$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \tag{10}$$

where $n$ is the number of experiments, $\hat{y}_i$ is the predicted value, and $y_i$ is the actual value for the $i$th experiment.

3.4.6 RMSE. The RMSE is the standard deviation of the prediction errors of an ML model. A prediction error, or residual, shows how far a data point is from the regression line, and the RMSE measures how spread out these residuals are [21]. In other words, it shows how concentrated the data is around the line of best fit, as shown in Equation (11). A smaller value of this metric represents a better prediction by the model.

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n}} \tag{11}$$

where $n$ is the number of experiments, $\hat{y}_i$ is the predicted value, and $y_i$ is the actual value for the $i$th experiment.

3.4.7 MAE. The MAE is the mean of the absolute differences between the target and the predicted values. Thus, it evaluates the average magnitude of the errors in a set of predictions without considering their directions. Smaller values of this metric mean a better prediction model. The following equation presents the mathematical formula for the MAE.

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \tag{12}$$

where $n$ is the number of experiments, $\hat{y}_i$ is the predicted value, and $y_i$ is the actual value for the $i$th experiment.
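
The three error metrics of Equations (10) to (12) can be sketched together as follows. The function names and the sample values are illustrative assumptions. The MAPE is returned as a fraction, exactly as Equation (10) is written; multiplying by 100 would express it as a percentage.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error per Equation (10), as a fraction."""
    if any(y == 0 for y in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return sum(abs((y - y_hat) / y)
               for y, y_hat in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error per Equation (11)."""
    return math.sqrt(sum((y_hat - y) ** 2
                         for y, y_hat in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error per Equation (12)."""
    return sum(abs(y - y_hat)
               for y, y_hat in zip(actual, predicted)) / len(actual)


# Hypothetical closing prices and predictions.
actual = [100.0, 200.0]
predicted = [110.0, 180.0]
errors = (mape(actual, predicted), rmse(actual, predicted), mae(actual, predicted))
```

Because the RMSE squares each error before averaging, the larger error of 20 dominates it, whereas the MAE weights both errors equally.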

Regarding the proposed framework, the performance of ML

algorithms on the prediction of stock markets can be evaluated.

The following section implements ML algorithms on the real-life

problem of the U.S. stock market prediction.

4. RESULTS AND DISCUSSION

This section illustrates the performance of the proposed methodology on the prediction of stock markets. For this, Python is used to train the ML models and to make predictions on unseen data. First, the market prediction based on technical analysis is evaluated, and then the fundamental analysis is investigated for this problem.

Fig. 3: AAPL price prediction with the technical analysis approach. (a) Prediction of the linear regression model. (b) Prediction of the LSTM model.

Table 1: Model performance comparison in the technical analysis approach.

| Metric | Linear Regression | LSTM |
|---|---|---|
| $R^2$ | 1.0 | 0.99 |
| Explained Variation | 1.0 | 0.99 |
| MAPE | 1.56 | 2.99 |
| RMSE | 1.82 | 3.42 |
| MAE | 1.18 | 2.3 |

4.1 Technical Analysis Performance

In this paper, the dataset for building a predictor model based on technical analysis is obtained from the Yahoo Finance website. It contains the historical data of a well-known stock, AAPL, which represents Apple, over a period of more than ten years from 2010 to 2021. The dataset includes 60 features such as the open, high, and low prices, the moving average, MACD, and RSI. The target is the close price, representing the final price of AAPL at the end of a business day. The features most correlated with the target are selected, and redundant features that are highly correlated with each other are merged. Finally, the data is scaled by the MinMaxScaler function explained in Section 3.

The dataset is divided into three parts, training data, validation data, and testing data, to build the ML model. A large portion of the data is devoted to the training process, and the rest belongs to validation and testing. In the training process, the algorithm uses the training data, for which the target value is accessible, to learn how to predict the target. Then, the model's prediction performance is evaluated on the validation data. Finally, the model predicts the unseen targets of the testing dataset, which are compared with the true target values. In the end, the evaluation metrics are computed from the predicted and actual values of the closing price. Table 1 shows the comparison of the evaluation metrics. Moreover, Fig. 3 shows the prediction of the stock price based on the linear regression and the LSTM algorithms.

Table 2: Model performance comparison in fundamental analysis.

| Metrics | LR | GNB | BNB | DT | RF | KNN | SVM | XGB | ANN |
|---|---|---|---|---|---|---|---|---|---|
| Precision | 0.729 | 0.636 | 0.644 | 0.620 | 0.727 | 0.684 | 0.757 | 0.710 | 0.684 |
| Recall | 0.727 | 0.634 | 0.644 | 0.620 | 0.727 | 0.684 | 0.755 | 0.709 | 0.684 |
| F1-score | 0.726 | 0.632 | 0.644 | 0.620 | 0.727 | 0.684 | 0.755 | 0.709 | 0.684 |
| Accuracy | 0.727 | 0.634 | 0.644 | 0.620 | 0.727 | 0.684 | 0.755 | 0.709 | 0.684 |
| AUC | 0.73 | 0.63 | 0.64 | 0.62 | 0.73 | 0.68 | 0.76 | 0.71 | 0.68 |

Fig. 4: ROC curves for classification algorithms. (a) ROC curves. (b) ROC curves from a closer view.

According to Table 1, the linear regression model predicts the AAPL closing price clearly better than the LSTM model, with roughly half the MAPE, RMSE, and MAE. Moreover, to illustrate the predicted and actual values of the closing price, Fig. 3 shows these values since 2018. The solid blue line shows the predicted value, and the dashed green line is the actual one.
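
The three-way split described in this subsection can be sketched as below. The paper does not state the split proportions or whether the samples were shuffled, so the 70/15/15 fractions and the chronological ordering (oldest samples for training, newest for testing) are assumptions chosen to suit time-series data.

```python
def chronological_split(rows, train_frac=0.7, val_frac=0.15):
    """Split time-ordered rows into train, validation, and test parts.

    Fractions are hypothetical; the rows are not shuffled, so the test
    set always contains the most recent samples.
    """
    if not 0 < train_frac < 1 or not 0 <= val_frac < 1 - train_frac:
        raise ValueError("fractions must leave room for a test set")
    n = len(rows)
    train_end = round(n * train_frac)
    val_end = train_end + round(n * val_frac)
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]


# Hypothetical time-ordered samples, identified here by their day index.
days = list(range(100))
train, validation, test = chronological_split(days)
```

Keeping the order intact means the model is always evaluated on days that come after everything it was trained on, which avoids leaking future prices into training.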

4.2 Fundamental Analysis Performance

In this paper, a set of public tweets associated with Apple, available at [22], is employed to generate the required dataset. In this case, the features are the texts of the tweets, and the target is a binary value of the sentiment impact. If the content of a tweet has a positive impact on the stock market, the sentiment value is 1, while a negative impact gives a sentiment value of -1. Then, the impacts of tweets on the specific stock are evaluated. Finally, the performance of ML in predicting the buying/selling/holding signal is investigated.

The dataset includes nearly 6000 tweets. Pre-processing includes

labeling the target values and employing principal component

analysis (PCA) [23] to reduce the dimension of highly correlated

features. Then, the algorithms proposed in Section 3 are used to

classify each tweet as having negative or positive sentiment. Based

on the evaluation metrics explained in the previous section, the

performance of the ML algorithms is compared in Table 2. This table

indicates that, in this paper, the prediction of public sentiment

using ML algorithms does not show promising results. The most

accurate algorithm is the SVM, with an accuracy of about 76% (0.755

in Table 2). Moreover, Fig. 4 illustrates the performance of these

algorithms by comparing their ROC curves and showing the AUC for

each algorithm. In this figure, the SVM algorithm has the best AUC

score.
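As a rough illustration of how the metrics in Table 2 are obtained from labels in {-1, 1}, the sketch below computes precision, recall, F1-score, and accuracy from confusion-matrix counts, and the AUC as the fraction of (positive, negative) pairs that the classifier ranks correctly. The labels, scores, and the 0.5 decision threshold are hypothetical toy values, and the metrics are computed for the positive class only; the averaging scheme used for Table 2 is not stated in the paper, so this is an assumption rather than the authors’ procedure.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred, positive=1):
    """Precision, recall, F1-score, and accuracy for the positive class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

def roc_auc(y_true, scores, positive=1):
    """AUC as the probability that a random positive example is scored
    above a random negative one (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example (hypothetical): true sentiment labels and classifier scores.
y_true = [1, 1, 1, -1, -1, -1, 1, -1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]
# Assumed decision threshold of 0.5 on the score.
y_pred = [1 if s >= 0.5 else -1 for s in scores]

report = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(report, f"AUC={auc:.4f}")
```

For this toy data, the four threshold-based metrics all equal 0.75, while the AUC is 0.9375. The AUC is higher because it depends only on how the scores rank the tweets, not on a particular threshold.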

5. CONCLUSION

This study addresses the problem of stock market prediction

by leveraging ML algorithms. To do so, the two main categories of

stock market analysis (technical and fundamental) are considered,

and the performance of ML algorithms in forecasting the stock

market is investigated for both. Labeled datasets are used to

train the supervised learning algorithms, and evaluation metrics

are employed to examine the accuracy of the ML algorithms in the

prediction process. The results show that, in the technical

analysis, the linear regression model predicts the closing price

remarkably well, with a low error value. Moreover, in the

fundamental analysis, the SVM model can predict public sentiment

with an accuracy of 76%. These results imply that although AI

can predict stock price trends or public sentiment about the

stock markets, its accuracy is not good enough. Furthermore, while

linear regression can predict the closing price within a sensible

range of error, it cannot precisely predict the same value for the

next business day. Thus, this model is not sufficient for long-term

investments. On the other hand, the accuracy of classification

algorithms in predicting whether to buy, sell, or hold a stock is

not satisfactory and can result in a loss of capital.

Nevertheless, many research studies on this topic leverage

a hybrid model that employs both technical analysis and

fundamental analysis in one ML model to compensate for the

individual algorithms’ downsides. This could increase prediction

accuracy and is a promising topic for future studies. Based on

this study, it seems that AI is not yet close to predicting the

stock market with reliable accuracy. Perhaps in the future, with

advances in AI and especially in computational power, a more

precise model of stock market prediction will become available.

Still, so far, there is no reputable model that can beat the stock

market.

References

[1] Sohrab Mokhtari and Kang K Yen. “A Novel Bilateral

Fuzzy Adaptive Unscented Kalman Filter and its

Implementation to Nonlinear Systems with Additive

Noise”. In: 2020 IEEE Industry Applications Society

Annual Meeting. IEEE. 2020, pp. 1–6.

[2] Sohrab Mokhtari and Kang K Yen. “Impact of large-scale

wind power penetration on incentive of individual investors,

a supply function equilibrium approach”. In: Electric Power

Systems Research 194 (2021), p. 107014.

[3] Eugene F Fama. “Efficient capital markets: II”. In: The

journal of finance 46.5 (1991), pp. 1575–1617.

[4] Andrew W Lo. “The adaptive markets hypothesis”. In: The

Journal of Portfolio Management 30.5 (2004), pp. 15–29.

[5] From Charles D Kirkpatrick II and R Julie. “Dow Theory”.

In: CMT Level I 2019: An Introduction to Technical

Analysis (2019), p. 15.

[6] Robert D Edwards, WHC Bassetti, and John Magee.

Technical analysis of stock trends. CRC press, 2007.

[7] Joseph D Piotroski. “Value investing: The use of historical

financial statement information to separate winners from

losers”. In: Journal of Accounting Research (2000),

pp. 1–41.

[8] Partha S Mohanram. “Separating winners from losers

among low book-to-market stocks using financial statement

analysis”. In: Review of accounting studies 10.2-3 (2005),

pp. 133–170.

[9] Takashi Kimoto et al. “Stock market prediction system with

modular neural networks”. In: 1990 IJCNN international

joint conference on neural networks. IEEE. 1990, pp. 1–6.

[10] Alireza Abbaspour et al. “A Survey on Active

Fault-Tolerant Control Systems”. In: Electronics 9.9

(2020). ISSN: 2079-9292.

[11] Jigar Patel et al. “Predicting stock and stock price index

movement using trend deterministic data preparation and

machine learning techniques”. In: Expert systems with

applications 42.1 (2015), pp. 259–268.

[12] Xiao Zhong and David Enke. “Predicting the daily return

direction of the stock market using hybrid machine learning

algorithms”. In: Financial Innovation 5.1 (2019), p. 4.

[13] Girish Chandrashekar and Ferat Sahin. “A survey on

feature selection methods”. In: Computers & Electrical

Engineering 40.1 (2014), pp. 16–28.

[14] D Van Thanh, HN Minh, and DD Hieu. “Building

unconditional forecast model of Stock Market Indexes

using combined leading indicators and principal

components: application to Vietnamese Stock Market”. In:

Indian Journal of Science and Technology 11 (2018).

[15] Ahmad Kazem et al. “Support vector regression with

chaos-based firefly algorithm for stock market price

forecasting”. In: Applied soft computing 13.2 (2013),

pp. 947–958.

[16] Haiqin Yang, Laiwan Chan, and Irwin King. “Support

vector machine regression for volatile stock market

prediction”. In: International Conference on Intelligent

Data Engineering and Automated Learning. Springer.

2002, pp. 391–396.

[17] Riswan Efendi, Nureize Arbaiy, and Mustafa Mat Deris. “A

new procedure in stock market forecasting based on fuzzy

random auto-regression time series model”. In: Information

Sciences 441 (2018), pp. 113–132.

[18] Isaac Kofi Nti, Adebayo Felix Adekoya, and

Benjamin Asubam Weyori. “A systematic review of

fundamental and technical analysis of stock market

predictions”. In: Artificial Intelligence Review (2019),

pp. 1–51.

[19] Sohrab Mokhtari et al. “A Machine Learning Approach for

Anomaly Detection in Industrial Control Systems Based

on Measurement Data”. In: Electronics 10.4 (2021). ISSN:

2079-9292.

[20] Caren Marzban. “The ROC curve and the area under it as

performance measures”. In: Weather and Forecasting 19.6

(2004), pp. 1106–1114.

[21] Mohammad Abedin, Sohrab Mokhtari, and Armin B

Mehrabi. “Bridge damage detection using machine learning

algorithms”. In: Health Monitoring of Structural and

Biological Systems XV. Vol. 11593. International Society

for Optics and Photonics. 2021, 115932P.

[22] Yash Chaudhary. Stock-Market Sentiment Dataset. 2020.

DOI: 10.34740/KAGGLE/DSV/1217821.

[23] Tom Howley et al. “The effect of principal component

analysis on machine learning accuracy with high

dimensional spectral data”. In: International Conference

on Innovative Techniques and Applications of Artificial

Intelligence. Springer. 2005, pp. 209–222.
