# Eghbal Rahimikia

The University of Manchester · Alliance Manchester Business School


Doctor of Philosophy

## About

- Publications: 7
- Reads: 76,701


- Citations: 39

## Publications

Publications (7)

We develop FinText, a novel, state-of-the-art, financial word embedding from Dow Jones Newswires Text News Feed Database. Incorporating this word embedding in a machine learning model produces a substantial increase in volatility forecasting performance on days with volatility jumps for 23 NASDAQ stocks from 27 July 2007 to 18 November 2016. A simp...

This paper compares machine learning and HAR models for forecasting realised volatility of 23 NASDAQ stocks using 146 variables extracted from limit order book (LOBSTER) and stock-specific news (Dow Jones Newswires) from 27 July 2007 to 18 November 2016. We find simpler ML to outperform HARs on normal volatility days. With SHAP, an Explainable AI t...

The study determines if information extracted from a big data set that includes limit order book (LOB) and Dow Jones corporate news can help to improve realised volatility forecasting for 23 NASDAQ tickers over the sample from 27 July 2007 to 18 November 2016. The out-of-sample forecasting results indicate that the CHAR model outperformed all other...

The attached MATLAB code tests whether the underlying structure within the recorded data is linear or nonlinear. The nonlinearity measure introduced in Kruger et al. (2005) performs a multivariate analysis assessing the underlying relationship within a given variable set by dividing the data series into smaller regions, calculating the su...

This paper concentrates on the effectiveness of using a hybrid intelligent system that combines multilayer perceptron (MLP) neural network, support vector machine (SVM), and logistic regression (LR) classification models with harmony search (HS) optimization algorithm to detect corporate tax evasion for the Iranian National Tax Administration (INTA...

* Preliminaries: Ways to get help, File extensions, Common data types, Data import/export, Basic commands, Create basic variables, Basic math functions, Trigonometric functions, Linear algebra, Accessing/assignment elements, Character and string, Regular expression, "IS*" functions, Convert functions, Programming, Errors, Parallel computing (CPU &...

## Questions

Questions (18)

I'm using MATLAB R2016a for binary classification (time series prediction) in a financial case. I get a good total accuracy (70~75%), but specificity is about 90% while sensitivity is about 60%, or vice versa. Currently I'm using grid search to optimize my classification model (SVM, neural network, etc.) based on total accuracy. My data-set has balanced samples of the binary output.

How can I improve the results in this case? Can I use another performance metric that takes the imbalance between sensitivity and specificity into account? (As mentioned, in some cases specificity is higher than sensitivity and in other cases the reverse.)
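As a point of reference for the metrics named in this question, the sketch below computes sensitivity, specificity, and total accuracy from confusion-matrix counts, together with two standard metrics that react to a gap between the two rates (balanced accuracy and the geometric mean of the rates). The counts are made-up illustrative values chosen to mimic the 60%/90% situation described above, not results from the question, and the function name is hypothetical.

```python
import math

def rates_from_confusion(tp, fn, tn, fp):
    """Return common binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    g_mean = math.sqrt(sensitivity * specificity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "g_mean": g_mean,
    }

# Illustrative counts only: 100 positives and 100 negatives (balanced classes),
# with 60% sensitivity and 90% specificity.
metrics = rates_from_confusion(tp=60, fn=40, tn=90, fp=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With balanced classes, balanced accuracy coincides with total accuracy, so it adds nothing here. The geometric mean is lower than both whenever the two rates differ, which is why it is sometimes used as a tuning target when the gap itself matters.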

I'm using SVM (and a neural network) on a time series prediction data-set in MATLAB R2016a with 800 samples. Currently I'm using 10-fold cross validation and grid search to find the best SVM parameters. I'm using 90 samples (following these 800 samples) as out-of-sample data to check the performance of the final model, which uses the best SVM (and neural network) parameters and is trained on the whole first 800 samples.

The test accuracy of the final model (10-fold cross validation) is about 98% (sensitivity and specificity of about 98%), but when I check the model on the last 90 out-of-sample data points (with the model trained on the whole first 800 samples) I get poor accuracy (about 55~59% total accuracy, sensitivity, and specificity). This is daily forecasting of a financial market. Why do I see this behavior? I checked both normal k-fold cross validation and sliding window validation, and saw the same behavior (poor out-of-sample accuracy) with both methods.
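To make the "sliding window validation" mentioned above concrete, here is a minimal sketch of how such splits can be generated so that every test index comes strictly after every training index. The window sizes and the function name are assumptions for illustration; only the sample count of 800 comes from the question.

```python
def sliding_window_splits(n_samples, train_size, test_size, step=None):
    """Yield (train_indices, test_indices) pairs in which every test index
    comes strictly after every training index."""
    step = step or test_size
    start = 0
    while start + train_size + test_size <= n_samples:
        train = list(range(start, start + train_size))
        test = list(range(start + train_size, start + train_size + test_size))
        yield train, test
        start += step

# train_size and test_size are assumed values, not taken from the question.
splits = list(sliding_window_splits(n_samples=800, train_size=500, test_size=50))
for train, test in splits:
    print(f"train {train[0]}-{train[-1]}, test {test[0]}-{test[-1]}")
```

Unlike ordinary k-fold cross validation, no fold here trains on observations that come after the ones it is tested on, which makes the validation setting closer to the final out-of-sample check.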

I'm testing my automatic trading system for the stock market (a data mining system). I'm modeling day by day for 30 days and calculating profit at every step. Suppose that my system predicts that tomorrow's close price will be higher than today's close price. When I check the real direction of tomorrow's price, it is higher than today's. So the real and predicted close prices move in the same direction. Now I want to calculate profit. Can I use this formula?

**profit_for_this_trade = (next_day_close_price_sell - today_close_price_buy) - commissions and other buy- and sell-related costs**

What do you think?

(We used today's and previous days' closing prices to train a model and predict the next day's closing price.) What is your proposed formula for calculating profit for a close-price prediction system? Currently I'm using only the accuracy of the system to measure performance.
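The formula proposed in this question can be written out as a short sketch. The prices, the flat per-trade cost, the quantity parameter, and the function names are all hypothetical; the sketch assumes a long-only rule (buy at today's close when an up-move is predicted, sell at the next close) and does not model short selling.

```python
def trade_profit(buy_price, sell_price, costs=0.0, quantity=1):
    """Profit of buying at today's close and selling at the next day's close."""
    return (sell_price - buy_price) * quantity - costs

def simulate(closes, predicted_up, cost_per_trade=0.0):
    """Sum profits over the days on which a higher next-day close is predicted."""
    total = 0.0
    for t in range(len(closes) - 1):
        if predicted_up[t]:
            total += trade_profit(closes[t], closes[t + 1], cost_per_trade)
    return total

# Made-up closing prices and predictions, for illustration only.
closes = [100.0, 102.0, 101.0, 104.0]
predicted_up = [True, True, False]  # one prediction per day except the last
total = simulate(closes, predicted_up, cost_per_trade=0.5)
print(total)
```

In this toy run the first prediction is right and the second is wrong, so the gain and the loss cancel once costs are included. That illustrates why accumulated profit and directional accuracy can tell different stories about the same system.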

I'm comparing some financial companies with data envelopment analysis (DEA), using their financial statements. Suppose that one of the outputs is income or profit. Some companies have negative income or negative profit. Can we use these values alongside the other, positive values in DEA in this financial case? For some companies we have the same situation with DEA inputs.

Can we use Threshold Auto-regressive Regression (TAR) for continuous inputs and a binary output? Is it appropriate for classification modeling? The output is one if D(t) - D(t-1) is positive and zero if D(t) - D(t-1) is negative.
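The binary output defined in this question can be built from a series as in the sketch below. The series values and the function name are hypothetical. The question does not say what happens when D(t) equals D(t-1), so the sketch marks those cases with `None` rather than guessing a class.

```python
def direction_labels(series):
    """Return 1 where D(t) - D(t-1) > 0 and 0 where it is < 0.
    A zero difference is not covered by the question; None marks it here."""
    labels = []
    for prev, curr in zip(series, series[1:]):
        diff = curr - prev
        if diff > 0:
            labels.append(1)
        elif diff < 0:
            labels.append(0)
        else:
            labels.append(None)
    return labels

# Made-up series for illustration.
D = [10.0, 10.5, 10.2, 10.2, 11.0]
labels = direction_labels(D)
print(labels)
```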

Suppose that we have an unbalanced data-set for a binary classification problem and we want to use 10-fold cross validation for training and testing the fitted model.

* Is it correct that we should only apply sampling methods (under-sampling, over-sampling, or SMOTE) to the training data?

* If yes, how can we implement these sampling methods for 10-fold cross-validation? Should we re-sample the minority class before cross-validation, or should we use a new structure of k-fold cross-validation?

* Is there any way to implement sampling methods with sliding validation? (I'm working on binary time series prediction: one step ahead, up-turn or down-turn of the output at t+1 compared to t, where t is time.)

* Would it not be more appropriate to apply these sampling methods separately to every year of the data-set?

Number of up-turn and down-turn samples in every year:

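| Year | Total | Up | Down |
|---|---|---|---|
| 2009 | 234 | 135 | 99 |
| 2010 | 243 | 153 | 90 |
| 2011 | 241 | 132 | 109 |
| 2012 | 240 | 133 | 107 |
| 2013 | 240 | 155 | 85 |
| 2014 | 241 | 110 | 131 |
| 2015 | 243 | 126 | 117 |
| 2016 | 29 | 24 | 5 |
| All data | 1711 | 968 | 743 |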
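The first two bullets of this question ask how resampling fits into k-fold cross-validation. One common arrangement, sketched below, is to resample inside each fold so that only the training part is balanced and the test part keeps its original class mix. This is a sketch under assumptions: it uses plain random over-sampling rather than SMOTE, contiguous folds, and toy labels whose class counts match the "All data" row. All function names are hypothetical.

```python
import random

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def oversample_minority(indices, labels, rng):
    """Randomly duplicate minority-class indices until both classes are equal in size."""
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return list(indices)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(indices) + extra

def resampled_cv(labels, k=10, seed=0):
    """Yield (balanced_train_indices, untouched_test_indices) for each fold."""
    rng = random.Random(seed)
    for test in k_fold_indices(len(labels), k):
        test_set = set(test)
        train = [i for i in range(len(labels)) if i not in test_set]
        yield oversample_minority(train, labels, rng), test

# Toy labels: 968 up-turns (1) and 743 down-turns (0), shuffled only to build
# example data; this says nothing about how real time-series data should be ordered.
labels = [1] * 968 + [0] * 743
random.Random(1).shuffle(labels)

for fold, (train, test) in enumerate(resampled_cv(labels)):
    ups = sum(labels[i] for i in train)
    print(f"fold {fold}: train up={ups}, train down={len(train) - ups}, test size={len(test)}")
```

Because duplicates are drawn only from training indices, no copy of a test sample can end up in the training set, which is the main risk of resampling before the split.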

Suppose that we have binary features (+1 and -1, or 0 and 1). There are some well-known feature selection techniques such as Information Gain, t-test, f-test, Symmetrical Uncertainty, Correlation-based Feature Selection (CFS), Mutual Information, Chi-square, Balanced Mutual Information, etc. Can we use these types of feature selection techniques for binary features, or are they only suitable for continuous data? What techniques do you recommend for this special type of feature selection?
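Two of the techniques listed above, Chi-square and Mutual Information, are defined on contingency tables of discrete variables, so they can be computed directly for a binary feature against a binary output. The sketch below shows this on made-up data; the feature vectors and function names are hypothetical, and the code is an illustration rather than a recommendation of one technique over another.

```python
import math
from collections import Counter

def chi_square(feature, target):
    """Pearson chi-square statistic of the feature/target contingency table."""
    n = len(feature)
    counts = Counter(zip(feature, target))
    f_tot, t_tot = Counter(feature), Counter(target)
    stat = 0.0
    for f in f_tot:
        for t in t_tot:
            expected = f_tot[f] * t_tot[t] / n
            observed = counts.get((f, t), 0)
            stat += (observed - expected) ** 2 / expected
    return stat

def mutual_information(feature, target):
    """Mutual information in bits between two discrete variables."""
    n = len(feature)
    counts = Counter(zip(feature, target))
    f_tot, t_tot = Counter(feature), Counter(target)
    return sum(
        (c / n) * math.log2(c * n / (f_tot[f] * t_tot[t]))
        for (f, t), c in counts.items()
    )

# Made-up data: one feature that mirrors the output and one unrelated to it.
target = [1, 1, 1, 1, 0, 0, 0, 0]
informative = [1, 1, 1, 1, -1, -1, -1, -1]
uninformative = [1, -1, 1, -1, 1, -1, 1, -1]

for name, feat in [("informative", informative), ("uninformative", uninformative)]:
    print(name, chi_square(feat, target), mutual_information(feat, target))
```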

I want to use 6-month financial data from some companies' financial reports to create a DEA model. Suppose that we have these financial reports:

- 6-2013
- 12-2013
- 6-2014
- 12-2014
- 6-2015
- 12-2015

Can we use data extracted from these financial reports in DEA? Suppose that we need "Labor Costs" in DEA. The labor costs value for 6-2013 is $160,000, for 12-2013 it is $300,000, for 6-2014 it is $164,000, and for 12-2014 it is $310,000. You can see that labor costs for 6-2013 are comparable with those for 6-2014. Labor costs for 6-2013 cover the period from the first day of 2013 until mid-2013, but labor costs for 12-2013 are full-year labor costs (certainly a higher value than at mid-2013).

Can we use this data in DEA, or should we only use yearly financial reports?
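Since the 12-month figures in the question are described as full-year (cumulative) values, one way to put all reports on the same six-month footing is to subtract the mid-year figure from the year-end figure. The sketch below works through that arithmetic with the labor costs quoted above. The function name and the H1/H2 labels are hypothetical, and whether this adjustment suits a particular DEA design is left open.

```python
# Cumulative labor costs from the question, keyed by (year, month of report).
cumulative = {
    (2013, 6): 160_000,
    (2013, 12): 300_000,
    (2014, 6): 164_000,
    (2014, 12): 310_000,
}

def half_year_costs(cumulative):
    """Convert year-to-date figures into per-half-year figures."""
    result = {}
    for (year, month), value in sorted(cumulative.items()):
        if month == 6:
            result[(year, "H1")] = value
        else:
            # Second half = full-year total minus the first-half total.
            result[(year, "H2")] = value - cumulative[(year, 6)]
    return result

halves = half_year_costs(cumulative)
for period, cost in halves.items():
    print(period, cost)
```

With these numbers the second halves of 2013 and 2014 come out at 140,000 and 146,000, which are on the same scale as the first-half values of 160,000 and 164,000.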

I have a data-set that covers different states of a country. In every state there are different companies, and one company in every state manages the other companies in that state (the other companies are branches of this leading company at different levels). I want to normalize (or standardize) this data-set and then use Factor Analysis (FA) to combine different input features into a single performance indicator.

- Is it possible to normalize the data in every state separately, using the leading company's feature values as the denominator for the other companies in that state?
- Can we compare a company from one state with a company in another state under this structure (as opposed to using one leading company for the whole data-set)?
- Does this normalization method affect the factor analysis assumptions?

Note: the leading company for the whole data-set is very large and has very high feature values, so I decided to use this normalization structure. The scales and measurement units of the features differ.
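The per-state normalization described in the first bullet can be sketched as follows. The company names, states, feature names, and values are all hypothetical, as is the record layout; the sketch only shows the mechanics of dividing each feature by the leading company's value in the same state.

```python
def normalize_by_state_leader(companies):
    """Divide each feature by the same feature of the leading company in that state."""
    leaders = {c["state"]: c for c in companies if c["is_leader"]}
    normalized = []
    for c in companies:
        leader = leaders[c["state"]]
        ratios = {k: v / leader["features"][k] for k, v in c["features"].items()}
        normalized.append({"name": c["name"], "state": c["state"], "features": ratios})
    return normalized

# Hypothetical data: two states, one leading company in each.
companies = [
    {"name": "A-lead", "state": "A", "is_leader": True,
     "features": {"assets": 1000.0, "staff": 200.0}},
    {"name": "A-branch", "state": "A", "is_leader": False,
     "features": {"assets": 250.0, "staff": 50.0}},
    {"name": "B-lead", "state": "B", "is_leader": True,
     "features": {"assets": 400.0, "staff": 80.0}},
    {"name": "B-branch", "state": "B", "is_leader": False,
     "features": {"assets": 100.0, "staff": 40.0}},
]

for row in normalize_by_state_leader(companies):
    print(row["name"], row["features"])
```

The resulting ratios are unit-free, and every leading company maps to 1.0 on every feature, so the leaders themselves become indistinguishable after this step.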

We can calculate efficiency using parametric (econometric) and non-parametric models. One of the well-known non-parametric models for calculating efficiency (for example, banking efficiency in a country) is data envelopment analysis (DEA). What are alternative non-parametric methods for calculating efficiency?

I'm working on a model for publication in a journal. I'm removing outliers from the results of my models. Suppose that I have 100 trained neural networks; I feed out-of-sample data to these models and obtain the results. In this step I'm removing outliers (based on abs(x - mean(x)) >= 2*s.d.) and using the average of the remaining results. How can I justify in my paper that removing outliers is needed? What statistical procedure or graphical presentation do we need?

Update. The X-axis is each out-of-sample data point and the Y-axis is the output for each sample. The output range is 0-1 (50 of the 8000 samples are presented in the figure). In this figure, green filled circles are the geometric mean, red filled circles are the average after removing outliers using the above formula, and blue filled circles are the arithmetic mean. We have three outputs here. Which averaging method is suitable in this case? How can I prove that? I think that, based on the figures below, we get an appropriate mean value when removing outliers beyond 2 s.d. What do you think?
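The three averaging methods compared in this question can be sketched side by side. The model outputs below are made up (eight values instead of 100, with one deliberately low value), the function names are hypothetical, and the use of the sample standard deviation is an assumption, since the question does not say which definition of s.d. is used.

```python
import math
import statistics

def trimmed_mean(values, k=2.0):
    """Mean after dropping values with abs(x - mean) >= k * s.d."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (an assumption)
    kept = [x for x in values if abs(x - mu) < k * sd]
    return statistics.mean(kept)

def geometric_mean(values):
    """Geometric mean; only defined for strictly positive values."""
    return math.exp(sum(math.log(x) for x in values) / len(values))

# Made-up outputs of several models for one out-of-sample point.
outputs = [0.62, 0.60, 0.65, 0.61, 0.63, 0.64, 0.59, 0.05]

print("arithmetic mean:", statistics.mean(outputs))
print("geometric mean: ", geometric_mean(outputs))
print("trimmed mean:   ", trimmed_mean(outputs))
```

In this toy case the single low value pulls the arithmetic mean down and the geometric mean down further, while the 2 s.d. rule removes it. Note also that the geometric mean cannot be computed if any output is exactly 0, which is possible when the output range is 0-1.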