Forecasting stock index based on hybrid artificial neural network models

Forecasting stock indices is a crucial financial problem that has recently received considerable attention in the field of artificial intelligence. In this paper we study hybrid artificial neural network models. Our main result is that hybrid models are effective tools for forecasting a stock index accurately. We analyze the performance of two classical models, the Autoregressive Integrated Moving Average (ARIMA) model and the Artificial Neural Network (ANN) model, and of the Hybrid model on real data from the Vietnam Index (VNINDEX). On previously studied foreign data sets, hybrid models have performed well on most complex time series compared with individual models such as ARIMA and ANN. For the Vietnamese stock market, our results also show that the Hybrid model gives much better forecasting accuracy than the ARIMA and ANN models. Specifically, the Hybrid combination model delivers a smaller Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) than the ARIMA and ANN models. The fitted curves show that the Hybrid model follows the trend more closely and therefore describes the actual data better. Our study of the Vietnam Index confirms that the ARIMA model is better suited to linear time series, while the ANN model works well with nonlinear time series. The Hybrid model takes both features into account, so it can be applied to more general time series. As financial markets become increasingly complex, the time series of stock indices naturally contain both linear and non-linear components. Because of this characteristic, the Hybrid model combining ARIMA with ANN produces better prediction and estimation than the traditional models.


INTRODUCTION
Over the past two decades, the most popular techniques for forecasting stock prices have been statistical models and artificial intelligence (AI) models. The statistical models most commonly used for time series analysis include the Autoregressive Integrated Moving Average (ARIMA) model, also known as the Box-Jenkins model, the Exponential Smoothing model (ESM), and the Generalized Autoregressive Conditional Heteroskedasticity (GARCH) volatility model. However, the mean and variance of financial time series change over time, and hence the series are not linear. More precisely, financial time series often contain both linear and nonlinear patterns. One of the main restrictions of these traditional models is therefore that they only contain a linear structure. In fact, Refenes et al. 1 showed that traditional statistical forecasting models such as ARIMA have major limitations when applied to non-linear data sets such as stock indices and exchange rates. Recent developments in the theory of computational intelligence provide powerful mathematical tools for private investors, portfolio managers and bankers to exploit big data, especially in finance. AI models and machine learning techniques, e.g., Artificial Neural Network (ANN) models, have been introduced to overcome these restrictions. These models can capture both the linear and the non-linear parts of a series. Recently, a new approach that combines ARIMA and ANN models for financial time series has been studied, e.g., in Zhang 2 and Wang et al. 3 . This combination is called the hybrid model. It has been shown that the hybrid model gives more accurate forecasts for time series, especially for stock prices. The basic idea of the hybrid ARIMA and ANN model is that the non-linear patterns appear in the residuals of the linear ARIMA model, and these residuals can then be modelled with artificial neural networks.
Furthermore, the relationship between the linear and non-linear components is assumed to be additive. In this study we use the hybrid model to forecast the VNINDEX stock index. We first identify suitable ARIMA and ANN models for the time series and then build an appropriate hybrid model that combines them.

FORECASTING METHODOLOGY
In this section we give a brief description of the ARIMA and Artificial Neural Network models. We then present the basic principle of the hybrid model built from the ARIMA and ANN models.

The ARIMA model
The ARIMA model was introduced by Box and Jenkins 4 . It is one of the most general classes of models for forecasting a time series that can be made stationary by differencing. More precisely, the ARIMA model generalizes the ARMA (autoregressive moving average) model by dropping the assumption that the time series is stationary. The key characteristic of the ARIMA model is that the predicted future behaviour of a time series is a linear function of past observations and random errors, i.e., the ARIMA forecasting equation for a stationary series $Y_t$ has the form

predicted value of $Y_t$ at time $t$ = constant + weighted sum of the last $p$ values of $Y_t$ + weighted sum of the last $q$ values of the errors.

Intuitively speaking, a non-stationary time series $X_t$ is fitted by an ARIMA$(p, d, q)$ process if
(i) $Y_t := (1 - B)^d X_t$ is a stationary time series, where $B$ is the backward shift operator, i.e., $B^j X_t = X_{t-j}$, and $d$ is the number of non-seasonal differences needed for stationarity, called the order of integration;
(ii) the stationary series $Y_t$ is an ARMA$(p, q)$ process, i.e., for every $t$,
$$Y_t = c + \phi_1 Y_{t-1} + \cdots + \phi_p Y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q},$$
where $c$ is a constant, $\phi_i$ and $\theta_j$ are the autoregressive and moving-average coefficients, and $\varepsilon_t$ is the random error.
The parameter $p$ is the number of autoregressive terms and $q$ is the number of lagged forecast errors in the prediction equation. ARIMA processes thus have two components: an autoregressive (AR) model of order $p$ and a moving-average (MA) model of order $q$.
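The two ingredients of an ARIMA model, differencing and the ARMA prediction equation, can be illustrated with a minimal Python sketch. The function names `difference` and `arma_one_step`, the coefficient values and the sign convention (plus signs on the moving-average terms) are assumptions made for this example; they are not the model fitted to VNINDEX.

```python
def difference(x, d=1):
    """Apply the operator (1 - B)^d: difference the series d times."""
    for _ in range(d):
        x = [curr - prev for prev, curr in zip(x, x[1:])]
    return x


def arma_one_step(y, errors, c, phi, theta):
    """One-step ARMA(p, q) prediction for a stationary series.

    y      : past values, most recent last
    errors : past forecast errors, most recent last
    c      : constant term
    phi    : [phi_1, ..., phi_p] autoregressive coefficients
    theta  : [theta_1, ..., theta_q] moving-average coefficients
    """
    ar_part = sum(p_i * y[-i] for i, p_i in enumerate(phi, start=1))
    ma_part = sum(t_j * errors[-j] for j, t_j in enumerate(theta, start=1))
    return c + ar_part + ma_part


# A quadratic trend becomes constant after two differences.
levels = [1, 4, 9, 16, 25]
twice_differenced = difference(levels, d=2)  # [2, 2, 2]

# Hypothetical ARMA(2, 1) coefficients, chosen only for illustration.
prediction = arma_one_step(y=[1.0, 2.0], errors=[0.5], c=0.1,
                           phi=[0.5, 0.25], theta=[2.0])
```

In practice the orders $p$, $d$, $q$ and the coefficients are estimated from the data rather than fixed by hand as in this sketch.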

The artificial neural network approach
One of the most important advantages of Artificial Neural Networks is their ability to approximate a wide range of complex non-linear time series. The ANN was developed as a statistical learning algorithm that mimics the neural networks of the human brain. It can process information from data in parallel and hence provides a powerful tool for forecasting time series more accurately. The ANN model consists of an input layer, an output layer and one or more hidden layers; a single hidden layer is the most common choice for modelling and forecasting time series (see, e.g., 5 ). The algorithm of the ANN can be described as follows. The input layer has one or more inputs, where an input is a vector value. Each node in the input layer can be connected to the nodes of the first hidden layer. The data pass through the hidden layers until they reach the output layer; see, for example, Figure 1. Intuitively speaking, let $Y_t$ be a time series. The relationship between the future value (the output) $Y_t$ and its past values (the inputs) $Y_{t-1}, Y_{t-2}, \ldots, Y_{t-p}$ can be represented by the equation
$$Y_t = a_0 + \sum_{j=1}^{q} a_j f\Big(\omega_{0j} + \sum_{i=1}^{p} \omega_{ij} Y_{t-i}\Big) + \varepsilon_t, \qquad (1)$$
where $a_j$ and $\omega_{ij}$, $i = 1, 2, \ldots, p$; $j = 1, 2, \ldots, q$, are the parameters of the model, called the connection weights between the layers, $a_0$ and $\omega_{0j}$ are the corresponding bias terms, and $\varepsilon_t$ is the random error. The parameters $p$ and $q$ are the number of input nodes and the number of hidden nodes in the model. The function $f$ is the transfer function of the hidden layer and takes the form
$$f(x) = \frac{1}{1 + e^{-x}}.$$
This is the logistic function 6 , or sigmoid function, which takes values in $(0, 1)$. Furthermore, $f$ is real-valued and differentiable, with a positive first derivative and a single inflection point. From (1), we see that the ANN model forecasts the future value by performing a non-linear functional mapping of the past observations.
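The deterministic part of a single-hidden-layer network with a logistic transfer function can be sketched in a few lines of Python. The sketch assumes the form $a_0 + \sum_{j=1}^{q} a_j f(\omega_{0j} + \sum_{i=1}^{p} \omega_{ij} Y_{t-i})$ with $f(x) = 1/(1 + e^{-x})$; the function names and the weight values below are hypothetical and are not trained on any data.

```python
import math


def logistic(x):
    """Logistic (sigmoid) transfer function f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def ann_forecast(lags, a, w):
    """Forward pass of a single-hidden-layer network.

    lags : [Y_{t-1}, ..., Y_{t-p}], the p input values
    a    : [a_0, a_1, ..., a_q], output bias followed by output weights
    w    : q lists, one per hidden node j, each [w_0j, w_1j, ..., w_pj]
           (hidden bias followed by the p input weights)
    """
    out = a[0]
    for a_j, w_j in zip(a[1:], w):
        activation = w_j[0] + sum(w_ij * y for w_ij, y in zip(w_j[1:], lags))
        out += a_j * logistic(activation)
    return out


# Hypothetical weights for p = 2 inputs and q = 2 hidden nodes.
weights_hidden = [[0.1, 0.4, -0.2], [-0.3, 0.2, 0.5]]
weights_output = [0.05, 0.7, -0.6]
forecast = ann_forecast([0.01, -0.02], weights_output, weights_hidden)
```

Here $p$ is the length of `lags` and $q$ is the number of rows in `w`, so changing the network size only changes the shapes of these lists.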
The ANN model in (1) can therefore be written in the general form
$$Y_t = \phi(Y_{t-1}, Y_{t-2}, \ldots, Y_{t-p}, \omega) + \varepsilon_t,$$
where $\omega$ is the vector of parameters and the function $\phi$ is determined by the network structure and the connection weights. Hence the ANN can be seen as a nonlinear autoregressive model. The main task when fitting an ANN model to a time series is to select the number of lagged observations $p$ and an appropriate number of hidden nodes $q$. Unfortunately, there are no theoretical methods to guide the selection of these parameters, so in practice appropriate values of $p$ and $q$ are chosen by experiment.

The hybrid approach
The hybrid approach assumes that the time series $X_t$ is the sum of a linear and a non-linear component,
$$X_t = L_t + N_t, \qquad (2)$$
where $L_t$ and $N_t$ denote the linear and non-linear components, respectively. These components are estimated from the data. In the first stage, the ARIMA approach is used to model the linear component; the residuals $e_t$ of the linear model then carry the non-linear relationship, so the ANN approach can be applied to them. Let $\hat{L}_t$ denote the forecast value at time $t$ from the ARIMA model. Then
$$e_t = X_t - \hat{L}_t. \qquad (3)$$
By the ANN approach, $e_t$ takes the form
$$e_t = \varphi(e_{t-1}, e_{t-2}, \ldots, e_{t-n}) + E_t, \qquad (4)$$
where $\varphi$ is a non-linear function determined by the neural network, $n$ is the number of lagged residuals used as inputs, and $E_t$ is the random error. Let $\hat{N}_t$ denote the forecast value from (4). From (2), (3) and (4), the forecast value $\hat{X}_t$ of the series is
$$\hat{X}_t = \hat{L}_t + \hat{N}_t.$$
So the hybrid ARIMA neural network model is carried out in two steps:
(i) forecast the values $\hat{L}_t$ with the ARIMA model;
(ii) forecast the residuals of the ARIMA model with the ANN model, which gives $\hat{N}_t$.
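The two-step procedure can be sketched end to end in plain Python. To keep the example self-contained, a least-squares AR(1) model stands in for ARIMA and a tiny one-hidden-layer logistic network trained by stochastic gradient descent stands in for the ANN; these stand-ins, all function names, the synthetic series and the hyperparameters are assumptions for illustration, not the models or data used in this study.

```python
import math
import random


def fit_ar1(x):
    """Least-squares fit of x_t = c + phi * x_{t-1}; a simple stand-in for ARIMA."""
    prev, curr = x[:-1], x[1:]
    mean_prev, mean_curr = sum(prev) / len(prev), sum(curr) / len(curr)
    cov = sum((a - mean_prev) * (b - mean_curr) for a, b in zip(prev, curr))
    var = sum((a - mean_prev) ** 2 for a in prev)
    phi = cov / var
    c = mean_curr - phi * mean_prev
    return lambda last_value: c + phi * last_value


def fit_tiny_ann(inputs, targets, hidden=3, epochs=200, lr=0.05, seed=0):
    """One-hidden-layer logistic network trained by plain SGD on squared error."""
    rng = random.Random(seed)
    p = len(inputs[0])
    w = [[rng.uniform(-0.5, 0.5) for _ in range(p + 1)] for _ in range(hidden)]
    a = [rng.uniform(-0.5, 0.5) for _ in range(hidden + 1)]

    def forward(z):
        h = [1.0 / (1.0 + math.exp(-(wj[0] + sum(wi * zi for wi, zi in zip(wj[1:], z)))))
             for wj in w]
        return a[0] + sum(aj * hj for aj, hj in zip(a[1:], h)), h

    for _ in range(epochs):
        for z, y in zip(inputs, targets):
            out, h = forward(z)
            err = out - y
            for j in range(hidden):
                grad_hidden = err * a[j + 1] * h[j] * (1.0 - h[j])
                a[j + 1] -= lr * err * h[j]
                w[j][0] -= lr * grad_hidden
                for i in range(p):
                    w[j][i + 1] -= lr * grad_hidden * z[i]
            a[0] -= lr * err
    return lambda z: forward(z)[0]


def hybrid_forecast(x, n_lags=2):
    """One-step-ahead hybrid forecast: linear forecast plus forecast residual."""
    linear = fit_ar1(x)
    # Step (i): in-sample linear forecasts of x[1:] and their residuals e_t.
    l_hat = [linear(v) for v in x[:-1]]
    e = [x_t - l_t for x_t, l_t in zip(x[1:], l_hat)]
    # Step (ii): model e_t as a non-linear function of its n_lags previous values.
    inputs = [e[t - n_lags:t] for t in range(n_lags, len(e))]
    targets = e[n_lags:]
    ann = fit_tiny_ann(inputs, targets)
    next_linear = linear(x[-1])
    next_residual = ann(e[-n_lags:])
    return next_linear, next_residual, next_linear + next_residual


# Synthetic series, used only to exercise the code.
series = [0.05 * math.sin(0.3 * t) + 0.02 * math.cos(0.9 * t) ** 2 for t in range(80)]
linear_part, nonlinear_part, forecast = hybrid_forecast(series)
```

The additive structure is visible in the last line of `hybrid_forecast`: the final forecast is simply the sum of the linear forecast and the forecast residual.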

Data set
In this study we use the weekly closing prices of VNINDEX from January 4, 2006 to September 28, 2018 (Figures 2 and 3). There are 663 trading weeks in total in this period. The data are divided into two periods: the first 654 weeks form the training set used for model estimation, and the last 9 weeks form the test set reserved for forecasting and evaluation. Financial time series, especially stock prices, are often not stationary. Transforming stock prices into log returns is the most common approach in analysing financial data. Let $P_t$ be the stock price at time $t$. The log returns $R_t$ are defined as
$$R_t = \ln\left(\frac{P_t}{P_{t-1}}\right) = \ln P_t - \ln P_{t-1}.$$
For more details on the useful properties of log returns, we refer to 8 . The log returns are also called continuously compounded returns. The plots of the stock prices and the weekly log returns are shown in Figure 2 and Figure 3.
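The log-return transformation and the chronological split into training and test sets can be sketched as follows. The function names and the short price list are hypothetical; only the 654/9 split sizes come from the text above.

```python
import math


def log_returns(prices):
    """Log returns R_t = ln(P_t / P_{t-1}) for consecutive prices."""
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]


def split_train_test(series, n_test):
    """Chronological split: the last n_test observations form the test set."""
    return series[:-n_test], series[-n_test:]


# Hypothetical closing prices, used only to exercise the functions.
prices = [100.0, 110.0, 99.0, 104.5]
returns = log_returns(prices)

# Log returns add up over time: their sum is the log return of the whole period.
total_return = sum(returns)  # equals ln(104.5 / 100.0) up to rounding

# With 663 weekly observations and 9 held out, 654 remain for training.
train, test = split_train_test(list(range(663)), n_test=9)
```

The additivity shown in `total_return` is one of the properties that make log returns convenient for time series work.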

Error measures
We introduce some of the most common error (accuracy) measures used for comparing forecasts of financial time series. These measures are used to identify the most suitable forecasting method. Throughout, $Y_t$ denotes the actual value, $\hat{Y}_t$ the forecast value and $N$ the sample size. The most widely preferred measure of the forecasting accuracy of a model is the Root Mean Square Error (RMSE); see, e.g., R. Carbone and J. S. Armstrong 9 for more details. It is defined as
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{t=1}^{N} \big(Y_t - \hat{Y}_t\big)^2}.$$
The Mean Absolute Percentage Error (MAPE) is also a common error measure (see 10 ):
$$\mathrm{MAPE} = \frac{100\%}{N} \sum_{t=1}^{N} \left| \frac{Y_t - \hat{Y}_t}{Y_t} \right|.$$
Another popular error measure is the Mean Absolute Error (MAE):
$$\mathrm{MAE} = \frac{1}{N} \sum_{t=1}^{N} \big| Y_t - \hat{Y}_t \big|.$$
This measure is easy both to understand and to compute.
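The three error measures can be computed directly from a list of actual values and a list of forecasts. The sketch below assumes the standard definitions of RMSE, MAPE (expressed in percent) and MAE; the function names and the sample numbers are hypothetical and are not results from this study.

```python
import math


def rmse(actual, forecast):
    """Root Mean Square Error."""
    n = len(actual)
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(actual, forecast)) / n)


def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent; actual values must be non-zero."""
    n = len(actual)
    return 100.0 * sum(abs((y - f) / y) for y, f in zip(actual, forecast)) / n


def mae(actual, forecast):
    """Mean Absolute Error."""
    n = len(actual)
    return sum(abs(y - f) for y, f in zip(actual, forecast)) / n


# Hypothetical values, used only to exercise the functions.
actual = [100.0, 200.0]
forecast = [110.0, 180.0]
scores = {
    "RMSE": rmse(actual, forecast),
    "MAPE": mape(actual, forecast),
    "MAE": mae(actual, forecast),
}
```

Because RMSE squares the errors before averaging, it penalizes large errors more heavily than MAE, which is why the two measures are usually reported together.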

Results for price data
We use the ARIMA, ANN and Hybrid models to fit the VNINDEX data, compare them, and choose the best model for this data set. A number of studies have fitted financial data with these models and shown that the hybrid model is the best model for fitting and forecasting market closing prices (see 2,3,11,12 ). For the Vietnamese market, we also find that the hybrid model is the best model for fitting VNINDEX; see the following table for a comparison of the error measures of these models.

DISCUSSIONS
This work is a first attempt at applying sophisticated quantitative models to the study of VNINDEX. To strengthen our results, further data sets and models should be used for testing and validation. We plan to investigate other stock indices in the Thomson Reuters database and to explore further candidate models and the improvements they require. We are also interested in studying whether indices from different countries favor the same type of model or exhibit country-specific effects.