Time series prediction: A combination of Long Short-Term Memory and structural time series models

ABSTRACT
The stock market is an important capital mobilization channel for the economy. However, the market carries potential losses because stock prices fluctuate to reflect uncertain events such as political news and the supply and demand of daily trading volume. There are many approaches to reducing risk, such as portfolio construction and optimization or hedging strategies. Hence, it is critical to leverage time series prediction techniques to achieve higher performance in the stock market. Recently, Vietnam's stock markets have gained more and more attention as their performance and capitalization have improved. In this work, we use market data from Vietnam's two stock exchanges to develop an incorporated model that combines a Sequence to Sequence deep learning model with Long Short-Term Memory and structural time series models. We choose the 21 most traded stocks with over 500 trading days from the VN-Index of the Ho Chi Minh Stock Exchange and the HNX-Index of the Hanoi Stock Exchange (Vietnam) to fit the proposed model and compare its performance with pure structural models and Sequence to Sequence. For back testing, we use our model to decide long or short positions when trading VN30F1M (the VN30 Index futures contract settled within one month), which is traded on the HNX exchange. Results suggest that the combination of the Sequence to Sequence model with LSTM and structural time series models achieves higher performance, with lower prediction errors in terms of mean absolute error than existing models for stock price prediction, and positive profit for derivative trading. This work contributes significantly to the time series prediction literature, as our approach can relax the heavy assumptions of existing methodologies such as the Autoregressive Moving-Average model and Generalized Autoregressive Conditional Heteroskedasticity. In practice, investors in the Vietnam stock market can use the proposed model to develop trading strategies.


INTRODUCTION
Describing the behavior of an observed time series plays a critical role in understanding the past and predicting the future in many disciplines. In quantitative finance, time series prediction is a very important task for risk management to measure the uncertainty of investment returns 1 , portfolio construction for hedge funds 2 , and high-frequency trading 3 . However, creating highly accurate predictions of time series with low error is not an easy job, due to the high fluctuations of the stock market. From this perspective, many methods have been proposed to study historical patterns of time series data and create high-quality stock price predictions. To measure the quality of a prediction, forecast error indicators are computed from the actual price and the predicted price. Mean Square Error (MSE) and Root-Mean-Square Error (RMSE) are popular indicators for this measurement.

In terms of time series prediction, there are many approaches to address the matter; however, two methods have been widely adopted. The first is univariate analysis to capture volatility. It includes autoregressive models (AR(p)), moving-average models (MA(q)), the combination of autoregressive and moving-average models (ARMA(p, q)) for linear processes, and generalized autoregressive conditional heteroscedastic models (GARCH(p, q)) for nonlinear processes to model stock returns 4 . The transformation from price to return of a stock by differencing poses a problem: unobserved components (e.g. seasonal components, trend) of the raw series are eliminated. Furthermore, with differencing it is hard to interpret and select an adequate model. Hence, a second approach was proposed to fill the gap 5 . It is called Structural Time Series Models, which comprise a trend component, a seasonal component, and a random irregular component to model a time series without differencing.

With the revolution in computational power, beside statistical models, machine learning and deep learning models have been widely adopted to solve many problems from academia to industry. In financial time series prediction, we can leverage these models to achieve higher accuracy. In deep learning, models are considered black boxes with billions of parameters; however, feature engineering is still important work to improve model accuracy 6 . Furthermore, a neural network like the Long Short-Term Memory of deep learning can link the current event to previous events, while Structural Time Series Models only depend on the previous event. Hence, it is critical to combine these two approaches to address the limitations. In this work, we describe step-by-step procedures to fit data with Structural Time Series Models, the Sequence to Sequence model, and the combination of these two models. We report the results of the fitting process on 21 stocks listed on the Ho Chi Minh Stock Exchange, then compare the prediction results. Furthermore, we use the proposed model to automatically trade the VN30F1M futures contract on the Hanoi Stock Exchange for back testing.

LITERATURE REVIEW

Trend Models of Structural Time Series Models
Decomposition of a time series is an important procedure. Traditionally, regarded as deterministic functions of time, non-stationary time series are often detrended by differencing before models are constructed from the processed data. It has been suggested that this procedure may lead to misleading results if the trend is not deterministic 7 . A structural time series model decomposes a time series into the three components of trend, seasonality, and cycle 8,9 . It is defined by the following equation:

y(t) = g(t) + s(t) + h(t) + ε_t    (1)

where t = 1, . . . , T, g(t) is a stochastic, non-periodic trend, s(t) is a stationary linear seasonal process with periodic changes (e.g. quarterly, yearly seasonality), h(t) captures cyclical effects occurring on potentially irregular schedules over one or more days, and ε_t is the irregular disturbance 10 . Many studies strongly supporting the model in practice have been carried out. For instance, Harvey showed that the class of structural models has several advantages over the commonly adopted seasonal ARIMA models and is applicable to modeling cycles in macroeconomic time series 5,11 . Kitagawa and Gersch decomposed time series into trend, seasonal, globally stationary autoregressive, and observation error components with a state space Kalman filter, and used Akaike's minimum AIC procedure to select the best of the alternative state space models 12 . Taylor and Letham use structural models for forecasting business time series 10 .
The local linear trend is a process that can be regarded as a local approximation to a linear trend. The stochastic linear process can be described as:

y_t = μ_t + ε_t,          ε_t ∼ NID(0, σ²_ε)
μ_{t+1} = μ_t + ν_t + η_t,   η_t ∼ NID(0, σ²_η)
ν_{t+1} = ν_t + ζ_t,         ζ_t ∼ NID(0, σ²_ζ)

where the level μ_t and slope ν_t are stochastic, and the disturbance terms ε_t, η_t, and ζ_t are white noise, distributed independently of one another with mean zero and variances σ²_ε, σ²_η, and σ²_ζ respectively 13 . Koopman and Ooms 14 proposed a trend with a stationary drift process to extend the local linear trend process by adding a stationary stochastic drift component. There is a drawback with this approach: such drift processes are difficult to identify, as they require very large data samples. Taylor and Letham 10 developed new types of trend models. Accordingly, they suggested that there are two types of trend models: a saturating growth model and a piecewise linear model (see Figure 1). The saturating growth model is characterized by a growth rate and a limit on growth, applying the nonlinear logistic function:

g(t) = C / (1 + exp(−k(t − m)))

where e is the natural logarithm base, m is the value of the sigmoid midpoint, C is the maximum capacity value, and k is the growth rate of the curve. In this basic form, the model cannot capture movement in a dynamic world, because the maximum capacity value and the growth rate of the curve are not constant. Hence, to overcome these issues, Taylor and Letham defined a time-varying maximum capacity C(t) and growth rate. Suppose that we explicitly define S changepoints at times s_j , j = 1, . . . , S, and a vector of rate adjustments δ ∈ R^S, with δ_j the change in rate that occurs at time s_j , and let a_j(t) = 1 if t ≥ s_j and 0 otherwise 10 . The saturating growth model is defined as:

g(t) = C(t) / (1 + exp(−(k + a(t)ᵀδ)(t − (m + a(t)ᵀγ))))

The maximum capacity C(t) is adopted from an external data source. From the saturating growth model, we can define the piecewise linear model, which does not exhibit saturating growth:

g(t) = (k + a(t)ᵀδ) t + (m + a(t)ᵀγ)

As in the saturating growth model, k is the growth rate, δ holds the rate adjustments, m is the offset parameter, and γ_j is set to −s_j δ_j to make the function continuous.
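To make the piecewise linear trend concrete, below is a minimal NumPy sketch of g(t) with explicit changepoints; the changepoint times and rate adjustments are illustrative values, not parameters fitted in this paper.

```python
import numpy as np

def piecewise_linear_trend(t, k, m, s, delta):
    """Piecewise linear trend g(t) = (k + a(t)^T delta) * t + (m + a(t)^T gamma).

    t     : array of time indices
    k, m  : base growth rate and offset
    s     : array of S changepoint times
    delta : array of S rate adjustments
    """
    # a(t): indicator matrix, a[i, j] = 1 if t[i] >= s[j]
    a = (t[:, None] >= s[None, :]).astype(float)
    gamma = -s * delta                      # keeps the trend continuous at each changepoint
    return (k + a @ delta) * t + (m + a @ gamma)

# Illustrative example: base slope 0.5, rate drops by 0.3 at t=40 and rises by 0.2 at t=70
t = np.arange(100.0)
g = piecewise_linear_trend(t, k=0.5, m=10.0,
                           s=np.array([40.0, 70.0]),
                           delta=np.array([-0.3, 0.2]))
```

Setting γ_j = −s_j δ_j, as in the last line of the formula, is what removes the jump that would otherwise appear when the slope changes at s_j.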

Recurrent Neural Network
Despite the power of deep neural networks, traditional neural networks have two drawbacks 15 . Firstly, the main assumption of standard neural networks is independence among the samples (data points). In other words, traditional neural networks cannot link the current event to previous events to inform later ones, because they preserve no state. In time series analysis, it is widely accepted that the current value depends on past values 4 , so the independence assumption fails. Secondly, traditional neural networks require a fixed-length vector for each sample. Hence, it was critical to develop a new generation of deep neural networks. Rumelhart, Hinton, and Williams (p.533) introduced a new learning procedure for neural networks with backpropagation, which can capture internal hidden state to "represent important features of the task domain" 16 . Furthermore, with current developments, recurrent neural networks can model sequential data with varying lengths and time dependences. A simple feed-forward recurrent neural network is defined as 17 :

h(t) = σ(W_hx x(t) + W_hh h(t−1) + b_h)
y(t) = softmax(W_yh h(t) + b_y)

where h(t) is the hidden state for the input data point at time t. Clearly, h(t) is influenced by h(t−1), the network's previous state. The output y(t) at each time t is calculated given the hidden node values h(t) at time t. W_hx is the weight matrix of the input-hidden layer, W_yh is the weight matrix of the hidden-output layer, and W_hh is the matrix of the hidden-to-hidden transition. In most contexts, h(0) is initialized to zero. Haykin, Principe, Sejnowski, and Mcwhirter suggested that an RNN can achieve stability and higher performance with nonzero initialization 18 . In comparison to a traditional fully connected feedforward network, a recurrent neural network takes advantage of sharing parameters across the model, which helps it learn without treating each position of a sentence or series separately 19 . Earlier, Jordan proposed an architecture almost identical to 17 ; however, its context nodes are fed from the output layer instead of from hidden layers 20 . This means that Jordan's neural network can take the previously predicted output into account when predicting the current output.
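As an illustration, here is a minimal NumPy sketch of the forward pass defined above; the dimensions and the zero initialization of h(0) are assumptions made for the example.

```python
import numpy as np

def rnn_forward(X, W_hx, W_hh, W_yh, b_h, b_y):
    """Forward pass of a simple RNN: h(t) = sigmoid(W_hx x(t) + W_hh h(t-1) + b_h)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()
    h = np.zeros(W_hh.shape[0])            # h(0) initialized to zero
    outputs = []
    for x in X:                            # X: sequence of input vectors
        h = sigmoid(W_hx @ x + W_hh @ h + b_h)
        outputs.append(softmax(W_yh @ h + b_y))
    return np.array(outputs), h

# Toy dimensions: 3 input features, 5 hidden units, 2 output classes
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))               # sequence of length 10
W_hx, W_hh = rng.normal(size=(5, 3)), rng.normal(size=(5, 5))
W_yh, b_h, b_y = rng.normal(size=(2, 5)), np.zeros(5), np.zeros(2)
y_seq, h_T = rnn_forward(X, W_hx, W_hh, W_yh, b_h, b_y)
```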
In terms of training, there are two steps to train a recurrent neural network. First, forward propagation creates the outputs ŷ. After that, the loss function value L(ŷ_k, y_k) of the network at each output node k is computed in the backpropagation stage. There are many types of loss function to measure the distance between the output and the actual value in classification problems. To minimize the distance, we need to update each of the weights iteratively by applying the backpropagation algorithm 16 . The algorithm applies the derivative chain rule to calculate the derivative of the loss function L for each parameter in the network. In addition, the weights of the neural network are updated by the gradient descent algorithm 15 . The gradient of the error of an output neuron k is calculated as:

δ_k = (∂L/∂ŷ_k) g′_k(a_k)    (10)

where a_k is the weighted input to node k and g′_k is the derivative of the activation function for node k. The first term expresses how fast the cost changes as a function of the estimated output. The second term g′_k(a_k) gives the rate of change of the activation function g_k at a_k. In vectorized form, we generalize equation (10) for the final layer l:

δ^l = ∇_ŷ L ⊙ g′(a^l)    (11)

From δ^{l+1}, we can compute the error of the preceding layer δ^l as:

δ^l = ((W^{l+1})ᵀ δ^{l+1}) ⊙ g′(a^l)    (12)

and the error with respect to any weight and bias in the neural network:

∂L/∂b^l = δ^l,  ∂L/∂W^l = δ^l (a^{l−1})ᵀ    (13)

From the final layer back to the first hidden layer, for each layer of the neural network, we can apply backpropagation and compute the error vector δ^l with the chain rule repeatedly to update the weight and bias vectors. In terms of local minimum optimization, gradient descent is utilized to find the minimum of the cost function by updating the weight and bias vectors. It is computed as:

W^l → W^l − (η/m) Σ_x δ^{x,l} (a^{x,l−1})ᵀ,  b^l → b^l − (η/m) Σ_x δ^{x,l}    (14)

where m is the number of training examples in a given mini-batch with each training example x, and η is the step size. In practice, many optimizers have been developed to improve on the limitations of mini-batch gradient descent 21 . For instance, momentum 22 and Nesterov's accelerated gradient 23 were developed to relax the ravine-navigating problem of stochastic gradient descent. The recurrent neural network is a breakthrough for temporal sequences, adding an internal state (memory) in each cell to process sequences of inputs. In terms of training, recurrent neural network parameters can be computed and optimized by feed-forward propagation and backpropagation. For a shallow network with a few hidden layers, the algorithm can be trained effectively. However, with many hidden layers, it is hard to train the network due to the vanishing and exploding gradient problem, as derivatives become too small (e.g. 0 to 1 for the sigmoid activation function) or too large. This only allows the network to learn short-range dependencies and prevents it from learning long-range dependencies. As a result, the long short-term memory network architecture 24 , the rectified linear unit activation function 25 , and the residual learning framework of He, Zhang, Ren, and Sun 26 were introduced to overcome this limitation.
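A compact NumPy sketch of the mini-batch update in equation (14), with an optional momentum term in the spirit of 22 ; all variable names are illustrative, and setting mu=0 recovers plain mini-batch gradient descent.

```python
import numpy as np

def sgd_momentum_step(W, b, grads_W, grads_b, vW, vb, eta=0.01, mu=0.9):
    """One mini-batch update: average the per-example gradients (the (1/m) sum
    in equation (14)), then apply gradient descent with a momentum term."""
    gW = np.mean(grads_W, axis=0)    # grads_W: per-example weight gradients, shape (m, ...)
    gb = np.mean(grads_b, axis=0)
    vW = mu * vW - eta * gW          # momentum accumulates a velocity from past gradients
    vb = mu * vb - eta * gb
    return W + vW, b + vb, vW, vb
```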

Long-Short Term Memory Network
Formally identified by Hochreiter with both theoretical and experimental approaches 27 , when long-term dependencies are involved, the backpropagation algorithm of the recurrent neural network suffers from error signals that tend to explode or vanish through time, which may lead to oscillating weights or an unusable model. Not just for recurrent neural networks: Bengio, Simard, and Frasconi 28 also pointed out that any deep feed-forward neural network with shared weights may have the vanishing gradient problem. Hochreiter and Schmidhuber (p.6) developed a new approach called Long Short-Term Memory (LSTM) to fill these gaps by introducing an "input gate unit", an "output gate unit", and a "memory cell" 24 . Accordingly, the purpose of the multiplicative input gate unit is to protect memory contents from irrelevant inputs, and the multiplicative output gate unit is to protect other units from perturbation by currently irrelevant stored memory contents. In other words, with the new LSTM architecture (see Figure 2), each cell can maintain its state over time and adjust input or output information. Hence, this new type of neural network architecture is able to capture very long-term temporal dependencies effectively and handle noise and continuous values, with unlimited state numbers in principle. Since its introduction, with the revolution in computational power, LSTM has been widely adopted and applied to many difficult problems in many fields, in practice and academia. This includes language modeling 28 , text classification 30 , language translation 30 , and speech recognition 31 . From the original LSTM proposed by Hochreiter and Schmidhuber 24 , a significant improvement was developed by introducing forget gates to reset out-of-date contents of LSTM memory cells 32 . In addition, to achieve a higher capability of learning timings, peephole connections that allow gates to look at the cell state were added to the LSTM neural network. A forward pass of the LSTM architecture with forget gate and peephole connections is described as 33 :

i_t = σ(W_xi x_t + W_hi h_{t−1} + W_ci c_{t−1} + b_i)
f_t = σ(W_xf x_t + W_hf h_{t−1} + W_cf c_{t−1} + b_f)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_xc x_t + W_hc h_{t−1} + b_c)
o_t = σ(W_xo x_t + W_ho h_{t−1} + W_co c_t + b_o)
h_t = o_t ⊙ tanh(c_t)

where i_t, f_t, and o_t are the input, forget, and output gate activations, c_t is the cell state, h_t is the hidden state, σ is the logistic sigmoid function, and the W_c• terms are the diagonal peephole weights.
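A minimal NumPy sketch of one peephole LSTM step implementing the equations above; the parameter layout is an assumption for illustration, with the diagonal peephole weights stored as vectors acting elementwise on the cell state.

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, p):
    """One forward step of an LSTM cell with forget gate and peephole connections.
    p is a dict of weights; peephole weights (w_ci, w_cf, w_co) are diagonal,
    so they multiply the cell state elementwise."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    i = sig(p["W_xi"] @ x + p["W_hi"] @ h_prev + p["w_ci"] * c_prev + p["b_i"])
    f = sig(p["W_xf"] @ x + p["W_hf"] @ h_prev + p["w_cf"] * c_prev + p["b_f"])
    c = f * c_prev + i * np.tanh(p["W_xc"] @ x + p["W_hc"] @ h_prev + p["b_c"])
    o = sig(p["W_xo"] @ x + p["W_ho"] @ h_prev + p["w_co"] * c + p["b_o"])  # peeks at the new cell state
    h = o * np.tanh(c)
    return h, c
```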

Like the RNN, the LSTM is trained with gradient descent, as it is a differentiable function estimator 34 . The backpropagation equations of the LSTM are obtained by applying the chain rule through each gate and the cell state, in the same way as equations (10)-(14); the full derivation is detailed in 33 . It is worth noting that, unlike the forget gate, peephole connections are not always implemented, because omitting them simplifies the LSTM and reduces computational cost without significantly sacrificing performance. For instance, Keras 35 does not support peepholes, but CNTK and TensorFlow do 35 .

Sequence to sequence model
Sequence to Sequence is a learning model that maps an input sequence to a fixed-sized vector using one LSTM, and then maps that vector to an output sequence using another LSTM. Sequence to Sequence has been widely applied in machine translation 40 , video captioning 41 , and time series classification for human activity recognition 42 . Bahdanau et al. used an RNN Encoder-Decoder containing two recurrent neural networks (or long short-term memories) to map an input sequence into another sequence of symbols 43 . In other words, the encoder-decoder architecture is used to encode a sequence, decode the encoded sequence, and recreate the sequence. The approach aims to maximize the conditional probability of the output sequence given an input sequence. The encoder neural network transforms an input sequence of variable length X = x_1, x_2, . . . , x_T into a fixed-length context variable carrying the information of the sequence (see Figure 3). An RNN is mostly used as the encoder neural network; however, Sutskever et al. 40 found that deep LSTMs significantly outperformed shallow LSTMs and RNNs. As mentioned, the RNN and LSTM use previous hidden states h_1, h_2, . . . , h_{t−2}, h_{t−1} to create the current hidden state h_t. Hence, the hidden state of an input sequence is defined as:

h_t = f(h_{t−1}, x_t),  c = q(h_1, . . . , h_T)

where h_t is the hidden state at time t, c is the summary hidden state of the whole input sequence, and the functions f and q can be an RNN, LSTM, or GRU network, or an activation function. With the summary hidden state c, given a target output Y = y_1, y_2, . . . , y_{T′}, instead of computing P(Y|X) directly, the decoder neural network computes the conditional probability using previous information and the summary hidden state. It is formally described as:

P(y_1, . . . , y_{T′} | x_1, . . . , x_T) = ∏_{t′=1}^{T′} P(y_{t′} | c, y_1, . . . , y_{t′−1})    (18)

The trained sequence to sequence model can be used to generate a sequence given an input sequence. In machine translation, reversing the order of the words of the input sequence is necessary because it makes it easy for the optimizer (e.g. stochastic gradient descent) to "establish communication between the input and the output" 40 (p.3). By nature, time series prediction problems already have the desired order, as the input and output are straightforward sequences.
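A sketch of encoder-decoder inference under these definitions, reusing the lstm_step() sketch from the previous section; it assumes, for illustration, that the decoder input and output share the dimensionality of the observations and that decoding is seeded with the last observed value.

```python
import numpy as np

def seq2seq_predict(X, params_enc, params_dec, W_out, b_out, horizon):
    """Sketch of encoder-decoder inference: encode X into states (h, c), then
    decode step by step, feeding each prediction back as the next decoder input."""
    h = np.zeros_like(params_enc["b_i"])
    c = np.zeros_like(h)
    for x in X:                                  # encoder: consume the input sequence
        h, c = lstm_step(x, h, c, params_enc)
    y, outputs = X[-1], []                       # seed the decoder with the last observation
    for _ in range(horizon):                     # decoder: generate `horizon` steps
        h, c = lstm_step(y, h, c, params_dec)
        y = W_out @ h + b_out                    # linear readout for a regression target
        outputs.append(y)
    return np.array(outputs)
```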

Data
In this study, for liquidity and fairness of trading, we use daily price data of the 21 most traded stocks, each with more than 500 trading days, listed on the Ho Chi Minh Stock Exchange (VN-Index) and the Hanoi Stock Exchange (HNX-Index).

Data Pre-processing
Beyond algorithm improvement and parameter tuning, one approach to improving the accuracy of a machine learning model is to apply data pre-processing techniques. These techniques include imputing missing values, encoding categorical data, detecting outliers, transforming data, and scaling data. In this work, we perform logarithm and Box-Cox transforms on the input dataset, as sketched below. Rationally, the idea behind the logarithm transformation is to turn the probabilistic distribution of the raw input data from skewed into approximately normal, which can improve prediction performance dramatically 44 . However, in some circumstances, the logarithm technique does not generate new data that is less variable or more normal; on the contrary, it may make the data more variable and more skewed 45 . Thus, it is recommended that transformation techniques be applied very cautiously. Output data of the transform stage is passed to a data scaler to be normalized. There are many types of scaling methods (e.g. maximum absolute value, given range of feature). We use a min-max scaler, scaling the input feature to a range of [0,1]. This ensures that large input values do not overwhelm smaller input values, which helps to stabilize neural networks 46 .
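A minimal sketch of this pre-processing pipeline using NumPy, SciPy, and scikit-learn; the column name "close" and the file name are assumptions about the input data layout.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import MinMaxScaler

def preprocess(prices: pd.Series):
    """Log and Box-Cox transform a price series, then scale each to [0, 1]."""
    log_price = np.log(prices.values)                    # logarithm transform
    boxcox_price, lam = stats.boxcox(prices.values)      # Box-Cox transform (lambda fitted)
    scaler_lp, scaler_bc = MinMaxScaler(), MinMaxScaler()
    lp_scaled = scaler_lp.fit_transform(log_price.reshape(-1, 1))
    bc_scaled = scaler_bc.fit_transform(boxcox_price.reshape(-1, 1))
    return lp_scaled, bc_scaled, lam, scaler_lp          # keep the scaler to invert predictions

# Usage with a hypothetical data file containing a 'close' column:
# df = pd.read_csv("prices.csv")
# lp, bc, lam, scaler = preprocess(df["close"])
```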

Structural Time Series
The aim of this step is to create baseline models for evaluating the prediction quality of structural time series and sequence to sequence models against our proposed model. The mean square error was calculated to measure the performance of each out-of-sample forecast. We develop structural time series models as a baseline model. For this task, we choose the Prophet package, which was developed by Facebook for the Python programming language 10 . In this model, the data input is the log-transformed price of each stock. In terms of parameter tuning, we mostly use default settings, except for adding monthly, quarterly, and yearly seasonalities with custom Fourier orders; we initialize the Fourier orders of the monthly, quarterly, and yearly seasonalities to 20, 30, and 30 respectively. Because future data would have to be known at prediction time if we used the Box-Cox transformation as an extra regressor in the structural time series models, we omit that procedure 47 . Without an extra regressor, the model can generate predictions for the 21 selected tickers with window sizes from 5 to 45 days in 5-step increments. As mentioned, the Prophet model is a structural time series model that combines trend g(t), seasonality s(t), and irregular events. Figure 4 describes our procedure to generate out-of-sample predictions for model quality evaluation, and to extract the trend g(t) of each series as a feature input of the Sequence to Sequence model, using the Prophet model on the log-transformed and Box-Cox-transformed stock price series. In detail, for every stock v in the selected list of stocks, we transform the price to a log-scaled series LP and a Box-Cox series BC to use as input for a Prophet model P. We set no out-of-sample prediction (W=0) to extract the trend series T from the in-sample data generated by P as a feature of the Sequence to Sequence model. For performance comparison, we set w to each of the 9 window sizes W for out-of-sample prediction.
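A sketch of the Prophet configuration described above; the seasonal periods (30.5, 91.3, and 365.25 days) are standard approximations we assume, not values stated in the paper.

```python
import pandas as pd
from prophet import Prophet  # in older releases: from fbprophet import Prophet

def fit_prophet(df: pd.DataFrame, horizon: int) -> pd.DataFrame:
    """Fit a structural time series model on log prices and forecast `horizon` days.
    df must have columns 'ds' (date) and 'y' (log price)."""
    m = Prophet()
    m.add_seasonality(name="monthly", period=30.5, fourier_order=20)
    m.add_seasonality(name="quarterly", period=91.3, fourier_order=30)
    m.add_seasonality(name="yearly", period=365.25, fourier_order=30)
    m.fit(df)
    future = m.make_future_dataframe(periods=horizon)
    return m.predict(future)                 # columns include 'yhat' and 'trend'

# W=0 extracts the in-sample trend g(t) as a Seq2Seq feature;
# W in {5, 10, ..., 45} produces the out-of-sample baselines:
# trend = fit_prophet(df, horizon=0)["trend"]
```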

Sequence to sequence Model
As a second baseline model, we develop a Sequence to Sequence model with an LSTM architecture. We use Keras with the TensorFlow backend to create an Encoder-Decoder model to solve the sequence to sequence problem 35,36 . To benefit from the efficiency of parallel computation when training a deep learning neural network, we train the model on a virtual machine with a GPU on Google Cloud Platform. The sequence to sequence model uses the states of the encoder neural network to generate predictions from the decoder neural network. Hence, we feed normalized stock price series to the model and generate predictions. In Figure 5, we describe the approach we use to develop baseline predictions with the sequence to sequence model. Like a vanilla LSTM model for supervised learning, we train on the input data over many iterations. However, we discard the outputs of the encoder and use its hidden and cell states as the initial states of the decoder. Furthermore, to create predictions for the proposed model, we add the trend series (extracted as in Figure 4) as another input feature. The implementation is straightforward. First, as in Figure 4, we use scaled data of the Box-Cox BC and logarithm LP transforms as input data; we scale every BC and LP to the range from 0 to 1 to create x for every scaled list of stock prices X*. Furthermore, we use the log-transformed series as target data. We create and extract the hidden states of an encoder model En with LSTM architecture and initialize a decoder model De with these hidden states. A main advantage of Sequence to Sequence with LSTM over structural time series models is that it can dynamically generate predictions multiple time steps ahead without requiring extra data. In terms of accuracy, we found that a deeper LSTM model does not outperform a shallow one. Hence, we used an LSTM with a single hidden layer of 64 cells and the rectified linear unit activation function. To prevent over-fitting, we apply both L2 regularization and dropout, with a regularization parameter lambda of 0.0001 and a dropout rate of 0.001, as recommended 48 .
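A sketch of such an encoder-decoder in Keras under the stated hyperparameters (single layer, 64 cells, ReLU, L2 lambda 0.0001, dropout 0.001); the input shapes and layer names are illustrative, not the authors' exact code.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

n_features, latent_dim = 2, 64        # e.g. [log price, Box-Cox price]; 64 LSTM cells

# Encoder: discard the output sequence, keep the hidden and cell states
enc_in = keras.Input(shape=(None, n_features), name="encoder_input")
_, state_h, state_c = layers.LSTM(
    latent_dim, activation="relu", return_state=True,
    kernel_regularizer=regularizers.l2(1e-4), dropout=0.001)(enc_in)

# Decoder: initialized with the encoder states, predicts the log-price sequence
dec_in = keras.Input(shape=(None, 1), name="decoder_input")
dec_seq = layers.LSTM(
    latent_dim, activation="relu", return_sequences=True,
    kernel_regularizer=regularizers.l2(1e-4), dropout=0.001)(
        dec_in, initial_state=[state_h, state_c])
dec_out = layers.TimeDistributed(layers.Dense(1))(dec_seq)

model = keras.Model([enc_in, dec_in], dec_out)
model.compile(optimizer="adam", loss="mse")
```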

Sequence to Sequence with Structural Time Series Models
In this step, we combine the sequence to sequence model and the structural time series models. Specifically, we use the output dataset D (with W = 0) from Figure 5 as training data for Figure 6. In other words, we combine the trend component of the structural time series models with the price of the stock in Box-Cox and logarithm forms, as sketched below. The parameters of these models are defined exactly the same as in the aforementioned baseline models. We found that the results improve dramatically.
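A sketch of how the Prophet trend can be stacked with the transformed prices as encoder input features; the array names follow the sketches above and are illustrative.

```python
import numpy as np

# lp_scaled, bc_scaled: [0, 1]-scaled log and Box-Cox prices (see pre-processing sketch)
# trend_scaled: in-sample trend g(t) from the Prophet model with W = 0, scaled the same way
def build_encoder_input(lp_scaled, bc_scaled, trend_scaled):
    """Stack the three series into a (timesteps, 3) encoder feature matrix for the
    hybrid model; the pure Seq2Seq baseline simply omits the trend column."""
    return np.column_stack([lp_scaled.ravel(),
                            bc_scaled.ravel(),
                            trend_scaled.ravel()])
```

With this input, n_features in the encoder sketch above becomes 3 instead of 2; everything else in the model is unchanged.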

Results Analysis
Structural time series models were used to generate a set of out-of-sample forecasts over multiple window time steps in log-scale (see Table 1). In terms of prediction error, the results show that MSE = 0.087787 (CTG at 45 time steps ahead) is the highest and MSE = 0.003501 (SSI at 5 time steps ahead) is the lowest. Likewise, results from the Sequence to Sequence model (see Table 2) and the Sequence to Sequence with Structural Time Series Models (see Table 3) show that MSE = 0.231800 (PNJ at 45 time steps ahead) and MSE = 0.046146 (ACB at 45 time steps ahead) are the highest, while MSE = 0.000068 (CII at 5 time steps ahead) and MSE = 0.000006 (CII at 5 time steps ahead) are the lowest. Figure 6 plots the prediction output of the models against the actual data of the HCM stock.

In terms of back testing, by applying the setup of Figure 6, we found that the proposed model can create positive profit (see Figure 7). For simplicity, we do not consider tax and transaction fees. From an initial investment of $1000, we get $1159.2 at the end of the test. Specifically, we develop a trading environment from real market data with return TRR (index points) to measure the reward of the test, and the agent is developed from the proposed model. On each of the 120 trading days TD, the agent uses the predicted return PR to choose positions, as sketched below: if the predicted return PR over the next two days (W=2) is positive, we take a long position; if it is negative, we take a short position; if it is around zero, we hold the current position. A position is closed when the profit PFT is bigger than one point or the position has been held for more than one day (T=2).

From a univariate time series analysis perspective, we found that the structural time series models of Facebook Prophet generate stable and high-quality out-of-sample predictions without requiring advanced techniques or data assumptions. In addition, we found that they achieve even higher accuracy on in-sample fitted data when we add an extra regressor to the structural time series models; unfortunately, we cannot create out-of-sample predictions with an extra regressor. In contrast to the structural time series models, the Sequence to Sequence model with an LSTM neural network cannot create stable out-of-sample predictions. As Figure 8 points out, in some cases the Sequence to Sequence model captures the movement of stocks and generates highly accurate predictions with lower error than the structural time series models; however, it cannot consistently capture the movement of stocks in other cases. In terms of computational performance, the Sequence to Sequence model also takes more time for training and predicting than the structural time series models. This leaves a gap in leveraging the state-of-the-art technique for time series prediction. Fortunately, the results from Table 3 suggest that we can fill the gaps of both the structural time series models and the Sequence to Sequence model by adding the output of the structural time series models to the Sequence to Sequence model. Figure 8 shows that the proposed model is stable and its prediction error is almost always the lowest among the three models.

In terms of benchmark limitations, there are some drawbacks in this benchmark. On the one hand, it lacks residual analysis for each prediction; we only compute the Mean Square Error (MSE) for performance comparison. The evaluated results are not consistent enough to be fully accurate, given some outlier points, as Figure 9 points out.
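A minimal sketch of the back-testing decision rule described above; the threshold for "around zero" is an assumption, as the paper does not state its value.

```python
def choose_position(predicted_return: float, eps: float = 1e-4) -> str:
    """Daily position rule: long if the 2-day-ahead predicted return is positive,
    short if negative, hold if it is around zero (|PR| < eps, an assumed threshold)."""
    if predicted_return > eps:
        return "long"
    if predicted_return < -eps:
        return "short"
    return "hold"

def should_close(profit_points: float, days_held: int) -> bool:
    """Close when profit exceeds one index point or the position is held more than a day."""
    return profit_points > 1.0 or days_held > 1
```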
On the other hand, although the results are clear and useful when we use MSE as an indicator of forecasting accuracy, these forecasting evaluation criteria cannot discriminate between forecasting models when the errors of the forecasts are very close to each other. Thus, the Chong and Hendry encompassing test for nested models 49 should be carried out to evaluate the statistical significance of the forecasting models. However, since there is no package in Python supporting the test at this time, the test was not carried out to conduct an appropriate benchmark in terms of statistics. In addition, the Diebold-Mariano (DM) test for comparing predictive accuracy (not for comparing models) cannot be applied, as it only works for non-nested models 50,51 . Hence, we develop a back test for the best model suggested by our benchmark (i.e. the proposed model) with real market data in a different asset class to relax this limitation.
Overall, at the same window size, the combination of structural time series models and the Sequence to Sequence model always achieves higher performance than the pure structural time series models and the Sequence to Sequence model. However, in some cases, the hybrid model cannot capture the movement of a stock when the market is highly volatile.

CONCLUSION AND DISCUSSION
In this work, we discussed a set of procedures to model and predict the prices of stocks in the Vietnam stock market with structural time series models, the Sequence to Sequence model, and the combination of these models. Specifically, we fit stock price data with structural time series models, then use the fitted data as an input feature of the Sequence to Sequence model and generate out-of-sample predictions. We used the output of the models to compare the accuracy of each model. We found that our proposed model can overcome the limitations of each individual model and generate forecasts with higher accuracy. The proposed model also achieves positive results for derivatives trading with real market data. Hence, the combination of Long Short-Term Memory and structural time series models is applicable to the Vietnam stock markets. Furthermore, deep learning is a powerful approach to address time series problems; however, without feature engineering, deep learning generates predictions with lower accuracy than structural time series models. In future work, we will improve the model to achieve real-time prediction for quantitative trading.
In addition, we believe that Generative Adversarial Networks (GANs) are a promising approach to apply.