Predicting the Trading Volume of the Google Stock Price using an RNN with LSTM
The dataset includes a daily summary of Google stock prices between 2012 and 2016.
Source: https://www.kaggle.com/akram24/google-stock-price-train
The dataset has six columns: Date, Open, Close, High, Low and Volume. We are going to do a time series analysis using an RNN with LSTM layers to predict the Volume.
For the complete source code (Python Notebook): Source Code
Outline
- Preprocessing data
- Identify correlations in the training data
- Scaling the data
- Fit the model
- Model evaluation
Preprocessing data
First, we need to check whether there are any null values or duplicate values in the dataset. If there are, we need to get rid of them, either by dropping them or by using another imputation method.
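A minimal sketch of that check (the file path is an assumption; thousands=',' is there in case the numeric columns use comma thousands separators):
import pandas as pd

# Path is illustrative; adjust to where the Kaggle CSV lives
df = pd.read_csv('Google_Stock_Price_Train.csv', thousands=',')

# Count missing values per column and fully duplicated rows
print(df.isnull().sum())
print(df.duplicated().sum())

# If any existed, the simplest option is dropping them
df = df.dropna().drop_duplicates()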
There are no null values, and no zero-valued or otherwise useless columns at the moment. Let's look at the distributions of the data.
From the plots, it seems that Open, Low and High might be very highly correlated. Our prediction target, Volume, displays the following distribution pattern.
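A quick way to reproduce those distribution plots, assuming the same df as above:
import matplotlib.pyplot as plt

# Histograms of every numeric column, including the Volume target
df.hist(figsize=(10, 8), bins=50)
plt.tight_layout()
plt.show()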
Identify correlations in the training data
I will compute the correlations between the numerical variables and the target to evaluate this.
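A sketch of that check with pandas and seaborn, assuming the df from above:
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise correlations among the numeric columns
corr = df[['Open', 'High', 'Low', 'Close', 'Volume']].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()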
It turns out the correlation prediction was correct, which is to be expected for daily price columns. For the time being, I will use only the variations in Volume for the predictions.
Scaling the data
Now I will split the dataset into training, validation and test sets of 70%, 10% and 20% respectively. Because this is a time series, the split is chronological: the model is trained on the earlier data and the 2016 data is held out for testing, as sketched below.
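A sketch of the chronological split (the exact boundary indices are assumptions implied by the 70/10/20 proportions):
n = len(df)
train_df = df.iloc[:int(n * 0.7)]               # oldest 70% for training
val_df = df.iloc[int(n * 0.7):int(n * 0.8)]     # next 10% for validation
test_df = df.iloc[int(n * 0.8):]                # most recent 20% (2016) for testing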
Min-max scaling into the range 0–1 will be applied to the Volume column.
from sklearn.preprocessing import MinMaxScaler

# Fit on the training split only to avoid leakage; reuse the same scaler elsewhere
sc = MinMaxScaler(feature_range=(0, 1))
train_scaled = sc.fit_transform(train_df[['Volume']].values)
val_scaled = sc.transform(val_df[['Volume']].values)
Fit the model
For the Volume column, the lookback value will be set to 80, which is approximately two and a half months of data. The training and validation data then have the following shapes:
(800, 80, 1) - Training
(46, 80, 1) - Validation
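Those (samples, lookback, features) shapes come from a sliding-window transformation over the scaled series; a sketch, where make_windows is a hypothetical helper:
import numpy as np

def make_windows(series, lookback=80):
    # Each sample is the previous `lookback` values; the target is the next value
    X, y = [], []
    for i in range(lookback, len(series)):
        X.append(series[i - lookback:i, 0])
        y.append(series[i, 0])
    return np.array(X).reshape(-1, lookback, 1), np.array(y)

X_train, y_train = make_windows(train_scaled)
X_val, y_val = make_windows(val_scaled)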
Training will be done with the following stack of layers:
- LSTM — 50
- Dense — 20
- LSTM — 50
- LSTM — 50
- Dense — 10
- LSTM — 25
- Dense — 5
- Dense — 1
Early stopping is added on the training loss with a patience of 5 epochs, so training stops only once the loss has genuinely plateaued.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

regressor = Sequential()
# Stacked LSTMs return full sequences so the next recurrent layer can consume them
regressor.add(LSTM(units=50, return_sequences=True, input_shape=(X_train.shape[1], 1)))
regressor.add(Dense(units=20))
regressor.add(LSTM(units=50, return_sequences=True))
regressor.add(LSTM(units=50, return_sequences=True))
regressor.add(Dense(units=10))
# The final LSTM returns only its last hidden state
regressor.add(LSTM(units=25))
regressor.add(Dense(units=5))
regressor.add(Dense(units=1))

regressor.compile(optimizer='adam', loss='mean_squared_error', metrics=['mae'])

# Stop when the training loss fails to improve for 5 consecutive epochs
es2 = EarlyStopping(monitor='loss', mode='auto', verbose=1, patience=5)
history = regressor.fit(X_train, y_train, epochs=100, batch_size=8,
                        verbose=1, validation_data=(X_val, y_val), callbacks=[es2])
Model evaluation
The learning curve for the model, including the validation set, is as follows. Note that the validation set is smaller in this case.
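The curve can be reproduced from the History object returned by fit (a sketch):
import matplotlib.pyplot as plt

# Training vs. validation loss per epoch
plt.plot(history.history['loss'], label='training loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('Epoch')
plt.ylabel('MSE loss')
plt.legend()
plt.show()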
And the predictions for the test dataset are as follows:
The model appears somewhat undertrained: it does not reach the upper spikes of the variation in the test set. The R² value on the test set is 0.4025, which is acceptable, but it could be improved by adding another variable to the training.
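A sketch of how that figure could be computed, assuming X_test and y_test were built with the same windowing helper and scaler:
from sklearn.metrics import r2_score

# Predict on the test windows and undo the min-max scaling on both sides
pred = sc.inverse_transform(regressor.predict(X_test))
actual = sc.inverse_transform(y_test.reshape(-1, 1))
print('Test R2:', r2_score(actual, pred))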
The two-variable training results will be added to this article later if time permits.