Second Phase Analysis Summary

The goal of this research

We have observed a lucrative pattern in the stock market. We call it the Bottle Rocket pattern. This pattern usually occurs in the first hour of the trading day.

If you haven't read about this pattern, please click on: The Bottle Rocket Pattern.

We conjecture that Machine Learning can recognize the start of this pattern, and we are researching various Machine Learning packages to test our conjecture.

Analysis completed on August 5, 2018

Below are two tables that summarize the results of our analysis. Figure 1 shows the results completed on August 5, 2018, and Figure 2 shows the results completed on October 17, 2017.

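As you can see, there has been a significant improvement in the metrics, most visibly in recall and F1: the best F1 rose from 0.6929 to 0.940922, and the best recall rose from 0.7823 to 0.96143. This is due to two factors: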

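  1. SMOTE: SMOTE (Synthetic Minority Over-sampling Technique) is used to oversample the dataset. It balances the classes by generating synthetic examples of the minority class. To learn more about SMOTE [click here].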

  2. TPOT: We learned about TPOT when we used Google Colaboratory to do some analysis [click here]. When we saw the first results from TPOT, we knew that we had to pursue AutoML further. Figure 1 shows the results from three AutoML packages. As you can see, TPOT provided the best accuracy, recall, and F1 at this time, although H2O-AutoML scored better on AUC and logloss.
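To make the oversampling idea in item 1 concrete, here is a minimal pure-Python sketch of the core SMOTE step: pick a minority sample, pick one of its nearest minority neighbors, and create a new point somewhere on the line between them. This is a simplified illustration, not the code used in our analysis. The function name `smote_oversample`, its parameters, and the sample rows are all hypothetical; a real project would more likely use a maintained library implementation.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating between neighbors.

    Simplified sketch of the SMOTE idea (hypothetical helper, not a library API).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance (brute force).
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        # Choose one of the k nearest neighbors at random.
        neighbor = minority[rng.choice(others[:k])]
        # Place the new point at a random position on the segment base -> neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical minority-class rows with two features each.
minority_rows = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.9, 2.1]]
new_rows = smote_oversample(minority_rows, n_new=6, k=2)
```

Because every synthetic row lies between two real minority rows, the new points stay inside the region the minority class already occupies. As a general precaution, oversampling is normally applied to the training split only, so that the test metrics are measured on real rows.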

Metric      TPOT       auto-sklearn   H2O-AutoML
accuracy    0.93925    0.91943799     0.88314970
precision   0.92127    0.89705603     1.00000000
recall      0.96143    0.94877461     -
F1          0.940922   0.92219076     0.88748788
AUC         0.939108   0.91924997     0.95078805
mse         0.06075    0.08056201     0.09578897
logloss     2.09826    2.78255718     0.32046170

Figure 1 Results of second phase analysis
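For readers who want to see how the rows in Figure 1 relate to one another, here is a minimal Python sketch that computes accuracy, precision, recall, F1, mse, and logloss from true labels and hard 0/1 predictions. It is a sketch under stated assumptions, not the code used in our analysis: the function name `binary_metrics`, the clipping constant `eps`, and the sample labels are all hypothetical.

```python
import math


def binary_metrics(y_true, y_pred, eps=1e-15):
    """Compute common binary classification metrics from 0/1 labels.

    Hypothetical helper for illustration; y_pred holds hard 0/1 predictions.
    """
    pairs = list(zip(y_true, y_pred))
    n = len(pairs)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    mse = sum((t - p) ** 2 for t, p in pairs) / n

    # Log loss needs probabilities strictly between 0 and 1, so hard
    # predictions are clipped to [eps, 1 - eps] before taking logs.
    total = 0.0
    for t, p in pairs:
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    logloss = total / n

    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "F1": f1, "mse": mse, "logloss": logloss}


# Hypothetical labels: 10 rows, 2 of them misclassified.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

One thing the sketch makes visible: when mse and logloss are computed from hard 0/1 predictions, mse equals 1 - accuracy, and logloss is roughly 34.5 times the error rate, because each wrong prediction costs about -ln(1e-15). The TPOT and auto-sklearn columns in Figure 1 fit that pattern (for example, 1 - 0.93925 = 0.06075), while the H2O-AutoML column does not. The logloss values may therefore not be directly comparable across the three packages; treat that as an inference from the numbers, not something we verified.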

Metric      TensorFlow   sklearn   H2O-RF   H2O-NN
accuracy    0.6417       0.9086    0.9054   0.9193
precision   0.6218       0.5846    1.0000   1.0000
recall      0.7823       0.6726    0.6742   0.6750
F1          0.6929       0.1255    0.3544   0.3903
AUC         0.6368       0.8580    0.8741   0.9203
mse         0.0          0.0       0.0      0.0
logloss     ----         3.1555    0.2354   0.2772

Figure 2 Results of first phase analysis

Conclusion

It is clear to us that we wasted a lot of time doing manual hyper-parameter optimization.

At the end of the first analysis (October 17, 2017), we were very worried that the dataset was the cause of the poor performance. AutoML has shown that the problem is not the dataset but rather the predictors. The main reason TPOT did so well was that it added new predictors by using feature engineering. We now believe that is where we should spend our time in the next phase of the analysis.