First Phase Analysis Summary

The goal of this research

We observe a lucrative pattern in the stock market. We call it the Bottle Rocket pattern. This pattern usually occurs in the first hour of the trading day.

If you haven't read about this pattern, please click on: The Bottle Rocket Pattern.

We conjecture that Machine Learning can recognize the start of this pattern, and we are researching various Machine Learning packages to test our conjecture.

Analysis completed on October 28, 2017

Below is a table summarizing the results of our analysis. The table, shown in Figure 1, compares several metrics across TensorFlow, Scikit-Learn, and H2O. The Random Forest models perform well on the Bottle Rocket dataset; note, however, that TensorFlow did not offer a Random Forest API. One reason we like the Random Forest model is that the dataset does not need to be standardized.
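To illustrate why standardization is unnecessary for Random Forests, the sketch below (using scikit-learn on synthetic data, not our Bottle Rocket dataset) trains the same forest on raw and standardized features. Tree splits depend only on the ordering of feature values, and standardization preserves that ordering, so the predictions come out identical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Bottle Rocket dataset (illustrative only)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Same forest, same seed, raw vs. standardized features
rf_raw = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
X_std = StandardScaler().fit_transform(X)
rf_std = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_std, y)

# Standardization is an order-preserving (affine) transform of each
# feature, so the chosen splits partition the data identically.
same = np.array_equal(rf_raw.predict(X), rf_std.predict(X_std))
print(same)
```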

The results are mixed.

We observed a glaring deficiency in TensorFlow. Its high-level API (we used TF.Learn) did not offer a class-balancing option such as H2O's balance_classes=True or Scikit-Learn's class_weight='balanced'. This is very important because the Bottle Rocket dataset is unbalanced (13:1), and we believe it explains why TensorFlow does so poorly.
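For reference, this is what the class-balancing option looks like in Scikit-Learn (the H2O equivalent is balance_classes=True on its estimators). The data below is synthetic; only the 13:1 ratio mimics our dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic 13:1 imbalanced dataset mimicking the Bottle Rocket ratio
X, y = make_classification(n_samples=2800, weights=[13/14, 1/14],
                           flip_y=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight='balanced' reweights samples inversely to class frequency,
# the Scikit-Learn analogue of H2O's balance_classes=True
rf = RandomForestClassifier(class_weight='balanced', random_state=0)
rf.fit(X_tr, y_tr)
rec = recall_score(y_te, rf.predict(X_te))
print(rec)
```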

This is our first attempt at finding a model for our dataset, and we are fortunate that the results are as good as they are. They can be much better, and more work is needed. Machine Learning research is like walking into quicksand: you get swallowed up very quickly.

Metric      TensorFlow   sklearn   H2O-RF   H2O-NN
accuracy    0.6417       ----      ----     ----
precision   0.6218       0.5846    1.0000   1.0000
recall      0.7823       0.6726    0.6742   0.6750
F1          0.6929       0.1255    0.3544   0.3903
AUC         0.6368       0.8580    0.8741   0.9203
mse         0.0          0.0       0.0      ----
logloss     ----         3.1555    0.2354   0.2772

Figure 1: Results of the first-phase analysis
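Each metric in Figure 1 can be computed from a model's test-set predictions with scikit-learn. The sketch below uses dummy labels and probabilities purely to show the calls; none of the numbers correspond to our results.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, mean_squared_error,
                             log_loss)

# Dummy labels and predicted probabilities, for illustration only
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.2, 0.8, 0.7, 0.3, 0.05, 0.9])
y_pred = (y_prob >= 0.5).astype(int)  # hard labels at the 0.5 threshold

print("accuracy ", accuracy_score(y_true, y_pred))
print("precision", precision_score(y_true, y_pred))
print("recall   ", recall_score(y_true, y_pred))
print("F1       ", f1_score(y_true, y_pred))
print("AUC      ", roc_auc_score(y_true, y_prob))
print("mse      ", mean_squared_error(y_true, y_prob))
print("logloss  ", log_loss(y_true, y_prob))
```

Note that AUC, mse, and logloss are computed from the predicted probabilities, while the other metrics use the thresholded labels.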


At the end of this analysis, we were worried that the dataset itself was the cause of the poor performance. During this time, we did no work on feature engineering or oversampling.
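A simple next step would be oversampling the minority class before training. One minimal approach, sketched here with scikit-learn's resample utility on toy data (more sophisticated methods such as SMOTE also exist), duplicates minority rows until the classes are even:

```python
import numpy as np
from sklearn.utils import resample

# Toy imbalanced dataset: 13 majority rows and 1 minority row
X = np.arange(28).reshape(14, 2)
y = np.array([0] * 13 + [1])

# Resample the minority class (with replacement) up to the majority count
X_min, y_min = X[y == 1], y[y == 1]
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=int((y == 0).sum()), random_state=0)
X_bal = np.vstack([X[y == 0], X_min_up])
y_bal = np.concatenate([y[y == 0], y_min_up])
print(np.bincount(y_bal))  # both classes now have 13 rows
```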

We spent a lot of time on hyper-parameter optimization, and we were disappointed that the results were not better.
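Our search over hyper-parameters was largely manual. A more systematic alternative, shown here with scikit-learn's GridSearchCV over a tiny illustrative grid (a real search would cover far more values), automates the process with cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the real dataset
X, y = make_classification(n_samples=400, random_state=0)

# Small illustrative grid; each combination is scored by 3-fold CV AUC
grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid,
                      scoring="roc_auc", cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```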

The first-phase analysis was essentially trial and error, which is very time-consuming. We needed a better approach to obtain better results.