
Coding a random forest algorithm to predict the S&P500’s daily direction through the Put/Call ratio.

Main idea of the article: We will create a random forest algorithm that predicts the direction of the Put/Call ratio for tomorrow. Using that information, we will try to predict tomorrow’s return on the S&P500. Hence, we will not predict the direction of the equity market directly; rather, we will predict the direction of a time series that is correlated with the S&P500.

First things first: a call option is the right to **buy** a certain asset in the future at a pre-determined price, while a put option is the right to **sell** a certain asset in the future at a pre-determined price.

Hence, when you buy a call you get to buy something later, and when you buy a put you get to sell something later. Every transaction has two sides; therefore, when you’re buying a call or a put, someone else is selling it to you. This brings two other positions that can be taken on options: selling calls and selling puts. **The put/call indicator deals with the buyers of options and measures the number of put buyers divided by the number of call buyers.** That gives us an idea of the sentiment of market participants around the specified equity (in our case it will be the US stock market).
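The ratio itself is just a division of the two volumes. As a toy illustration (the volume figures below are made up, not real CBOE data):

```python
# Hypothetical daily option volumes (illustrative numbers, not real CBOE data)
put_volume = 980_000
call_volume = 1_250_000

# The indicator is simply put volume divided by call volume
put_call_ratio = put_volume / call_volume
print(round(put_call_ratio, 2))  # 0.78
```

A ratio below 1 means call buyers outnumber put buyers, as in this example.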

A **higher** put/call ratio means that there are **more put buyers** (traders are betting on the asset going lower), and a lower put/call ratio signifies **more call buyers** (traders are betting on the rise of the asset). A known way of using this ratio in analyzing market sentiment is by evaluating the following scenarios:

- **A rising ratio signifies a bearish sentiment. Professionals “feel” that the market will go lower.**
- **A falling ratio signifies a bullish sentiment. Professionals “feel” that the market will go up.**

Do option markets offer insights into profitable spot/futures trading? This is what we will see in this back-test. By taking a look at the put/call ratio, *we want to translate the sentiment of market participants into trading opportunities.*

**How will we do that?** We will begin with a simple, well-known strategy where we buy the market when the Put/Call ratio reaches **1.2** and sell (short) the market when it reaches **0.65**. The holding period will be 20 days.

After having performed and analyzed this strategy, we will turn to a more advanced one which is the core of this study. *Is it possible to use a machine learning algorithm to predict tomorrow’s value of the Put/Call ratio and use that information to trade the market?* We’ll see, but first, let us try out the simpler (first) strategy. Note that the historical correlation between the changes in the S&P500 index and the changes in the Put/Call ratio is between **-0.35** and **-0.45**.
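The quoted correlation range can be checked with `np.corrcoef` on the two daily-change series. As a self-contained sketch, here it is on synthetic data constructed to have a similar relationship (the real computation would use the two columns of PCR.xlsx instead):

```python
import numpy as np

# Synthetic stand-ins for daily SPX changes and negatively correlated PCR changes
rng = np.random.default_rng(0)
spx_changes = rng.normal(0, 1, 1000)
pcr_changes = -0.4 * spx_changes + rng.normal(0, 0.9, 1000)

# Sample correlation coefficient, roughly -0.4 by construction
corr = np.corrcoef(spx_changes, pcr_changes)[0, 1]
print(round(corr, 2))
```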

The intuition of this trading strategy is that 0.65 and 1.2 are statistical extremes; therefore, the extreme bearish sentiment around 1.2 is bound to reverse, and a bottoming-out of the market might occur. And vice versa.

We’ll start by importing the necessary libraries. The usual ones necessary for every time series researcher.

```python
# Importing libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
```

Let’s define the necessary variables. The **holding_period** is self-explanatory: it is simply how long we will hold the position after initiation. The **investment** variable is our starting balance. Finally, the support and resistance levels are simply the implied statistical extremes used as triggers.

```python
holding_period = 20
investment = 1000
support = 1.2
resistance = 0.65
```

Now, assuming we have an Excel file called PCR.xlsx, we will import it into Python to be read and further cleaned.

```python
# Importing data
Data = pd.read_excel('PCR.xlsx')
Data = np.array(Data)
```

The Excel file is composed of two columns: the first one is the Put/Call ratio, while the second one is the closing price of the SPX.

Simply put, we will buy whenever the PCR reaches 1.20 and sell whenever it reaches 0.65.

```python
# Buy/Sell conditions
Data = adder(Data, 4)  # Add empty columns to hold signals and returns (adder is defined at the end of the article)

for i in range(len(Data)):
    try:
        if Data[i, 0] >= support and Data[i - 1, 0] < support:
            Data[i + 1, 2] = 1   # Buy signal
        elif Data[i, 0] <= resistance and Data[i - 1, 0] > resistance:
            Data[i + 1, 3] = -1  # Sell signal
        else:
            continue
    except IndexError:
        pass
```

To do a quick check on the number of positions we have, and whether they are longs (buy orders) or shorts (sell orders), we can use the following code:

```python
np.count_nonzero(Data[:, 2])  # Number of long positions
np.count_nonzero(Data[:, 3])  # Number of short positions
```

Giving us **108 **long positions, **23 **short positions, and **131 **positions taken in total between 2006 and 2019 (this is where the free PCR data stops).

Now, to simply calculate the returns over the trades while keeping in mind the holding period of 20:

```python
# Returns
for i in range(len(Data)):
    try:
        if Data[i, 2] == 1:
            Data[i + holding_period, 4] = Data[i + holding_period, 1] - Data[i, 1]
        if Data[i, 3] == -1:
            Data[i + holding_period, 5] = Data[i, 1] - Data[i + holding_period, 1]
    except IndexError:
        pass
```

Having done all of this, we can chart the signals and the equity curve. Note that the back-test was simplistic and more of an approximation, as we didn’t include fees, commissions, contract changes, or risk management tools.

With the performance statistics:

```
Profit factor  = 3.069
Hit ratio      = 70.77 %
Gross return   = 418.0 %
Expectancy     = 32.17
Realized RR    = 1.27
Buy:Sell ratio = 108 : 23
Win% Buy:Sell  = 74.0 % : 52.0 %
```

**Some observations from simply looking at the signal chart**: The number of long signals is noticeably higher than the number of short signals, and the signal quality seems better. Long signals seem to capture well the bottoms of corrections and consolidations. The equity curve of the strategy is somewhat satisfying but still does not fully reflect reality (would we really have known back in 2006 that this strategy would work, had we not already seen the results?). Still, it is interesting to see whether it will continue to behave this way in the future. We also note that, with the 20-day holding period and the tendency of markets to drift upward, this alone could have pushed the strategy toward profitability. Even so, it is rare to see such a beautiful upward-sloping equity curve.

Will this strategy keep working for the foreseeable future? That is hard to say, but what is for sure is that it will take a long time to find out. From the data above, in 13 years we’ve only had 131 trades, amounting to roughly 10 trades per year. Relaxing some conditions would give us about 1 trade per month. That may not be bad for a sub-allocation of a portfolio to actively trade on this strategy, assuming more optimization is done.

The intuition of this trading strategy is that by using a well-liked and well-known algorithm, we want to see if the Put/Call ratio is stationary and predictable enough to give us accurate signals to trade on. We will then compare both strategies and see which one outperformed and which one underperformed.

Before we start defining our algorithm and going through with the back-test, it is interesting to ask how well our model would do in a utopian world where we could predict tomorrow’s Put/Call ratio with **100% accuracy**. **In other words, knowing the negative correlation and being sure of tomorrow’s change in the ratio, how good would we be at trading the US market?**
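As a minimal sketch of that perfect-foresight benchmark (the numbers below are toy values, not the actual data): if tomorrow’s PCR change were known, the negative correlation says to go long the index when the ratio will fall and short when it will rise.

```python
import numpy as np

# Toy arrays standing in for tomorrow's PCR change and the same day's SPX point change
pcr_change = np.array([-0.05, 0.03, -0.02, 0.04, -0.01])
spx_change = np.array([ 8.0, -5.0,  3.0, -6.0,  2.0])

# Perfect foresight: long when the PCR will fall, short when it will rise
position = np.where(pcr_change < 0, 1, -1)
pnl = position * spx_change
print(pnl.sum())  # 24.0 on this toy sample
```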

So, it becomes mouth-watering to see whether we can predict the Put/Call ratio or not after seeing this almost perfect equity curve (gross return around **1,200%** since 2006, trading every day at the open and closing out at the end of the day). Surely, to put things into perspective, we’re not going to obtain such an equity curve, and we’re not even sure the algorithm can predict the PCR that well. **But the excitement of research is what makes us move forward.** *Fact: I started writing this article as I was creating the algorithm; therefore, most of what I have written so far was written before commencing the back-tests. Now, let’s start with the serious stuff.*

**The Random Forest** algorithm is part of a broader family of algorithms called ensemble methods. In layman’s terms, many decision trees form a random forest, which is a type of ensemble method. **But what is the intuition of a decision tree, without going too much into technical details?**

A decision tree is a supervised learning algorithm that is suitable for both classification and regression problems, as it can capture both linear and non-linear relationships.

*Decision trees break down the dataset into smaller subsets using conditions to finally arrive at a probability estimate.* The first node is called the root node, while the ones following it (coming out of it) are referred to as decision nodes. A classic algorithm for building such trees by minimizing entropy is **Iterative Dichotomiser 3** (ID3), invented by Ross Quinlan. The idea here is that when you average out many models, you get a smoother and more accurate forecast. That is why random forest algorithms are well-suited for both classification and regression problems. The random forest algorithm is but a **collection of decision trees**, built to reduce overfitting and average out the results, supposedly giving us better accuracy. Hence, a random forest with a single tree is just a decision tree.
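That averaging claim can be verified directly with sklearn: a forest’s prediction is the mean of its individual trees’ predictions. A small sketch on synthetic data (the variable names here are mine, not from the back-test):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=20).fit(X, y)

# Each of the 20 trees predicts; the forest averages their votes
x_new = np.array([[0.2, -0.1, 0.0]])
votes = [tree.predict(x_new)[0] for tree in forest.estimators_]
print(np.isclose(np.mean(votes), forest.predict(x_new)[0]))  # True
```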

We’ll start by importing the necessary libraries. The first line imports the RandomForestRegressor function from sklearn which we will use below to predict the PCR.

```python
# Importing libraries
from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
```

The **trees** variable is the key to the Random Forest algorithm: the more trees there are, the more complex the model will be. We will choose 20 trees in our example. The **lag** variable is how far into the past we will look when building the prediction. For example, a value of three means that the last 3 PCR change-in-values will be used to predict tomorrow’s change.

```python
# Parameters
investment = 1000
trees = 20
lag = 2
projection = 500  # Size of the out-of-sample test set used below (assumed value)
```

Now, what variables will the Random Forest algorithm use to predict tomorrow’s PCR? As said just above, we will use a technique called Autoregression in which lagged values are used as explanatory variables. This means that today and yesterday’s change in the PCR can help us predict tomorrow’s change (i.e. is it going to go up or down from today’s value).

So, given the below data structure:

Autoregression matrix on the PCR. Within each row, the first columns hold the lagged changes (the most recent observation sits closest to the target column) and the last column is the change to be predicted.

The above matrix means that if we take index 3 (the first row), then the -0.02 can be explained by 0.01, -0.07, and 0.06 given a calculated relationship. In other words, the first three columns are the explanatory variables that should possess some information on the last column.

We will call the first three columns **X** (the explanatory variables) and the last column **Y** (the dependent variable).
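The article doesn’t show how the autoregression matrix is built, so here is one possible sketch (the helper name `lagged_matrix` is mine): each row stacks `lag` consecutive changes followed by the next change as the target.

```python
import numpy as np

def lagged_matrix(series, lag):
    # Each row: `lag` consecutive changes (oldest to newest), then the next change as the target
    rows = [series[i:i + lag + 1] for i in range(len(series) - lag)]
    return np.array(rows)

changes = np.array([0.06, -0.07, 0.01, -0.02, 0.05])
print(lagged_matrix(changes, 3))
```

Each row can then be split into explanatory columns (all but the last) and the target column (the last).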

```python
# Separating the data
Determinant = len(Data.columns)
X = Data.iloc[:, 0:Determinant - 1].values
Y = Data.iloc[:, -1].values
Y = np.reshape(Y, (-1, 1))
```

So, now let’s take the X and Y datasets and divide them into a training set and a test set:

```python
# Splitting the dataset into the Training set and Test set
X_train = X[:-projection, ]  # Training explanatory variables
Y_train = Y[:-projection, ]  # Training targets
X_test = X[-projection:, ]   # Test explanatory variables
Y_test = Y[-projection:, ]   # Test targets
```

The code below defines the Random Forest regressor, chooses the number of trees (through the n_estimators parameter), and fits it on the training set to obtain the implied relationship that we hope continues into the future.

```python
# Fitting the model
regressor = RandomForestRegressor(n_estimators = trees)
regressor.fit(X_train, Y_train.ravel())
```

What we want to do now is take that relationship, give it a dataset similar to X_train, and apply the forecasts. Luckily, we have created our X_test dataset, which is the out-of-sample structure:

```python
# Predicting the Test set results
Prediction = regressor.predict(X_test)  # Predict over X_test
Prediction = np.reshape(Prediction, (-1, 1))
Real = Y_test
```

Doing that gives us two datasets: the real changes in the PCR and our predictions. Let’s concatenate them and calculate the correlation coefficient to get an idea of whether we’ve been doing well or not, although the correlation coefficient doesn’t say anything about directional accuracy (the magnitudes may be large enough to distort it).

```python
Comparison = np.concatenate((Prediction, Real), axis = 1)
np.corrcoef(Comparison[:, 0], Comparison[:, 1])
```

The output will give us

`# 0.343`
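Since correlation can be distorted by magnitudes, a complementary check is the share of days where the predicted and realized changes simply agree in sign. A sketch on hypothetical arrays (not the back-test’s actual outputs):

```python
import numpy as np

# Hypothetical predicted vs. realized PCR changes
prediction = np.array([-0.02, 0.01, 0.03, -0.01, 0.02])
real = np.array([-0.01, 0.02, -0.02, -0.03, 0.01])

# Fraction of observations where the direction was called correctly
hit_rate = np.mean(np.sign(prediction) == np.sign(real))
print(hit_rate)  # 0.8
```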

A correlation of **0.343** is not bad for a start. But how well would that do if we tried trading the S&P500? Let’s continue. The code below simply applies the strategy: it takes our PCR predictions, puts them alongside the changes in the SPX while accounting for time bias, and **then checks whether a negative prediction before the start of the day is associated with a positive SPX change at the end of the day or not.**

```python
DataSPX = pd.read_excel('PCR.xlsx')
DataSPX = np.array(DataSPX)
DataSPX = DataSPX[:, 1]
DataSPX = np.reshape(DataSPX, (-1, 1))
DataSPX = pd.DataFrame(DataSPX)
DataSPX = DataSPX.diff()
DataSPX = DataSPX.iloc[1:, ].values
DataSPX = DataSPX[-projection:, ]

Combined = np.concatenate((Prediction, DataSPX), axis = 1)
Combined = adder(Combined, 4)  # Function to add 4 columns (check the end of the article)
```

```python
for i in range(len(Combined)):
    if Combined[i, 0] < 0:
        Combined[i, 2] = 1   # Long signal: PCR predicted to fall
    if Combined[i, 0] > 0:
        Combined[i, 3] = -1  # Short signal: PCR predicted to rise

for i in range(len(Combined)):
    if Combined[i, 2] == 1 and Combined[i, 1] > 0:
        Combined[i, 4] = Combined[i, 1]
    if Combined[i, 2] == 1 and Combined[i, 1] < 0:
        Combined[i, 4] = Combined[i, 1]
    if Combined[i, 3] == -1 and Combined[i, 1] > 0:
        Combined[i, 5] = -Combined[i, 1]
    if Combined[i, 3] == -1 and Combined[i, 1] < 0:
        Combined[i, 5] = abs(Combined[i, 1])
```

```
Profit factor = 1.209
Hit ratio     = 51.4 %
Gross return  = 83.0 %
Expectancy    = 1.67
Realized RR   = 1.14
```

The good news about the algorithm is that it doesn’t suffer from any directional bias: the long-to-short ratio is around 1.0 (meaning there are about as many buy orders as there are sell orders).

After performing the back-tests, we can make the following observations:

- Perfectly predicting the Put/Call ratio would give us a utopian index strategy.
- **The first strategy’s results** showed that the PCR might add some value in trading when it is near an extreme. The signals as well as the performance show some timing capacity. The long predominance, however, makes the basic PCR strategy biased. I would recommend keeping it as a tool to aid decision-making. For example, a trader wanting to go long the US stock market while the PCR is showing upper-extreme values could treat this as a confirmation of the existing conviction.
- **The second strategy’s results** showed that optimization is necessary to improve the model, as a pure Random Forest with only 20 trees does not do much in predicting the direction of the stock market using the PCR.
- What else? The sklearn library has many regression algorithms. We are not bound to using only the Random Forest algorithm. We can try as many as we want and average out their predictions to create an even better model. The possibilities are limitless.

```python
# The adder function seen above. It adds a number of desired columns to a dataset
def adder(Data, times):
    for i in range(1, times + 1):
        z = np.zeros((len(Data), 1), dtype = float)
        Data = np.append(Data, z, axis = 1)
    return Data
```

Thanks for reading.
