Table Of Contents
Recap
Refined C.O.R.E. Recipe
Benefits Of Using Synthetic Data
Infinity Years Of Data
Data On Demand
How To Use Synthetic Price Data
When Using Real Data Is Acceptable
Synthetic Data Generation in General
Geometric Brownian Motion
Ornstein-Uhlenbeck
Using Sine Waves
No No-Code Version
Next Week
Recap
Last week we thought about how we could improve our backtesting approach in the strategy design process to reduce the risks of data mining, and we came to the conclusion that an idea-first approach would be far more helpful. On top of that, it might be a good idea to use synthetic prices instead of real market data when fitting our trading rules.
This week, we're going to take a look at how to create synthetic prices and their benefits for backtesting purposes. Note that different stages of the design process have different demands. While other types of synthetic data can be useful for things like portfolio optimization, this article focuses on using synthetic prices to configure your trading rules without the risk of overfitting.
But let's first revisit our C.O.R.E. recipe:
Refined C.O.R.E. Recipe
You can find the template here. If you want to change the sheet, you need to make a copy of it in your own Drive.
As we mentioned last week, we recommend beginners avoid stationary mean reversion type strategies due to their negative skew profiles, increased capital requirements and difficulty in modeling the 'fair' value of cointegrated pairs. That's why we will focus on trend following for now.
To reiterate: the main idea behind trend following is that humans tend to exhibit herding behaviour, that information gets absorbed with different reaction times across market participants, and that there's no apparent reason for price not to continue going up or down without a shift in momentum.
Our primary source of returns is the risk factor called momentum risk.
If you're curious about other possible sources of excess returns and risk factors to target, you should definitely check out Antti Ilmanen's book Expected Returns: An Investor's Guide to Harvesting Market Rewards. It provides a very comprehensive and technical overview of different kinds of risk premia and - as the title states - how to harvest them.
Let's add the trading style's rationale, based on last week's discussion, to our recipe.
Later down the line we're going to revisit the O - Opening Rule section. In practice we'll probably adopt more than one rule. Trend following behaviour can be identified in multiple ways: via momentum, breakouts, and divergences, to name a few. In addition, we'll probably use several variations of each trading rule that aren't too correlated, to diversify across the rules (not across the instruments).
Let's keep it simple and use just one MA crossover (8/32) as opening and closing rule for now so we can focus on crafting a synthetic price data generator.
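As a concrete sketch, an 8/32 MA crossover signal can be computed like this. Note that `ma_crossover_signal` is a hypothetical helper for illustration, not code from this week's repository:

```python
import numpy as np

def ma_crossover_signal(prices, fast=8, slow=32):
    """Return +1 (long) where the fast moving average is above the slow one,
    -1 (short) where it is below, and 0 before the slow window has filled."""
    prices = np.asarray(prices, dtype=float)
    signal = np.zeros(len(prices))
    for i in range(slow - 1, len(prices)):
        fast_ma = prices[i - fast + 1 : i + 1].mean()  # mean of last 8 prices
        slow_ma = prices[i - slow + 1 : i + 1].mean()  # mean of last 32 prices
        signal[i] = 1.0 if fast_ma > slow_ma else -1.0
    return signal

# A steadily rising series ends up long, a falling one short.
print(ma_crossover_signal(np.linspace(1.0, 2.0, 100))[-1])  # 1.0
print(ma_crossover_signal(np.linspace(2.0, 1.0, 100))[-1])  # -1.0
```

Because the same function serves as both the opening and the closing rule here, a sign flip in the signal is simultaneously an exit and a new entry.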
Benefits Of Using Synthetic Data
For now we define synthetic data simply as a series of randomly generated prices. Its purpose is to decouple our trading rule design from real world market data with the goal of eliminating the risk of overfitting to "lucky draws" from the past. But there are even more benefits when using synthetic data:
Infinity Years Of Data
If you only have 10 years of historical prices (usually a solid place to start) and split them 50/50 into In-Sample and Out-Of-Sample data sets, you end up with only 5 years of training data. Depending on your trading style this might not be enough to be statistically significant. When using the Walk-Forward method described in last week's issue, you need to constantly re-split your data while ensuring that your data sets not only pick up market structure changes but are also big enough to yield statistically robust results. This constant re-splitting leads to an increased demand for data points.
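To make the re-splitting concrete, here's a minimal sketch of a rolling walk-forward splitter. The function name and the window sizes in the example are illustrative assumptions, not a prescription:

```python
def walk_forward_splits(n, train_size, test_size):
    """Return a list of (train_indices, test_indices) pairs that roll
    forward through n observations, advancing one test window at a time."""
    splits = []
    start = 0
    while start + train_size + test_size <= n:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        splits.append((train, test))
        start += test_size  # slide forward by one test window
    return splits

# 10 "years" of daily data: train on 3 years, test on the following year.
splits = walk_forward_splits(n=2520, train_size=756, test_size=252)
print(len(splits))  # 7 rolling windows
```

With synthetic data the `n` above stops being a constraint: you can simply generate as many observations as your windows require.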
Having your own price series generator means you're able to create endless amounts of data sets. This is especially helpful when you haven't scraped together an extensive database of real-world prices yet, or when you want to commit as little of your liquidity to data retrieval as possible because your bankroll is rather limited and the data in question is expensive. You of course need to validate your findings with real data at some point, but this gives you something you can work with immediately, without requiring an investment just yet.
By creating lots of different sets of synthetic prices with different characteristics, you can observe how sensitive your parameters are to those characteristics, which can lead to more robust systems.
The goal is to create enough datasets to represent a wide range of market conditions, including strong persistent trends, weak trends, and no trends (chop), mixed with periods of high, low and average volatility as measured by annualised standard deviation.
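One way to organise such a collection is a small parameter grid of drift and volatility levels. The names and values below are illustrative assumptions, not calibrated figures:

```python
from itertools import product

# Hypothetical regime grid: combine drift ("trend strength") levels with
# annualised volatility levels to cover trending, flat, calm and turbulent
# markets in one pass.
drifts = {"strong_up": 0.15, "weak_up": 0.05, "flat": 0.0, "down": -0.10}
vols = {"low": 0.10, "average": 0.20, "high": 0.40}

scenarios = [
    {"name": f"{d_name}_{v_name}", "mu": mu, "sigma": sigma}
    for (d_name, mu), (v_name, sigma) in product(drifts.items(), vols.items())
]
print(len(scenarios))  # 12 scenarios
```

Each scenario dict can then be fed to a price generator, and generating several seeds per scenario gives many independent draws of the same market condition.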
Data On Demand
When creating synthetic price series yourself, you also gain the power to decide which properties the data has. You can decide how much volatility it experiences, the strength of a trend's drift, how many trends there are, what direction the trends go, how long they last and even what the skew profile of the series looks like. This enables you to design your model across multiple different market conditions, even if these conditions weren't available in real, live markets.
Controlling the volatility makes it easy to extensively test volatility targeting and position sizing. Risk management is a big part of your trading strategy, so getting a better understanding of your strategy's overall risk profile is a must.
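As a rough sketch of what such a volatility-targeting test might look like, position size can be scaled inversely to recent realised volatility. The helper `vol_target_position`, its lookback and the 20% target are illustrative assumptions:

```python
import numpy as np

def vol_target_position(prices, capital, target_vol=0.20, lookback=25,
                        periods_per_year=252):
    """Units of the instrument to hold so that the position's annualised
    cash volatility is roughly capital * target_vol."""
    prices = np.asarray(prices, dtype=float)
    returns = np.diff(prices) / prices[:-1]
    daily_vol = returns[-lookback:].std()
    ann_vol = daily_vol * np.sqrt(periods_per_year)
    cash_vol_per_unit = prices[-1] * ann_vol  # annualised cash vol of one unit
    return (capital * target_vol) / cash_vol_per_unit

# Calmer prices allow a larger position for the same risk budget.
calm = [100 + (i % 2) for i in range(30)]       # oscillates 100/101
wild = [100 + 10 * (i % 2) for i in range(30)]  # oscillates 100/110
print(vol_target_position(calm, 100_000) > vol_target_position(wild, 100_000))  # True
```

Running this against synthetic series with known, controlled volatility lets you verify that the sizing reacts exactly as intended before any real capital is involved.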
How To Use Synthetic Price Data
Our trading rules will generate a continuous series of forecasts, which we then backtest against our extensive collection of datasets with the purpose of fitting them. This way we're capturing the general performance and behaviour of the strategy itself and not just its performance in specific markets.
Other than evaluating general performance, holding period and trading costs, we'll answer most if not all of the questions discussed in last week's issue. Is our model picking up trends and winning those bets? Can we confirm that we're losing money during chop? If anything is amiss, we need to start over!
When Using Real Data Is Acceptable
It's worth noting that backtesting with real data is not a cardinal sin and, if done right, can lead to great results too! The mandatory part is that you still start your process idea-first! Taking the detour of understanding your returns and why they should continue is non-negotiable. Even then you might fall victim to human biases when looking for explanations, but it's still a lot better than pure data mining.
Different trading styles like High-Frequency Trading might even get better and more realistic results because of the sheer amount of data available. Those types of trading strategies can refit regularly as market structure evolves while long-term EOD strategies can take years to gather enough data to be significant again.
Synthetic Data Generation in General
When constructing a synthetic price series we mainly need to take two things into account: an underlying process that represents the general behaviour of price, and an element of randomness on top of it. The behavioural part is deterministic, meaning you're able to predict its behaviour based on known factors. The randomness adds a stochastic element to it.
The deterministic part for trend following is the drift, which represents the average rate of price movement in one direction. For stationary mean reversion it is the pull towards the price's mean: whenever prices deviate from it, they get pulled back after some time. We're going to add normally distributed Gaussian noise on top of these processes as the element of randomness.
We also need to ensure that our price series share common characteristics with real-world financial instruments, like prices not going below zero.
Let's take a look at some distinct applications of these general concepts.
Geometric Brownian Motion
Geometric Brownian Motion (GBM) is a mathematical model that's frequently used in finance to simulate stock prices. It is defined by the stochastic differential equation

dSt = μ St dt + σ St dWt

where

St: is the price at time t,
μ: the drift coefficient,
σ: the volatility,
Wt: the so-called Wiener process.

μ represents the deterministic part and specifies the drift of the generated price series, while the Wiener process uses a Gaussian (normal) distribution scaled by the volatility (σ) to add randomness.
Let's look at an analogy to further clarify. Imagine you're trying to predict the path of a paper airplane. The underlying process, the drift, is the direction in which you initially throw the plane. However, the plane's path won't be perfectly straight due to random factors like wind gusts and air resistance. These random factors represent the volatility in the GBM model.
Implemented in python it looks like this:
import numpy as np

def generate_gbm_price_series(S0, mu, sigma, T, N):
    """
    Generates a price series using Geometric Brownian Motion.

    Parameters:
    S0    : float : initial price
    mu    : float : drift coefficient
    sigma : float : volatility coefficient
    T     : float : total time in years
    N     : int   : number of time steps
    """
    dt = T / N
    t = np.linspace(0, T, N)
    W = np.random.standard_normal(size=N)
    W = np.cumsum(W) * np.sqrt(dt)  # cumulative Wiener process
    X = (mu - 0.5 * sigma**2) * t + sigma * W
    S = S0 * np.exp(X)
    return S, t

# Example usage
S0 = 1       # initial price
mu = 0.1     # drift
sigma = 0.2  # volatility
T = 5.0      # total time: 5 years
N = 365      # number of time steps over the whole period
price_series, time_series = generate_gbm_price_series(S0, mu, sigma, T, N)
plot_price_series(time_series, price_series, 'gbm_price_series')
and produces a series of prices that looks something like this when plotted using our new plot_price_series() function from plots.py in this week's GitHub repository:
It's important to note that the assumption of normally distributed returns is a simplification of reality. Real-world stock returns often exhibit fat tails, meaning that extreme price movements are more frequent than predicted by a normal distribution. On top of that, GBM assumes constant volatility, which also isn't how the real world behaves.
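To illustrate the fat-tail point, one common tweak (our own assumption here, not part of standard GBM) is to draw the increments from a Student-t distribution rescaled to unit variance. This keeps σ's meaning while producing more extreme moves:

```python
import numpy as np

def generate_fat_tailed_series(S0, mu, sigma, T, N, df=4, seed=None):
    """GBM-style series with Student-t increments instead of Gaussian ones.

    The t-increments are rescaled to unit variance (a t distribution with
    df > 2 has variance df / (df - 2)), so sigma keeps its interpretation,
    but extreme moves occur more often than under a normal distribution.
    """
    rng = np.random.default_rng(seed)
    dt = T / N
    t = np.linspace(0, T, N)
    increments = rng.standard_t(df, size=N) / np.sqrt(df / (df - 2))
    W = np.cumsum(increments) * np.sqrt(dt)
    X = (mu - 0.5 * sigma**2) * t + sigma * W
    return S0 * np.exp(X), t

prices, times = generate_fat_tailed_series(1.0, 0.1, 0.2, 5.0, 365, seed=42)
```

Lowering `df` fattens the tails; as `df` grows large, the increments approach the Gaussian case and the series behaves like plain GBM again.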
Ornstein-Uhlenbeck
Another commonly used model to simulate prices is the Ornstein-Uhlenbeck model. It is a great fit for modeling a stationary, mean-reverting price series due to its use of a long-term mean μ and a speed of reversion to that mean, theta (θ):

dSt = θ(μ - St) dt + σ dWt

where:

St: is the value of the underlying process at time t,
θ: is the speed of reversion to its mean,
μ: the long-term mean,
σ: the volatility,
Wt: the Wiener process.

Again, the Wiener process adds Gaussian noise to the prices, while θ(μ - St)dt models the deterministic part of the series, reverting it back to its mean.
Imagine a ball attached to a spring. The natural resting position of the ball is the mean price. If you pull the ball away from this position (representing a price deviation from the mean), the spring creates a force to pull the ball back. The further you pull the ball, the stronger the spring's force. This spring force is analogous to the deterministic component of the Ornstein-Uhlenbeck process.
While this analogy helps visualise the concept, remember that real-world price movements are far more complex and influenced by numerous factors not accounted for in this simplified model.
Here's the python code:
import numpy as np

def generate_ou_price_series(S0, theta, mu, sigma, T, N):
    """
    Generates a price series using the Ornstein-Uhlenbeck process.

    Parameters:
    S0    : float : initial price
    theta : float : speed of reversion
    mu    : float : long-term mean
    sigma : float : volatility coefficient
    T     : float : total time in years
    N     : int   : number of time steps
    """
    dt = T / N
    t = np.linspace(0, T, N)
    S = np.zeros(N)
    S[0] = S0
    for i in range(1, N):
        dW = np.random.normal(0, np.sqrt(dt))  # Wiener increment
        S[i] = S[i - 1] + theta * (mu - S[i - 1]) * dt + sigma * dW
    return S, t

# Example usage
S0 = 1        # initial price
theta = 0.8   # speed of reversion (increase for faster reversion)
mu = 1        # long-term mean
sigma = 0.15  # volatility (adjust for more or less deviation)
T = 5.0       # total time: 5 years
N = 365       # number of time steps over the whole period
price_series, time_series = generate_ou_price_series(S0, theta, mu, sigma, T, N)
plot_price_series(time_series, price_series, 'ou_price_series')
and its graph:
Again, the usual shortcomings and pitfalls of assuming normally distributed returns for financial instruments apply.
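A handy by-product of θ is the half-life of a deviation: ignoring the noise term, a displacement from μ decays like exp(-θt), so it halves after ln(2)/θ. A small sketch (the helper name is our own):

```python
import numpy as np

def ou_half_life(theta):
    """Expected time for a deviation from the mean to decay by half.

    In the deterministic part dS = theta * (mu - S) dt, a deviation decays
    proportionally to exp(-theta * t), so it halves after ln(2) / theta.
    """
    return np.log(2) / theta

print(ou_half_life(0.8))  # ≈ 0.87 time units
```

This gives an intuitive feel for the `theta = 0.8` used above: deviations lose half their size in a bit under one time unit, which is useful when matching synthetic mean-reversion speed to the holding periods you want to test.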
Using Sine Waves
Another useful approach is to use a sine wave as the underlying process and then add Gaussian noise on top. Rob Carver wrote about it on his blog, highlighting the benefit of being able to specify trend lengths yourself. This is particularly useful when analyzing which lookbacks work best on what trend lengths.
Example implementation:
import numpy as np

def generate_sine_series(N, T, X, sigma):
    """
    Generates a series of length N: a sine-wave trend of amplitude X,
    plus Gaussian noise W scaled to std_dev (sigma * amplitude).

    Parameters:
    N     : int   : length of the series
    T     : int   : length of the trend cycle
    X     : float : amplitude of the trend
    sigma : float : volatility
    """
    std_dev = sigma * X
    W = np.random.standard_normal(N) * std_dev
    half_amplitude = X * 0.5
    trend_step = X / T
    cycles = int(np.ceil(N / T))
    trend_up = list(np.arange(start=-half_amplitude, stop=half_amplitude, step=trend_step))
    trend_down = list(np.arange(start=half_amplitude, stop=-half_amplitude, step=-trend_step))
    trend = (trend_up + trend_down) * cycles
    trend = trend[:N]
    combined_price = W + trend
    # Apply exponential transformation to ensure positive prices
    positive_price_series = np.exp(combined_price)
    return positive_price_series, np.arange(N)
# Example usage SINE
N = 365 # length of the series (e.g., trading days in a year)
T = 50 # length of the trend cycle
X = 1.0 # amplitude of the trend
sigma = 0.2 # scale of the volatility
price_series, time_series = generate_sine_series(N, T, X, sigma)
plot_price_series(time_series, price_series, 'sine_price_series')
and its plot:
We're going to have a more detailed look at how these approaches differ and when to use which in future issues.
No No-Code Version
This week there's no No-Code-Version available. We will supply you with the random price data when we get to the phase of actually backtesting our forecasts.
Next week
We're almost done setting up everything to fire up our trading rules and mold them into an algorithm to test. At some point we will need to consult real market data, at the very least to confirm that our model is working as intended and that our synthetic data is behaving as we want it to. So next week we're going to build a comprehensive database of free real-world market data.
- Hōrōshi バガボンド