There are some topics that grab your imagination and demand immediate attention and interest: things like war, recession, and the World Cup. Unless you share my particular personality quirks, however, the risks of unscientific backtests are unlikely to be among them. This is a shame, as unscientific backtesting poses a pervasive, insidious and largely unquantifiable hazard.

There are many reasons why backtests may turn out to be unscientific, but in principle the mistake is always the same: where a data set contains a lot of noise, and one can try lots of models, it is easy to find a model that explains the noise without actually describing any signal. To give an idea of how serious this can be, we took a random quadratic [eq1.JPG] and added random normal errors with a standard deviation of 5, to make a series of “observations”. We then fitted a polynomial of order 2 and a polynomial of order 6¹ to give us the in-sample² fit. We’ll then use these two models out-of-sample to show the risks of overfitting a model.
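As a rough sketch of this setup (the quadratic’s coefficients and the sample points are hypothetical stand-ins, since the post does not state them), the in-sample fits can be reproduced with numpy:

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Hypothetical quadratic signal; the post's actual coefficients are not given
def signal(x):
    return 0.5 * x**2 - 2.0 * x + 1.0

# "Observations": the signal plus random normal errors with standard deviation 5
x_in = np.linspace(-5, 5, 15)
y_in = signal(x_in) + rng.normal(0.0, 5.0, size=x_in.shape)

# Fit a quadratic (3 parameters) and an order-6 polynomial (7 parameters)
quad = Polynomial.fit(x_in, y_in, deg=2)
sext = Polynomial.fit(x_in, y_in, deg=6)

# The extra parameters let the order-6 fit chase the noise, so its in-sample
# residual can never exceed the quadratic's (nested least squares)
print(np.sum((y_in - quad(x_in))**2) >= np.sum((y_in - sext(x_in))**2))  # True
```

The order-6 fit always looks at least as good in-sample, which is precisely what makes it seductive.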


We then test these two models out-of-sample. Since the signal is quadratic, the quadratic model fares pretty well:


However, the order 6 polynomial, driven by a 0.000561x⁶ term, fares somewhat less well:


The complex model predicts a move in the wrong direction that is too large by more than two orders of magnitude. It is quite a spectacular failure – roughly akin to predicting a 2-0 win for a team that goes on to lose 7-1. This is not surprising: with a quadratic signal, four of the seven parameters could only fit noise. But in the real world we never know for sure what the signal really is, so this shows how dangerous overfitting models can be³.
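The out-of-sample blow-up is easy to reproduce end-to-end. In this self-contained sketch, the quadratic’s coefficients, the sample points and the extrapolation range are all hypothetical (the post’s exact numbers aren’t given), but the qualitative failure is robust: outside the fitted range, the noise-driven high-order terms dominate.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

def signal(x):                       # hypothetical quadratic signal
    return 0.5 * x**2 - 2.0 * x + 1.0

x_in = np.linspace(-5, 5, 15)        # in-sample range
y_in = signal(x_in) + rng.normal(0.0, 5.0, size=x_in.shape)

quad = Polynomial.fit(x_in, y_in, deg=2)
sext = Polynomial.fit(x_in, y_in, deg=6)

# Out-of-sample: extrapolate beyond the fitted range, where the order-6
# model's high-order terms (fitted purely to noise) take over
x_oos = np.linspace(10, 50, 9)
quad_err = np.max(np.abs(quad(x_oos) - signal(x_oos)))
sext_err = np.max(np.abs(sext(x_oos) - signal(x_oos)))

print(f"worst quadratic error: {quad_err:,.0f}")
print(f"worst order-6 error:   {sext_err:,.0f}")  # typically vastly larger
```

The quadratic degrades gracefully; the order-6 model, anchored to noise, diverges as soon as it leaves the data it memorised.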

It is important to realise that this is not purely a quirk of adding excess parameters – at the most basic level, using more parameters really just means trying more models. Overfitting is an example of the more basic statistical concept, survivorship bias; the fundamental point is that, if you try enough models, one will seem to work simply by capturing the noise in a data set. Moreover, capturing the noise can lead to very wrong predictions out-of-sample. In finance, even with simple models, there are typically many parameters to fit simply due to the scope of the investible universe; and when backtesting a strategy, not only every factor that is used in the strategy, but also every factor that is considered and dismissed, represents more freedom in the model and, therefore, more scope to overfit.
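The “try enough models” point can be simulated directly. In this sketch (all numbers are illustrative assumptions), 1,000 “strategies” with no edge at all – just pure-noise daily returns – are ranked by in-sample Sharpe ratio. The winner always looks impressive, purely by capturing noise:

```python
import numpy as np

rng = np.random.default_rng(42)

n_strategies, n_days = 1000, 250  # illustrative: 1,000 models, one year of daily data

# Zero-edge "strategies": every return series is pure noise, so the true
# Sharpe ratio of every strategy is exactly 0
returns_is = rng.normal(0.0, 0.01, size=(n_strategies, n_days))   # in-sample year
returns_oos = rng.normal(0.0, 0.01, size=(n_strategies, n_days))  # out-of-sample year

def sharpe(r):
    """Annualised Sharpe ratio of daily returns (risk-free rate assumed zero)."""
    return r.mean(axis=-1) / r.std(axis=-1) * np.sqrt(250)

is_sharpes = sharpe(returns_is)
best = int(np.argmax(is_sharpes))

# The "survivor" looks great in-sample, then reverts to nothing out-of-sample
print(f"best in-sample Sharpe: {is_sharpes[best]:.2f}")         # typically around 3
print(f"same strategy, out-of-sample: {sharpe(returns_oos[best]):.2f}")
```

The maximum of 1,000 roughly standard-normal Sharpe estimates sits around 3, even though nothing has any edge; the same selection effect operates silently across all the managers sharing one data set.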

Because of this, survivorship bias is a serious issue, especially in finance, where a vast number of different models and strategies are used. At its heart, the industry is lots and lots of very clever people trying out an enormous number of variations on a huge variety of ideas, all using basically the same data set. Even if one manager only runs a single backtest, this should really be seen in the context of all the backtests run by the whole gamut of managers, banks and other market participants; the number of trials clearly blows up very quickly. So how can you actually test out ideas, and how can you be confident that a strategy works? Ultimately, you will always be constrained, but the good news is that this problem can be mitigated, and I discuss how in my next blog.


¹ i.e. we fit a function of the form ax² + bx + c, and another of the form ax⁶ + bx⁵ + … + g
² In-sample refers to the data from which the model is built; out-of-sample is any other data on which the model can be tested.
³ There is a fair amount of cherry-picking in this demonstration: for example, by choosing a quadratic as the underlying we built in an explicit drift, rather than simulating independent, identically distributed variables; and by allowing terms of order 6, we allowed more spectacular errors. However, as a meta-example of backtest overfitting (in that we are overfitting a test of how dangerous overfitting is), it is arguably more powerful.
Please note that all opinions expressed in this blog are the author’s own and do not constitute financial, legal or investment advice. Click here for full disclaimer.


Author: Alex White

Alex joined Redington in 2011 as part of the ALM team. He is Head of ALM Research, which involves proactively modelling new asset classes and strategies, building and testing new models for new business lines, and continuously reviewing current models and assumptions. In addition, he designs technical solutions for clients who require a bespoke offering to better solve the problem they are facing. Alex is a Fellow of the Institute and Faculty of Actuaries and holds an MA (Hons) in Mathematics from Robinson College, Cambridge.