# Crypto quant trading: Naive Bayes

Previous post: Crypto quant trading: Intro

I didn’t get requests for any specific subject from the last post, so I’m going in the direction that I find interesting and I hope the community will find interesting as well. Let’s do Naive Bayes! You can download the code and follow along.

Just as a reminder, here’s Bayes’ theorem: `P(H|f) = P(H) * P(f|H) / P(f)`. (I’m using `f` for “feature”.)
Here’s conditional probability: `P(A|B) = P(A,B) / P(B)`
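To make the formula concrete, here’s a tiny worked example (the numbers are made up, purely to show the mechanics):

```python
# Bayes' theorem: P(H|f) = P(H) * P(f|H) / P(f)
# Hypothetical numbers, just to illustrate the update.
p_h = 0.55          # prior P(H)
p_f_given_h = 0.6   # likelihood P(f|H)
p_f = 0.5           # marginal P(f)

p_h_given_f = p_h * p_f_given_h / p_f  # posterior P(H|f), comes out to 0.66
```

Seeing a feature that’s more likely under `H` than in general raises our belief in `H`.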

Disclaimer: I was learning Naive Bayes as I was writing this post, so please double-check the math. I’m not using third-party libraries, so I can fully understand how it all works. In fact, I’ll start by describing a thing that tripped me up for a bit.

## What not to do

My original understanding was: Naive Bayes basically allows us to update on various features without concerning ourselves with how all of them interact with each other; we’re just assuming they are independent. So we can just apply it iteratively like so:

```
P(H) = prior
P(H) = P(H) * P(f1|H) / P(f1)
P(H) = P(H) * P(f2|H) / P(f2)
```

You can see how that fails: if we keep updating `P(H)` upwards over and over again, it eventually goes above 1. I did the math the hard way to figure out where I went wrong. If we have two features:

```
P(H|f1,f2) = P(H,f1,f2) / P(f1,f2)
           = P(f1|H,f2) * P(H,f2) / P(f1,f2)
           = P(f1|H,f2) * P(f2|H) * P(H) / P(f1,f2)
           = P(H) * P(f1|H,f2) * P(f2|H) / (P(f1|f2) * P(f2))
Then, because we assume that all features are independent:
           = P(H) * P(f1|H) * P(f2|H) / (P(f1) * P(f2))
```

Looks like what I wrote above. Where’s the mistake? Well, Naive Bayes actually says that all features are independent, conditional on H. So `P(f1|H,f2) = P(f1|H)` because we’re conditioning on H, but `P(f1|f2) != P(f1)` because there’s no `H` in the condition.

One intuitive example of this is a spam filter. Let’s say all spam emails (`H` = email is spam) have random words. So `P(word1|word2,H) = P(word1|H)`, i.e. if we know the email is spam, then the presence of any given word doesn’t tell us anything about the probability of seeing another word. Whereas `P(word1|word2) != P(word1)`, since there are a lot of non-spam emails, where word appearances are very much correlated. (H/t to Satvik for this clarification.)

This is actually good news! Assuming `P(f1|f2) = P(f1)` for all features would be a pretty big assumption. But `P(f1|H,f2) = P(f1|H)`, while often not exactly true, is a bit less of a stretch and, in practice, works pretty well. (This is called conditional independence.)
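To see the distinction numerically, here’s a toy distribution (my own made-up numbers) where two features are independent given `H`, but strongly correlated unconditionally, just like words in the spam example:

```python
# Toy spam-filter distribution. Within each class (spam / not spam), the
# two words appear independently; unconditionally, they're correlated.
p_h = 0.5                          # P(H), H = "email is spam"
p_word = {True: 0.9, False: 0.1}   # P(word appears | class), same for both words

def p_class(h):
    return p_h if h else 1 - p_h

# Marginals, summing over both values of H:
p_f1 = sum(p_class(h) * p_word[h] for h in (True, False))                 # P(f1) = 0.5
p_f1_and_f2 = sum(p_class(h) * p_word[h] * p_word[h] for h in (True, False))
p_f1_given_f2 = p_f1_and_f2 / p_f1  # P(f2) equals P(f1) by symmetry

# P(f1) is 0.5, but P(f1|f2) is 0.82: seeing one word makes the other much
# more likely, even though they're independent within each class.
```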

Also, in practice, you actually don’t have to compute the denominator anyway. What you want is the relative weight you should assign to all the hypotheses under consideration. And as long as they are mutually exclusive and collectively exhaustive, you can just normalize your probabilities at the end. So we end up with:

```
for each H in HS:
    P(H) = prior
    P(H) = P(H) * P(f1|H)
    P(H) = P(H) * P(f2|H)
    etc…
normalize all P(H)’s
```
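Here’s that pseudocode as a runnable sketch (the dict-based structure and names are mine, not from the notebook):

```python
def naive_bayes_posterior(priors, likelihoods, observed):
    """priors: {hypothesis: P(H)}
    likelihoods: {hypothesis: {feature: P(f|H)}}
    observed: names of the features that fired."""
    weights = dict(priors)
    for h in weights:
        for f in observed:
            weights[h] *= likelihoods[h][f]   # multiply in P(f|H)
    total = sum(weights.values())             # normalize at the end,
    return {h: w / total for h, w in weights.items()}  # instead of dividing by P(f)'s

posterior = naive_bayes_posterior(
    priors={"up": 0.5, "down": 0.5},
    likelihoods={"up": {"f1": 0.6, "f2": 0.7}, "down": {"f1": 0.4, "f2": 0.3}},
    observed=["f1", "f2"],
)
# posterior["up"] comes out around 0.78, and the posteriors sum to 1 by construction
```

Because we normalize across mutually exclusive hypotheses, nothing can drift above 1 the way the first version did.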

Which is close to what we had originally, but less wrong… Okay, now that we know what not to do, let’s get on with the good stuff.

## One feature

For now let’s consider one very straightforward hypothesis: the closing price of the next day will be higher than today’s (as a shorthand, we’ll call tomorrow’s bar an “up bar” if that’s the case). And let’s consider one very simple feature: was the current day’s bar up or down?

Note that even though we’re graphing only 2017 onwards, we’re updating on all the data prior to that too. Since 2016 and 2017 have been so bullish, we’ve basically learned to expect up bars under either condition. I guess HODLers were right after all.
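The counting behind the conditional probabilities can be sketched like this (the price series is a stand-in; the notebook uses real daily bars):

```python
# Stand-in closing prices, purely for illustration.
close = [100.0, 101.0, 99.0, 102.0, 103.0, 101.0, 104.0]
up = [b > a for a, b in zip(close, close[1:])]  # feature: was each bar up?

# Pair each day's feature (today up?) with the hypothesis outcome (next day up?)
pairs = list(zip(up, up[1:]))

# Estimate P(f|H) by counting: among days followed by an up bar,
# what fraction were themselves up bars?
p_f_given_h = (sum(1 for today, nxt in pairs if today and nxt)
               / sum(1 for _, nxt in pairs if nxt))
```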

## Using more recent data

So, this approach is a bit suboptimal if we want to try to catch short-term moves (like the entire 2018). Instead, let’s try to look at the most recent data. (Question: does anyone know of a Bayes-like method that weighs recent data more?)

We slightly modify our algorithm to only look at and update on the past N days of data.
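A minimal sketch of that modification: estimate the probability from only the trailing N bars, rather than all history (function name is mine):

```python
def rolling_p_up(up, n):
    # Estimate P(up) from only the trailing n bars.
    # None until we've seen n bars, matching "don't trade yet".
    return [sum(up[i - n + 1:i + 1]) / n if i >= n - 1 else None
            for i in range(len(up))]

p = rolling_p_up([1, 1, 0, 1, 0, 0, 0], n=3)
# p[2] is 2/3; by the end, with three down bars in the window, p[-1] is 0.0
```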

It’s interesting to see that it still takes a while for the algorithm to catch up to the fact that the bull market is over. Just in time to not get totally crushed by the November 2018 drop.
In the notebook I’m also looking at shorter terms. There are some interesting results there, but I’m not going to post all the pictures here, since that would take too long.

As we look at shorter and shorter timeframes, we are increasingly likely to run into a timeframe where there are only up bars (or only down bars) in our history. Then `P(up)=1`, which doesn’t allow us to update. (Some conditional probabilities get messed up too.) That’s why we had to disable the posterior assert in the last code cell. Currently we just don’t trade during those times, but we could instead assume that we’ve always seen at least one up and one down bar. (And, likewise, for all features.)
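The “pretend we’ve always seen at least one of each” fix is add-one (Laplace) smoothing; a sketch:

```python
def smoothed_p_up(up_count, total, pseudo=1):
    # Add `pseudo` imaginary up bars and `pseudo` imaginary down bars,
    # so the estimate can never be exactly 0 or 1.
    return (up_count + pseudo) / (total + 2 * pseudo)

# 5 up bars out of 5: raw estimate would be 1.0, smoothed is 6/7.
# 0 up bars out of 5: raw estimate would be 0.0, smoothed is 1/7.
```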

The results are not different for longer timeframes (as we’d expect), and mostly the same for shorter timeframes. We can re-enable our posterior assert too.

## Bet sizing

Currently we’re betting our entire portfolio each bar. But in theory, our bet should probably be proportional to how confident we are. You could in theory use the Kelly criterion, but you’d need to have an estimate of the size of the next bar. So for now we’ll just try linear scaling: `df["strat_signal"] = 2 * (df["P(H_up_bar)"] - 0.5)`
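The scaling maps `P(up)` in [0, 1] onto a position in [-1, 1] (fully short to fully long). Incidentally, for a symmetric even-money bet the Kelly fraction is `p - (1 - p) = 2p - 1`, so this linear rule coincides with Kelly under that (strong) assumption:

```python
def linear_signal(p_up):
    # Map a probability to a position size: P(up)=1 -> fully long (+1),
    # P(up)=0 -> fully short (-1), P(up)=0.5 -> flat (0).
    return 2 * (p_up - 0.5)
```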

We get lower returns, but a slightly higher SR (Sharpe ratio).

## Ignorant priors

Currently we’re computing the prior for `P(next bar is up)` by assuming that it’ll essentially draw from the same distribution as the last N bars. We could also say that we just don’t know! The market is really clever, and on priors we just shouldn’t assume we know anything: `P(next bar is up) = 50%`.

```
# Compute ignorant priors
for h in hypotheses:
    df[f"P(H_{h})"] = 1 / len(hypotheses)
```

Wow, that does significantly worse. I guess our priors are pretty good.

## Homework

• Examine current features: are they helpful / do they work?

• We’re predicting up bars, but what we ultimately want is returns. What assumptions are we making? What should we consider instead?

• Figure out other features to try.

• Figure out other creative ways to use Naive Bayes.

• It seems like, at the end of a fairly complicated construction process, if you wind up with a model that outperforms, your prior should be that you managed to sneak in overfitting without realizing it rather than that you actually have an edge, right? Even if, say, you wound up with something that seemed safe because it had low variance in the short run, you’d suspect that you had managed to push the variance out into the tails. How would you determine how much testing would be needed before you were confident placing bets of appreciable size? I’m guessing there’s stuff related to structuring your stop losses here I don’t know about.

• Yes, avoiding overfitting is the key problem, and you should expect almost anything to be overfit by default. We spend a lot of time on this (I work w/Alexei). I’m thinking of writing a longer post on preventing overfitting, but these are some key parts:

• Theory. Something that makes economic sense, or has worked in other markets, is more likely to work here.

• Components. A strategy made of 4 components, each of which can be independently validated, is a lot more likely to keep working than one black box.

• Measuring strategy complexity. If you explore 1,000 possible parameter combinations, that’s less likely to work than if you explore 10.

• Algorithmic decision making. Any manual part of the process introduces a lot of possibilities for overfit.

• Abstraction & reuse. The more you reuse things, the fewer degrees of freedom you have with each idea, and therefore the lower your chance of overfitting.

• As an example, consider a strategy like “on Wednesdays, the market is more likely to have a large move, and signal XYZ predicts big moves accurately.” You can encode that as an algorithm: trade signal XYZ on Wednesdays. But the algorithm might make money on backtests even if the assumptions are wrong! By examining the individual components rather than just whether the algorithm made money, we get a better idea of whether the strategy works.

• Is this an instance of the “theory” bullet point then? Because the probability of the statement “trading signal XYZ works on Wednesdays, because [specific reason]” cannot be higher than the probability of the statement “trading signal XYZ works” (the first statement involves a conjunction).

• It’s a combination. The point is to throw out algorithms/parameters that do well on backtests when the assumptions are violated, because those are much more likely to be overfit.

• Yes to everything Satvik said, plus: it helps if you’ve tested the algorithm across multiple different market conditions. E.g. in this case we’ve looked at 2017 and 2018 and 2019, each having a pretty different market regime. (For other assets you might have 10+ years of data, which makes it easier to be more confident in your findings, since there are more crashes + weird market regimes + underlying assumptions changing.)

But you’re also getting at an important point I was hinting at in my homework question:

We’re predicting up bars, but what we ultimately want is returns. What assumptions are we making? What should we consider instead?

Basically, it’s possible that we predict the sign of the bar with 99% accuracy, but still lose money. This would happen if every time we get the prediction right the price movement is relatively small, but every time we get it wrong, the price moves a lot and we lose money.
Stop losses can help. Another way to mitigate this is to run a lot of uncorrelated strategies. Then even if the market conditions become particularly adversarial for one of your algorithms, you won’t lose too much money, because other algorithms will continue to perform well: https://www.youtube.com/watch?v=Nu4lHaSh7D4
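A quick back-of-the-envelope version of that failure mode (the numbers are hypothetical):

```python
# Right 99% of the time for a small gain, wrong 1% of the time for a big loss.
p_right, gain = 0.99, 0.001   # +0.1% on a correct call
p_wrong, loss = 0.01, 0.20    # -20% on a wrong call

expected_return = p_right * gain - p_wrong * loss
# 0.00099 - 0.002 < 0: 99% accuracy, yet negative expectation per bar
```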

• That sounds equivalent to the Kelly criterion: most of your bankroll is in a low-variance strategy and some proportion of your bankroll is spread across strategies with varying amounts of higher variance. Is there any existing work on Kelly optimization over distributions rather than points?

edit: full Kelly allows you to get up to 6 outcomes before you’re in 5th-degree-polynomial land, which is no fun. So I guess you need to choose your points well. http://www.elem.com/~btilly/kelly-criterion/

• I’ve read over briefly both this article and the previous one in the series. Thank you for putting these together!

What I’m curious about in quant trading is the actual implementation. Once you, say, have a model which you think works, how important is latency? How do you make decisions about when to buy / sell? (Partially echoing Romeo’s sentiment about curiosity around stop losses and the actual nitty-gritty of extracting value after you think you’ve figured something out.)

• In this case the latency is not a big issue, because you’re trading on day bars. So if it takes you a few minutes to get into the position, that seems fine. (But that’s something you’d want to measure and track.)
In these strategies you’d be holding a position every bar (long or short). So at the end of the day, once the day bar closes, you’d compute your signal for the next day and then enter that position. If you’re going to do stop-losses, that’s something you’d want to backtest before implementing.

Overall, you’ll want to start trading some amount of capital (possibly 0, which is called paper trading) using any new strategy and track its performance relative to your backtest results + live results. A discrepancy with backtest results might suggest overfit (most likely) or market conditions changing. Discrepancy with live results might be a result of order latency, slippage, or other factors you haven’t accounted for.

• I see. Thanks for providing the additional info!