Kelly betting vs expectation maximization

People talk about Kelly betting and expectation maximization as though they're alternate strategies for the same problem. In fact, each is the best option for a different class of problems. Understanding when to use Kelly betting and when to use expectation maximization is critical.

Most of the ideas for this came from Ole Peters's ergodicity economics writings. Any mistakes are my own.

The parable of the casino

Alice and Bob visit a casino together. They each have $100, and they decide it’ll be fun to split up, play the first game they each find, and then see who has the most money. They’ll then keep doing this until their time in the casino is up in a couple days.

Alice heads left and finds a game that looks good. It’s double or nothing, and there’s a 60% chance of winning. That sounds good to Alice. Players buy as many tickets as they want. Each ticket resolves independently from the others at the stated odds, but all resolve at the same time. The tickets are $1 each. How many should Alice buy?

Bob heads right and finds a different game. It’s a similar double or nothing game with a 60% chance of winning. He has to buy a ticket to play, but in Bob’s game he’s only allowed to buy one ticket. He can pay however much he wants for it, then the double or nothing is against the amount he paid for his ticket. How much should he pay for a ticket?

Alice’s game is optimized by an ensemble average

Let's estimate the amount of money Alice will win as a function of how many tickets she buys. We don't know how each ticket resolves, but we can say that approximately 60% of the tickets will be winners and 40% will be losers (though we don't know which tickets will be which). This is just calculating the expected value of the bet.

If she buys $n$ tickets, she'll make $1.2n$ dollars. This is a linear function that monotonically increases with $n$, so Alice should buy as many as she can.

Since she has $100, she can buy 100 tickets. That means she will probably come away with $120. There will be some variance here. If tickets were cheaper (say only a penny each), then she could lower her variance by buying more tickets at the lower price.
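If you want to watch this convergence happen, here's a minimal simulation sketch (the function name and trial count are my own choices, not part of the parable):

```python
import random

def play_alice(n_tickets: int) -> float:
    """Buy n_tickets at $1 each; each pays $2 independently with probability 0.6."""
    return sum(2.0 for _ in range(n_tickets) if random.random() < 0.6)

# The mean payout converges to the ensemble average of $1.20 per ticket.
trials = 10_000
mean_payout = sum(play_alice(100) for _ in range(trials)) / trials
print(f"Mean payout on 100 tickets: ${mean_payout:.2f}")  # ~$120
```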

Bob’s game is optimized by a time-average

Unlike Alice’s game with a result for each ticket, there’s only one result to Bob’s game. He either doubles his ticket price or gets nothing back.

One way people tend to approach this is to apply an ensemble average anyway via expectation maximization. If you do this, you end up with basically the same argument that Alice had and try to bet all of your money. There are two problems with this.

One problem is that Alice and Bob are going to repeat their games as long as they can. Each time they do, they’ll have a different amount of money available to bet (since they won or lost on the last round). They want to know who will have the most at the end of it.

The repeated nature of these games means that they aren't ergodic. As soon as someone goes bust, they can't play any more games. If Bob bets the same way Alice does and goes all in, then each round he either doubles his money or walks away with $0. After one round, he's 60% likely to be solvent. After 10 rounds, he's only $0.6^{10} \approx 0.006$ likely to have any money at all. That's about half a percent, and Bob is likely to spend the last few days of their trip chilling in the bar instead of gaming.
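That number is easy to check directly (a quick sketch; the round counts are my own choices for illustration):

```python
p_win = 0.6

# Going all in, Bob stays solvent only if he wins every single round.
for rounds in (1, 5, 10):
    print(f"P(solvent after {rounds} rounds) = {p_win ** rounds:.4f}")
# After 10 rounds: 0.6**10 ≈ 0.0060, about half a percent.
```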

The second problem with expected value maximization here is that expected value is a terrible statistic for this problem. In Alice’s game, her outcomes converge to the expected value. In Bob’s game, his outcomes if he expectation maximizes are basically as far from the expected value as they can be.

This is why Bob should treat his game like a time-average. I highly recommend Ole Peters’s paper deriving time-average statistics for the St. Petersburg paradox to fully understand this, but I’ll give an overview of one derivation here.

As an intuition pump, let's look at the traditional expected value calculations. You first break a single result up into different “slices”. You have one slice for each possible outcome, and you scale each slice by the probability of the outcome and the value of the outcome. Then you sum. In equations, $\mathbb{E}[V] = \sum_i p_i v_i$, where $p_i$ is the probability of outcome $i$ and $v_i$ is its value.
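To make that concrete for Bob's single ticket (my own instantiation of the formula above, with $b$ as the ticket price):

$$\mathbb{E}[V] = 0.6 \cdot 2b + 0.4 \cdot 0 = 1.2b$$

This grows linearly with $b$, which is why naive expectation maximization tells Bob to go all in.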

The time average starts similarly. Instead of breaking a single outcome up, you break up the single event in time. In Bob’s case, we split the one time event up into two sections. One section is the win section, and it’s 60% of the time. One section is the loss section, and it’s 40% of the time. Each slice scales your bankroll according to the exponential growth formula, so you have to know the growth factor for the options.

Growth factor depends on how much you start with and how much you bet. Bob starts with $100 and bets $b$, so his growth factor on a win would be $\frac{100+b}{100}$, and on a loss it would be $\frac{100-b}{100}$.

Then bankroll scaling happens multiplicatively. After the winning portion of time, you'd have $100 \cdot \left(\frac{100+b}{100}\right)^{0.6}$. Then after the losing portion you'd have $100 \cdot \left(\frac{100+b}{100}\right)^{0.6} \left(\frac{100-b}{100}\right)^{0.4}$. That's your time average, and you want to maximize it. In equations, a time average looks like $\prod_i g_i^{p_i}$, where $g_i$ is the growth factor for outcome $i$ and $p_i$ is its probability.

A mathematician would say that Bob should bet whatever maximizes $\left(\frac{100+b}{100}\right)^{0.6} \left(\frac{100-b}{100}\right)^{0.4}$, where $b$ is how much he paid for a ticket. We find the argmax more easily by taking the log to turn the product into a sum. In other words, we want the $b$ that maximizes $0.6 \log\left(\frac{100+b}{100}\right) + 0.4 \log\left(\frac{100-b}{100}\right)$. After a little differentiation and some algebra, we find that the optimal value ends up being $b = 20$.
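For completeness, here's that differentiation and algebra spelled out (my own working, following the setup above):

$$\frac{d}{db}\left[0.6 \log\left(\frac{100+b}{100}\right) + 0.4 \log\left(\frac{100-b}{100}\right)\right] = \frac{0.6}{100+b} - \frac{0.4}{100-b} = 0$$

$$0.6(100-b) = 0.4(100+b) \implies 60 - 0.6b = 40 + 0.4b \implies b = 20$$

Twenty dollars is 20% of Bob's $100 bankroll.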

An important note here is that we used a logarithm to simplify the math, but we are not actually interested in maximizing the log of the value. We are maximizing our temporal average, and the logarithm is just a mathematical trick that makes finding the argmax easier.

The result here is the Kelly criterion. If Bob spends 20% of his bankroll on each ticket over multiple runs, his long run growth factor will converge to $1.2^{0.6} \cdot 0.8^{0.4} \approx 1.02$.
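Here's a quick simulation sketch to check that growth factor (the function name and round count are my own; accumulating in log space just avoids numeric overflow over many rounds):

```python
import math
import random

def kelly_growth(rounds: int, fraction: float = 0.2, p_win: float = 0.6) -> float:
    """Bet `fraction` of the current bankroll each round on a double-or-nothing
    game; return the realized per-round growth factor."""
    log_growth = 0.0
    for _ in range(rounds):
        factor = 1 + fraction if random.random() < p_win else 1 - fraction
        log_growth += math.log(factor)
    return math.exp(log_growth / rounds)

print(kelly_growth(100_000))  # ~1.02, matching 1.2**0.6 * 0.8**0.4
```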

What should Bob actually do?

The mathematician would tell Bob to Kelly bet on his game, but a stock trader would tell Bob to find a better game.

Alice's long run rate of return converges to 20% per turn. Bob's converges to about 2% per turn. Alice is doing much better than Bob, because she can access the ensemble-average return of the game.

Some arguments against Kelly betting, such as Abram Demski's post on the subject, note correctly that the ensemble average is higher than what you can get with Kelly betting. What those arguments don't take into account is that there are many wagers where the ensemble average just isn't available as an outcome.

If you can ensemble average, then you definitely should. If you can’t ensemble average, then maybe you shouldn’t bet at all. This is actually common wisdom among non-math people. Few are the parents who would advise their children to make their fortune through gambling games.

Gambling is unpopular as a way to make a living because people have learned, through long and horrible history, that gambling games don’t give you a good return. Even if the games are fair and you’re good at them. Even if you bet Kelly.

When to bet with the Kelly criterion

An enormous amount of time and energy over the past few centuries has been spent designing mechanisms that allow people to access ensemble average returns from inherently non-ensembled bets. This is what much of modern portfolio theory is about. It’s why index funds exist, and even a large part of the reason for mutual funds. Even VCs invest in many startups, knowing that the ensemble average will be high even if most individual startups go bust.

There are important cases where the ensemble average doesn’t apply. We got a taste of one of them when Sam Bankman-Fried infamously said he’d St. Petersburg paradox the universe. There’s a good writeup of this over at Taylor Pearson’s blog, but the short story is that you should not take double or nothing bets with the whole universe.

Here’s when double or nothing with the whole universe makes sense: when you have a huge number of fully fungible universes that you share value among after the fact. In other words, when you can access the ensemble average. I don’t know about you, but I only have the one universe.

I also only have the one life. What I choose to spend my time on, how I choose to live: the idea of the Kelly criterion can apply to these too, mostly in the form of aphorisms like “keep some powder dry” or “maintain slack in your schedule”.

There’s one final place that I find a lack of ensemble access to be important: if we create an AI superintelligence that takes major actions affecting the future of humanity. I want it to be able to figure out when it should Kelly bet, and when it’s ok to expectation maximize for an ensemble. I wouldn’t want SBF to double-or-nothing our universe, and I don’t want AI to do it either.