Distributed Multi-Armed Bandits

Slot Machines

Imagine that you are in a room filled with slot machines. The slot machines have no screens or lights or any advertising. They are all painted a dull grey, with a single number in large white print so that they can be easily identified. There are rows and rows of these slot machines extending as far as you can see in every direction. You have been given $100 in casino chips.

You walk up to the first machine, labelled #1, put in $1 of chips (that seems safe, right?), and pull the lever. $3 in chips comes out. You try it again, but this time only $0.50 comes out. Still optimistic about this machine, you continue putting in a dollar at a time, and most of the time it spits out more money than you put in. After a few minutes, you have made about $10.

You decide to push your luck with the next machine, labelled #2. You put in a dollar, and pull the lever. It spits out $0.10 in chips. Disappointed, you bet another $1 and pull the lever again; $0.50 comes out. You are already wary of machine #2.

Taking a moment to strategize, you consider going back to machine #1, which was working pretty well. But what if you were just unlucky with machine #2? There are more machines; should you try machine #3? That could be more like machine #1, but it could also be more like machine #2.

This is the Multi-Armed Bandit problem. There are many variations and strategies which you can read about. The best strategies all have some common features.

  • Track the payouts for each machine. Information is valuable, don’t ignore it.

  • Try to predict the distribution of payouts for each machine. If the next payout is predictable, then the data you collected will help. If it isn’t, then there is no cost to trying to predict it other than wasted mental effort.

  • Try to predict the distribution of payouts for the next machine that you haven’t tried yet. If the distribution is predictable, then the data you collected from previous machines will help. If it isn’t, then there is no cost other than wasted mental effort.

  • Balance exploiting known good machines with learning about the payoffs of new machines. In the beginning this will be an act of faith, but as you sample more machines your beliefs about the next machine will take shape.
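The features above can be sketched in a few lines of code. What follows is a minimal illustration using the epsilon-greedy rule, which is only one of many possible strategies and is chosen here for brevity. The function name, the `epsilon` value, and the example payout distributions are all assumptions made for the sketch, and it assumes a small, finite set of machines rather than endless rows of them.

```python
import random

def epsilon_greedy(machines, pulls, epsilon=0.1, seed=0):
    """Bet $1 per pull; return (total payout, number of pulls per machine).

    `machines` is a list of functions that take a random generator and
    return a payout. This interface is hypothetical, for illustration only.
    """
    rng = random.Random(seed)
    counts = [0] * len(machines)    # how often each machine was played
    totals = [0.0] * len(machines)  # sum of payouts recorded per machine
    winnings = 0.0
    for _ in range(pulls):
        untried = [i for i, c in enumerate(counts) if c == 0]
        if untried:
            # Try every machine once before trusting any estimate.
            arm = untried[0]
        elif rng.random() < epsilon:
            # Explore: pick a machine at random.
            arm = rng.randrange(len(machines))
        else:
            # Exploit: pick the machine with the best average payout so far.
            arm = max(range(len(machines)), key=lambda i: totals[i] / counts[i])
        payout = machines[arm](rng)
        counts[arm] += 1
        totals[arm] += payout
        winnings += payout
    return winnings, counts

if __name__ == "__main__":
    # Made-up payout distributions: machine #1 pays about $1.20 per $1 on
    # average, machine #2 about $0.30, machine #3 about $0.50.
    example = [
        lambda rng: rng.uniform(0.0, 2.4),
        lambda rng: rng.uniform(0.0, 0.6),
        lambda rng: rng.uniform(0.0, 1.0),
    ]
    winnings, counts = epsilon_greedy(example, pulls=1000)
    print("total payout:", round(winnings, 2), "pulls per machine:", counts)
```

The recorded totals are the "track the payouts" step, the running average is a crude prediction of each machine's payout, and `epsilon` sets how often you give up a known good machine to learn about the others.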

Central Planning

Let’s consider another imaginary scenario. Imagine that a dictatorship has just toppled. While traveling the world, you have found yourself in the right place at the right time, and you have filled the power vacuum left by the previous dictator. The dust of the revolution settles, and you are fulfilling the day-to-day duties of a national leader. Your cabinet is actually mostly people from the previous regime. They tell you that they just want to siphon money from the country and enrich themselves as much as possible. They got rid of the previous dictator because he was bad at doing this. If you do a good job, you can keep your earnings, and eventually go home. If you don’t, then you’ll meet the same fate as the previous dictator. The country had a centrally planned economy under the previous dictator, and now everyone is looking to you for an economic plan.

For the sake of simplicity, let’s assume that your country uses dollars and doesn’t have its own currency. Your country trades with other countries using dollars, so dollars are a good measure of productive activity.

To enrich yourself and your cabinet, all you need to do is make the country’s economy as productive as possible, while taking as much as possible for you and your cabinet. Any resources that you take out of the economy to pay your cabinet are resources that cannot be used to generate more resources. Fundamentally, maximizing productivity is an optimization problem. Everyone in your country has about 16 waking hours a day in which to perform labor. Additionally, there are natural resources, and products that people have made. Most things are in the physical possession of a single person or a family, but you are the dictator and all resources belong to the state, so you can take them if you want. Your cabinet doesn’t really like to use the “slavery” word, but you can also arbitrarily direct the labor of your citizens.

Day Plans

First you realize that you have no idea how to make all of the things that the country needs to function. You could not sit down and dictate exactly what everyone should spend their time doing without everything falling apart. You don’t know how to grow crops; you don’t know how to harvest lumber. You are going to need to delegate that. Not sure who to trust, you decide to host open office hours where citizens can come and pitch their plan for the day to you. You will approve or disapprove and allocate resources accordingly. If someone has a plan that you really like, you can assign other people and resources to it. If someone has a plan that you don’t like, you can veto it and assign them and their resources to another plan. At the end of the day all resources must be registered with the state, and any dollars turned in to the treasury pool so they can be reallocated the next day.

You think back to your earlier daydream of the casino with the dull grey slot machines. Each citizen’s labor is like 16 hours of casino chips that they start with every day. The country’s resources are like the pool of casino chips that have accumulated. Each citizen’s day-plan is like a slot machine which has not been explored. Their plan might be profitable, it might not. Just like with the slot machines, recording the payoffs for each plan is valuable, and useful for estimating future payoffs. If there is predictable structure in the value of day-plans, then you benefit from recognizing and exploiting it. It makes sense to bet on slot machines and plans that have done well previously.

Back in your imaginary regime, things are humming along. You have your cabinet keeping detailed records of each citizen’s day-plans and how productive they are. Your cabinet builds statistical models, and each citizen has a probability distribution associated with the expected payoff of their next day-plan. A few prominent citizens are consistently producing huge payoffs, and you are getting good results by allocating more and more resources to them.
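One hedged way to picture the cabinet's record keeping is a Beta distribution per citizen, updated after every day-plan, with resources going to whoever wins a random draw from those distributions (Thompson sampling). This is an illustrative assumption, not a description of what the cabinet must do: the class and function names are invented, the uniform prior is arbitrary, and each plan is reduced to a yes/no "was it profitable" outcome.

```python
import random

class CitizenRecord:
    """Beta posterior over the chance that a citizen's next day-plan is profitable."""

    def __init__(self):
        # Beta(1, 1) is a uniform prior: an assumption, not taken from the essay.
        self.successes = 1
        self.failures = 1

    def update(self, profitable):
        # Record the outcome of one day-plan.
        if profitable:
            self.successes += 1
        else:
            self.failures += 1

    def mean(self):
        # Expected probability that the next plan is profitable.
        return self.successes / (self.successes + self.failures)

    def sample(self, rng):
        # Draw one plausible success rate from the current belief.
        return rng.betavariate(self.successes, self.failures)

def choose_citizen(records, rng):
    """Thompson sampling: fund the citizen whose sampled success rate is highest."""
    return max(records, key=lambda name: records[name].sample(rng))

if __name__ == "__main__":
    rng = random.Random(0)
    records = {"farmer": CitizenRecord(), "logger": CitizenRecord()}
    for outcome in (True, True, True, False):
        records["farmer"].update(outcome)
    records["logger"].update(False)
    print("funded today:", choose_citizen(records, rng))
    print("farmer's estimated success rate:", round(records["farmer"].mean(), 3))
```

Citizens with little history have wide distributions and so still win a draw now and then, which matches the occasional chance that ordinary citizens get to try a plan of their own.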

Even though your system is starting to look like a free-market economy, your citizens are not free. They have no property of their own, and everyone lives in state-assigned housing, at the exact same standard of living. They have no control over their time, and how they spend it must be approved by you. Most of your citizens rarely decide how they spend their days. When you allocate resources to them, you get less than you put in, and so you assign them to be part of someone else’s day-plan. Every once in a while they get to try one of their own day-plans, but usually it doesn’t work out and they get assigned back to someone else’s day-plan again. A small number of citizens plan most citizens’ days.

Week Plans

Eventually after a few weeks of hosting non-stop office hours for your citizens to pitch day-plans, you decide you want a break. What’s the point of being a dictator if you have to work all the time? You realize that your job, meeting with people and approving or vetoing day-plans, could also be done by someone else. The same algorithm that you are using for approving day-plans could be used to test if other people are good at choosing day-plans. It just so happens that people have been pitching this to you already. Previously you were cautiously untrusting, but now you are willing to take a chance.

After a few iterations, you are able to determine who is good at approving and vetoing day-plans and who isn’t. We’ll call this skill meta-day-planning. You have your best meta-day-planners running their own office hours all over the country. Eventually all of the resources are managed daily by these individuals. You spend 1 day a week reviewing the weekly performance of your meta-day-planners. Sometimes they do a bad job and you reassign them, giving their role to someone else who wants to be a meta-day-planner. Other times you only shuffle funds around, moving resources from worse planners to better planners, but letting everyone keep their current role. Now your cabinet spends their time keeping records and building statistical models for the expected weekly payoff of each meta-day-planner.

Distributed Planning Algorithms

The imaginary dictatorship example has some problems. It’s difficult to coordinate everyone’s time as finely grained as described. Having everyone travel to office hours at the start of the day imposes overhead. They could spend that travel time doing something productive. The day-planning approval process is directly consuming labor hours.

You could try to plan less frequently, but then you have to bet more labor per encounter. Even assuming you and your central planners always know best, there is a tradeoff: planning more frequently yields better decisions but consumes more labor, while planning less frequently saves that labor but yields worse decisions.

It would be great if there were some sort of system which automatically bet more resources on good day-plans. Then you wouldn’t have to do anything. You could just set up the rules, sit back, and watch it go.

What about a system that bets a small fixed amount on any unproven day-plans, and then reinvests 100% of the surplus from profitable day-plans? If someone pitches a day-plan, and it yields extra dollars, those dollars automatically go towards whatever they want to do the next day. Rather than traveling to your office, handing over the dollars, and pooling them in the treasury for tomorrow, your citizens keep the dollars, and are free to allocate them again.
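As a rough sketch of that rule, the simulation below gives every citizen the same fixed daily stake and lets them reinvest whatever their previous plan returned. The multiplicative payoff model and the names `skills`, `stake`, and `noise` are assumptions made for illustration; real day-plans would not behave this neatly.

```python
import random

def simulate_reinvestment(skills, days, stake=1.0, noise=0.5, seed=0):
    """Return each citizen's holdings after `days` of automatic reinvestment.

    `skills[i]` is a hypothetical average return per dollar invested by
    citizen i, and `noise` controls how much each day's return varies.
    """
    rng = random.Random(seed)
    holdings = [0.0] * len(skills)
    for _ in range(days):
        for i, skill in enumerate(skills):
            # The fixed stake is the citizen's own labor for the day;
            # everything kept from yesterday is reinvested alongside it.
            invested = stake + holdings[i]
            multiplier = skill * rng.uniform(1.0 - noise, 1.0 + noise)
            # No trip to the treasury: the citizen keeps the full output.
            holdings[i] = invested * multiplier
    return holdings

if __name__ == "__main__":
    # Made-up skills: one citizen's plans return 20% more than they cost on
    # average, another breaks even, a third loses 40%.
    final = simulate_reinvestment(skills=[1.2, 1.0, 0.6], days=30)
    print([round(h, 2) for h in final])
```

No one approves or vetoes anything here, yet citizens whose plans pay off end up directing more resources over time, while everyone still gets their fixed daily stake to try again.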

It turns out this algorithm has already been discovered many times across many civilizations. It’s an elegant algorithm that is easy to implement because of its simplicity. It’s usually called property rights.

The fixed amount of chips that each citizen gets every day is their own labor. This is considered a resource which the citizen controls. It corresponds to exploration in the multi-armed bandit problem. Everyone is free to explore new projects on their own time. You could say that slavery as a policy sets the exploration parameter sub-optimally low. The chips that are spit out by the slot machine are like the resources acquired or manufactured during that day. Taking all of those chips and putting them right back in the machine and pulling the lever is like allowing the person that made them to control them.

There is no moral philosophizing necessary to justify property rights. You don’t have to believe in rights at all. In the imaginary dictatorship, all resources, including labor, belong to the government. The point of the economy was to enrich you, the dictator, and your cabinet. Even given that end, property rights emerged as a simple algorithm, much less complicated and less costly than planning the economy yourself, or with a hierarchy of meta-day-planners.