Magic Arena Bot Drafting

Epistemic Status: Attempting to be useful and to learn via exploration. Real proposals. Content assumes familiarity with Magic: The Gathering booster drafting, but does not much depend on knowledge of deep Magic strategy, so it holds potential interest for those interested in narrow game AI and machine learning.

Reflections Partially Brought on By (Eric Moyer @ Channel-Fireball): Drafting Like a Computer Part 1, Part 2

Part 1 – The Problem is Hard

I appreciated Eric Moyer’s articles because they show that building a good bot is hard. They do so by failing to design good bots. Eric’s proposed bots are overly formulaic and predictable, with obvious holes in their systems. If you are feeding one of these bots and know their rankings, you can narrow down their whole path based on possible first pick colors and strengths in highly exploitable ways. The proposed bots will reliably fail to take into account holistic information, they won’t draft playable decks or try to shore up their weaknesses or strengths, and so on, and so on.

That’s not because Eric isn’t making a reasonable first attempt here. It’s because the problem is hard! Programming a bot to draft reasonably is hard.

What we want above all is a bot experience that feels human, where you’re trying to figure out what is going on, cannot ever fully rely on your hunches, and need to consider all the information at your disposal. Where you get to make hero reads, cram decks down the throats of other drafters, sweat your attempts to wheel key cards, and otherwise have a fresh, exciting and new experience more often.

Making this even more difficult is that bots that we know are bots will inherently feel like bots, and that bots will get blamed even more than humans for mistakes. When bots mimic human patterns of doing things that look crazy, the bot often looks crazy rather than realistic or human.

As an extreme example, if a person passes us an Ugin, we know that a human made a horrible mistake. Maybe someone didn’t see it, or misclicked, or is trying something weird. And we can be happy for our good fortune. If a bot passes it, as happened to Gerry Thompson on the first day of War of the Spark drafts, we get deeply confused and frustrated.

Same goes for everything else the bots do. Suddenly you have someone who interacted with the problem, and is therefore responsible for it, as per the Copenhagen Interpretation of Ethics. Whereas if you leave the decisions to humans, they are free to be idiots or have their own quirky reasons. That’s a big reason people don’t like the bots. Another big one is consistency. The bots reliably do similar things in similar spots. You can predict easily which cards will wheel, and which cards one can pick up cheap, and those answers rarely change. It’s easy to construct decks that rely on the cards the bots don’t want, and reliably draft those decks.

Always drafting three of the same set makes this that much more reliable – I know what the market research says, but I miss two-set and three-set drafting. Alas.

This leads to three broad conclusions.

First, humans really are better than bots even if the bots were as good or better at drafting real decks and/​or strong cards. So we should use human drafters wherever possible. Humans inherently have understandable logic but also variation and unpredictability, and are viewed as their own agents rather than part of a blameworthy system. We can cry over how this is unfair, or we can accept that bots will be held to a higher and more complex standard than we would like. This means, among other things, making sure that certain crazy things do not happen with bots, even if they happen in real life.

Second, bots need to appear unpredictable and inconsistent to give a good experience, which requires their decision process to be complex. But without looking too much like idiots. Each bot should come with its own set of preferences and places of emphasis, over cards and over colors, and ideally other things as well such as how much it cares about what it is forcing, what signals it gets, its mana curve, and so on. We need to mix up who and what you are dealing with, so you can’t play them like a fiddle, or rely on them to act the way you expect. And so part of a draft is figuring out which types of bots you are dealing with, in addition to what might be open.

Third, bots do not need to be geniuses that go super deep with logic to figure out their pods and situations, or otherwise do something amazing. It would be cool if they were, but the players won’t notice, so there’s no point. Who cares if the bots improve their average deck strength in this way? It might even be bad, because very few real players are geniuses. We can’t all be Ben Stark.

Part 2 – What a Solution Might Look Like

Saying a problem is hard is important, describing how it is hard is a valuable unhint – my high school classmate Daniel Strong coined that term to describe information that got you closer to finding the solution, but made your estimate of your distance from the solution go up. I am sad it never caught on.

Ideally, it will be a solution good enough that we can offer something deeply cool: To show, after the draft is completed, what the bots did. You should be able to see what the bot passing to you was up to, and how the bot you passed to handled the signals you sent its way. Why not? It would be super cool, and only begin to scratch the surface of improving the drafting experience. When I make my own game with its own limited formats, the sky’s the limit on the awesome.

With or without such lofty goals, a first sketch of a solution is almost always better than mumbling about how it would be great to have a solution at all.

So here’s my proposal for how to build better draft bots.

The basic idea is that each bot has a set of attributes that they use to guide how they draft. As a first proposal, a bot will care about the following things:

  1. Color, color combination and archetype preference.

  2. Individual card rankings, in general and/​or within color combinations and archetypes.

  3. Speculativeness, which is how much they discount a card when they aren’t sure they can play it.

  4. Preferred number of creatures, removal and tricks, and how much they care.

  5. Mana curve.

  6. Signal received, and what constitutes a signal.

  7. Signal given, and what constitutes a signal (that second part will match in both directions).

  8. Stubbornness (how much they attach themselves to early picks).

  9. Cost of commitment (how much they want to commit or stay open).

  10. Rare drafting.

  11. Mana fixing, and willingness to go 3+ colors.

That seems like a lot, and it is. You can certainly implement a version that ignores some or most of this list, and still does a reasonable drafter impression. Some of these might even end up being counter-productive from the point of view of the average user, who never sees the intricacies.

The basic idea is that for every pick, all of these will function as weights. When a draft is created, each of the seven computer drafters is instantiated with a randomized set of preferences over everything on the list, with choices correlating somewhat if they are observed to be correlated among human drafters.
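
To make that concrete, here is a minimal sketch (in Python) of what one randomized bot personality covering the list above might look like. Every attribute name and numeric range below is my own illustrative guess, not anything Arena actually uses:

```python
import random
from dataclasses import dataclass, field

COLORS = ["W", "U", "B", "R", "G"]

@dataclass
class BotProfile:
    """One drafter's randomized personality. All ranges are illustrative guesses."""
    color_bias: dict = field(default_factory=dict)   # (1) color / pair preference
    card_bias: dict = field(default_factory=dict)    # (2) per-card nudges
    speculativeness: float = 1.0                     # (3) discount on cards it may not play
    creature_target: int = 15                        # (4) preferred creature count
    curve_weight: float = 1.0                        # (5) how much it cares about curve
    signal_read_weight: float = 1.0                  # (6) weight on signals received
    signal_send_weight: float = 1.0                  # (7) weight on signals given
    stubbornness: float = 1.0                        # (8) attachment to early picks
    commitment_cost: float = 1.0                     # (9) preference for staying open
    rare_draft_bonus: float = 0.0                    # (10) extra value on rares
    splash_tolerance: float = 1.0                    # (11) willingness to go 3+ colors

def random_profile(rng: random.Random) -> BotProfile:
    """Instantiate one bot with randomized preferences over everything on the list."""
    return BotProfile(
        color_bias={c: rng.gauss(0.0, 0.15) for c in COLORS},
        card_bias={},  # could be filled with per-card noise, e.g. rng.gauss(0, 0.2)
        speculativeness=rng.uniform(0.7, 1.0),
        creature_target=rng.randint(13, 17),
        curve_weight=rng.uniform(0.5, 1.5),
        signal_read_weight=rng.uniform(0.5, 1.5),
        signal_send_weight=rng.uniform(0.0, 1.0),
        stubbornness=rng.uniform(0.5, 1.5),
        commitment_cost=rng.uniform(0.5, 1.5),
        rare_draft_bonus=rng.choice([0.0, 0.3, 0.6]),
        splash_tolerance=rng.uniform(0.5, 1.5),
    )

if __name__ == "__main__":
    rng = random.Random(42)
    pod = [random_profile(rng) for _ in range(7)]  # the seven computer drafters
    print(pod[0])
```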

The computer will add up all considerations for each card, then pick the card with the highest combined weight for all of its picks after making that pick, plus an adjustment for how open it remains and thus the expected value of remaining packs. Thus, if a pick increases the computer’s estimate of its chance of playing red from 75% to 90%, that increases the value of all previously chosen red cards, not only the card it’s picking, and reduces the value of its other chosen cards. Getting these estimates right will be the trickiest part of the whole equation.
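
A toy version of that pick evaluation, under the assumption that a ‘card’ is just a color and a rating and that per-color play probabilities come from elsewhere, might look like this (all numbers and structures are placeholders):

```python
def pick_value(pool, candidate, play_prob_before, play_prob_after,
               speculativeness=1.0, commitment_cost=0.3):
    """
    Score a candidate pick as the total value of all picks *after* making it,
    each discounted by the (updated) chance its color gets played, minus a
    penalty for how much the pick closes off future packs.
    A 'card' here is just a (name, color, rating) tuple, purely illustrative.
    """
    def discounted_total(cards, play_prob):
        total = 0.0
        for name, color, rating in cards:
            p = play_prob.get(color, 1.0)   # chance this color ends up in the deck
            total += rating * (p ** speculativeness)
        return total

    value_after = discounted_total(pool + [candidate], play_prob_after)

    # Staying open has option value on future packs; model it crudely as the
    # rise in commitment to the most likely color.
    openness_drop = (max(play_prob_after.values(), default=0.0)
                     - max(play_prob_before.values(), default=0.0))
    return value_after - commitment_cost * max(0.0, openness_drop)


if __name__ == "__main__":
    pool = [("Red Removal", "R", 3.0), ("Red Flyer", "R", 2.5), ("Green Beater", "G", 2.8)]
    before = {"R": 0.75, "G": 0.40, "B": 0.30}
    after = {"R": 0.90, "G": 0.35, "B": 0.25}   # taking another red card raises red
    print(pick_value(pool, ("Red Bomb", "R", 3.5), before, after))
```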

Consider the pack shown in the Drafting Like a Computer article:

The numbers on the right are LSV’s set review’s ratings for each card. In his model, the reasoning would go something like “There’s a 1-color 3.5, which wins over a 2-color 3.5 and a 1-color 3.0.”

In our version, the numbers would be less even. We’d effectively see the bot saying something like “If we take the red 3.5, we’re passing a primary black and green pack to the left (+0.3), and we have only a small red commitment (-0.1) and can likely run it (85% of value). Whereas if we take the rare, we get a rare (+0.3), which is good, but our color commitments on both sides are high (-0.5) and two of the three best remaining cards are in those two colors (-0.1).” All of these magnitudes would be different for different bots, to simulate different drafters.

Thus it would sometimes say “But I really like black/green in this format (+0.3) and dislike red (-0.2), and a rare is a rare (extra +0.3 for +0.6 total), and besides I don’t think the red card is all that (-0.2)” and take the rare, whereas other times it would say “I highly value staying open (+0.3) and don’t care about rares at all, red is pretty good (+0.1) and I just want to win (+0 rather than +0.3)” and take a red common instead.

Then in pick 2:

It would now consider signal passed, signal received, and so on. Since each bot would have a different card value for each card, each bot would care differently about what it passed to the left, and read differently what was passed to it from the right. This would make bots that much more divergent, which is very good. Already here at pick two, there will be important uncertainty in what happens, with a few extreme drafters choosing to further commit to red, and some disliking Ugin’s Construct, despite a seemingly clear choice. Most drafters will care more about staying open and reliably playing cards; occasionally you’ll see one that prefers to commit more to improve the signal and wants to take a second red card here. Ideally that would be rare due to the backup red cards available, which would typically reduce the measured impact of cutting each card. And so on.

Later in the draft, the bots will deal more and more with their own unique problems and circumstances.

Player rank should impact the quality of the opposition if and only if players are paired via ranking. Otherwise, bot quality should be maximized for high-stakes events, and perhaps lowered somewhat for low-stakes events. We don’t want to punish drafters in unranked matches for having a high ranking.

Part 3 – Getting There From Here

It’s all well and good to lay out a grand vision for how a bot should act. It is quite another to explain how to program the damn thing. A lot of the things I suggest in Part 2 won’t be easy to get right. An initial version will doubtless be stripped down dramatically from this final vision, with far fewer considerations and a much simpler method of color selection.

Thus, this third section won’t aim to get us all of the way there. This is my vision of how to get a minimum viable product that will fix the bulk of the obvious issues that players observe, and get the bots drafting reasonable decks.

That means solving a few problems. Here’s my take on how to do that.

Whenever possible, throughout this process, we’re going to use the simplest and most reliable indicator of all, and rely on the players themselves and their draft choices.

Problem 1: Baseline Card Rankings

Before one can randomize card values, one must first get initial card values right. This seems to be an ongoing issue, which is surprising. We can divide the problem in two: The problem before we have a substantial number of human drafts to reference, and the problem after that.

Once we have enough human drafts, the baseline ranking of a card is how likely drafters of appropriate rank and results are to draft a given card. So take all Arena accounts that have at least ten (ranked plus traditional) matches with a 60%+ win rate, say, and look at their average pick number for each card, then adjust for rarity (since rare cards don’t have to compete with other rare cards), such that the bots would draft the cards with the same average pick number as the humans, and you’re good.
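
As a sketch of that baseline step, assuming the filtered human pick data is already in hand (the account filtering and data plumbing are waved away, and the rarity offsets are illustrative guesses), the core computation is just an average pick number per card with a rarity adjustment:

```python
from collections import defaultdict

def baseline_rankings(human_picks, rarity_of):
    """
    human_picks: iterable of (card_name, pick_number) taken from qualifying
    accounts (the 60%+ win rate filter is assumed to happen upstream).
    rarity_of: card_name -> 'common' | 'uncommon' | 'rare' | 'mythic'.
    Returns card_name -> score, where an earlier average pick means a higher
    score, nudged per rarity because rares rarely compete with other rares in
    the same pack.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for card, pick_number in human_picks:
        totals[card] += pick_number
        counts[card] += 1

    rarity_offset = {"common": 0.0, "uncommon": 0.2, "rare": 0.8, "mythic": 1.0}
    rankings = {}
    for card, total in totals.items():
        avg_pick = total / counts[card]
        # Negate so that "picked earlier" means a higher baseline score.
        rankings[card] = -(avg_pick + rarity_offset[rarity_of[card]])
    return rankings
```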

Magic Online can be used to jump-start this process, using all Magic Online drafters as the initial sample. One can then update in real time, by having new pods read a file containing current rankings. If we’re not comfortable using Magic Online to jump start, or we want Arena to go first, we can use an R&D survey, a player survey or the average of set reviews for the first hour, then use the human drafters from there.

Later, when we add more considerations, we’ll find that those other considerations alter how willing bots are to take various cards early. To avoid factoring these things twice (once for what humans do, another time for what bots do), we’ll then want to adjust our baseline ratings until simulations of all-bot drafts have the right average pick number for all cards. Basic search algorithms should be sufficient for this, as would be basic machine learning.

Thus, I will assume that we will always calibrate our other considerations using current baseline card ratings, then adjust baseline card ratings to reflect the added new consideration, and continue to iterate as we advance. For any given set of other considerations, there should always exist (unless I’m missing something) a set of card rankings that causes average pick number to be correct for all cards. This then introduces some other issues, because the bots will incorrectly overvalue or undervalue cards in secondary ways, but I’m willing to worry about those problems later.
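
A minimal sketch of that calibration loop, where simulate_avg_pick is a stand-in for whatever all-bot draft simulator actually exists and the step size is a guess:

```python
def calibrate_ratings(ratings, target_avg_pick, simulate_avg_pick, rounds=50, lr=0.1):
    """
    Nudge baseline ratings until all-bot drafts reproduce the human average pick
    number for every card. simulate_avg_pick(ratings) is assumed to run a batch
    of all-bot drafts and return card -> average pick number. This is the kind of
    basic search the text says should be sufficient.
    """
    ratings = dict(ratings)
    for _ in range(rounds):
        simulated = simulate_avg_pick(ratings)
        for card, target in target_avg_pick.items():
            error = simulated[card] - target   # picked too late -> positive error
            ratings[card] += lr * error        # raise the rating if bots take it too late
    return ratings
```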

I will proceed to the next section as if this means that we’ve solved the individual card valuation problem, and it will stay solved while introducing additional complexity.

Problem 2: The Signal

We can mimic human behavior via optimizing over plausible things to consider, either on Arena, Magic Online or both. Thus, look at what pack contents being passed to you cause players to alter their picks, or what cards or patterns of cards one passes that cause players to alter their picks, and mimic those choices. We can (at least for now) abstract cards away as “Passed a red 3.2 and a red 2.9 and a blue 2.7.” So we use gradient descent or other machine learning techniques to figure out what weights to put on such matters to best mimic human choices. Again, if we want ‘advanced’ versus ‘basic’ bots, we can train on different sets of human data for both. Once we have done this for past sets, we can use the past version as the baseline early in the next set, until we have new human data.
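
Here is a tiny stand-in for that fitting step, assuming each card in each historical pack has already been abstracted into a handful of numeric features (ratings, on-color flags, counts of cards seen passed) and we recorded which card the human took. It fits one weight per feature by plain gradient descent on a softmax choice model; the feature names are placeholders:

```python
import math
import random

def fit_signal_weights(training_packs, features, steps=2000, lr=0.05):
    """
    training_packs: list of (cards_in_pack, chosen_index), where each card is a
    dict of abstracted features such as {"rating": 3.2, "in_my_colors": 1.0,
    "passed_in_color": 2.0}. Fits one weight per feature so that a softmax over
    weighted feature sums makes the human's actual pick likely.
    """
    weights = {f: random.uniform(-0.1, 0.1) for f in features}

    def score(card):
        return sum(weights[f] * card.get(f, 0.0) for f in features)

    for _ in range(steps):
        pack, chosen = random.choice(training_packs)
        scores = [score(c) for c in pack]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        probs = [e / z for e in exps]
        # Gradient step on the negative log-likelihood of the chosen card.
        for i, card in enumerate(pack):
            indicator = 1.0 if i == chosen else 0.0
            for f in features:
                weights[f] += lr * (indicator - probs[i]) * card.get(f, 0.0)
    return weights
```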

Ideally, rather than use fully flexible machine learning techniques that will be impossible to interpret or customize, we’d fix the computer to only consider a few things and put weights on those things. This then allows us to randomly alter those weights to create different personalities, or to let us experiment with what happens with bots that creatively use different weights. I’d rather have something we can understand, even if it performs a little worse, for many reasons. At least, until we have the bots playing out their matches, and we can evaluate them on that basis.

The big problem with this approach is that it would miss important subtlety that humans would pick up upon, such as a red card that wants one to pair it with black versus another that wants to be paired with white, and a third that wants to be a splash card. These problems are hard, hard to solve all at once, and there’s a lot of them. So in the long run, we’d want to move to a system that was smart enough to pick up on what each card was signaling, in terms of colors, color commitments and color combinations and even archetypes, rather than assume the card’s colors said it all, slash allow the system to learn arbitrary things about drafting via machine learning.

A second approach would be to not worry at first about what humans do, and instead run simulations of bots drafting against each other, trying to maximize the quality of their decks, and learn via self play. The problem with this is that it would only be as good as our evaluation of what a good deck is, which is also a problem for making good later picks, which we’ll talk about solving in earnest in problem four.

In the meantime slash as a preview, you can certainly improvise a not-too-terrible version that gives every card a value, then discounts that value based on the difficulty of casting that card given its color and total mana requirements (so a splashed card would both hurt the value of the rest of the deck, and be discounted itself, and so on). That would let you get started so you could bootstrap.
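
A possible version of that bootstrap evaluator, with a deliberately toy notion of castability (a fudge factor, not a real hypergeometric calculation), could be:

```python
def deck_value(deck, lands):
    """
    Bootstrap deck evaluator: every card has a raw value, discounted by how
    castable it looks given the mana base. 'Castability' here is a toy proxy:
    the fraction of lands producing each of the card's colors (plus a fudge
    term), multiplied once per colored pip.
    deck:  list of (value, colors) where colors is e.g. "R", "BB" or "WB".
    lands: dict of color -> number of sources, e.g. {"R": 9, "G": 8}.
    """
    total_lands = sum(lands.values()) or 1
    total = 0.0
    for value, colors in deck:
        castability = 1.0
        for color in colors:
            castability *= lands.get(color, 0) / total_lands + 0.35
        total += value * min(1.0, castability)
    return total

# Example: a double-black splash gets discounted heavily on a 9/8 red-green base.
print(deck_value([(3.5, "R"), (2.8, "G"), (4.0, "BB")], {"R": 9, "G": 8}))
```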

Problem 3: Choosing Colors and Strategies

The trickiest and most impactful decision of a draft is what colors and strategy to pursue, and when to commit to it to varying extents. Should one stay open in one or both colors, or rush in to send a stronger signal and get to assembling the pieces? Good players have widely varying philosophies here. I would want the bots to also have widely varying philosophies, but more importantly I’d want them to have reasonable ones.

Note that the computer would be wise not to consider what it was thinking in the previous pick, when deciding on the current pick. Saying one is fully committed is an observation about future decisions rather than an alteration of them. If the facts change, I change my opinion.

The simplest and most extreme version of color selection is to make a pick in each of two colors, then only draft cards in those two colors for the rest of the draft whenever possible. That’s your deck. There are some humans who effectively follow this strategy, and there’s reason to suspect current bots are not too far from it. But its full form is rare, and it is not very effective.

A more reasonable version of this would be to value your current picks by valuing cards in your top two colors only, up to a maximum of 22-24, modified by the signals sent and received and the resulting likelihood of future value in those colors. Thus, if I pick a green 4.3 and a red 3.2, then am choosing between a red 2.1 and a blue 3.7, I would compare (4.3+3.2+2.1) to (4.3+3.7). We’d want to calibrate the point scale such that such additions were reasonable, and use what pick of the draft we are on to determine what placeholder to use for unfilled slots, which is also how we’d account for signals. So if we are ‘ahead of schedule’ we would assume the marginal card we end up with is likely pretty good, whereas if we’re ‘behind schedule’ and it’s pick 9 and we’re considering a color pair where we’d only have 2 cards so far, we’d assign a low value to (at least the last few) remaining future cards, since we’ll often be scrounging for value. We could calibrate and train such systems via all-bot drafts, once we had a reasonable first guess to work from, or from seeing what works and doesn’t work in drafts against humans.
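
A sketch of that comparison, applied to the green 4.3 / red 3.2 example, with a completely made-up placeholder schedule that would need the calibration described above:

```python
def pair_value(picks, pair, pick_number, slots=23):
    """
    Value a color pair as the sum of our best on-color picks so far, plus a
    placeholder value for every still-unfilled slot. The placeholder shrinks as
    the draft goes on and as we fall behind schedule; the schedule below is an
    illustration, not a tuned one.
    picks: list of (rating, color); pair: e.g. {"R", "G"}.
    """
    on_color = sorted((r for r, c in picks if c in pair), reverse=True)[:slots]
    unfilled = slots - len(on_color)
    behind = max(0, pick_number - 2 * len(on_color))   # toy "behind schedule" penalty
    expected_future = max(0.5, 2.5 - 0.05 * pick_number - 0.1 * behind)
    return sum(on_color) + unfilled * expected_future

picks = [(4.3, "G"), (3.2, "R")]
print(pair_value(picks + [(2.1, "R")], {"R", "G"}, pick_number=3))  # (4.3+3.2+2.1) plus placeholders
print(pair_value(picks + [(3.7, "U")], {"U", "G"}, pick_number=3))  # (4.3+3.7) plus placeholders
```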

An improved version of that would be to value all colors/pairs/strategies somewhat, but discount them by the probability you end up playing them, which is a function of how good they look in expectation right now. So sometimes, you’d take an amazing off-color card even if it likely wasn’t going to cause you to switch, because even a small chance of playing it was enough value to justify that, and then see what happens. Or because we anticipate a potential splash. The bot could also simulate what the mana might look like, and see whether splashing is likely to make the deck better.

That’s also a good way to value drafting lands or other mana fixing. Mana fixing reduces the discount you apply on the other cards in your deck, including future picks, but in most cases does not otherwise provide value. So if you pick a dual land, this causes your expected mana base to improve, which makes all cards in the improved colors more valuable, which is exactly what is actually happening, and then you apply a penalty based on how much downside the card has to use – so if you draft a Sacred Foundry, it’s actually a negative-value card but it increases your colored land count by one, so it sometimes is worth it. And like everything else, there’s variance in how much each bot cares about its mana, and how much it discounts cards on that basis.

Getting the formula for ‘how often am I going to be drafting this color going forward’ correct will be tricky, and again something we’ll want to tune. But it’s not terrible to start by assuming something like this: being the total worth of one first-pick quality card ahead of the alternatives in the middle of pack one, considering all factors, puts you at something like 75%, two such picks at 95%, three at 99%, with that getting more extreme as the draft goes on.
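
Something like the following toy curve would encode those anchors; every number here is a guess to be tuned, including the draft length:

```python
def commit_probability(value_lead, picks_made, total_picks=45):
    """
    Toy curve for 'how often will I keep drafting this color from here?'.
    value_lead is measured in units of first-pick-quality cards of advantage
    over the alternatives; the anchors follow the rough numbers in the text
    (one card ahead in mid pack one ~75%, two ~95%, three ~99%) and tighten
    as the draft goes on.
    """
    anchors = {0: 0.50, 1: 0.75, 2: 0.95, 3: 0.99}
    base = anchors[max(0, min(3, round(value_lead)))]
    # The anchors are meant to hold around the middle of pack one (~pick 7);
    # past that point, push the probability toward certainty.
    progress = max(0.0, (picks_made - 7) / (total_picks - 7))
    return min(0.999, base + (1.0 - base) * progress)

for lead in (1, 2, 3):
    print(lead, round(commit_probability(lead, picks_made=7), 3))
```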

As the end of the draft approaches, the bots should naturally start to get worried about finishing their decks in reasonable fashion, if they’re still caught between three or more colors, and start to put higher value on depth and on leaning on small advantages, and less weight on signals because there’s less time left to use them. So we should see reasonable play emerge here.

Archetypes are trickier. We all know that some color combinations are better than others, some cards are better in some color combinations, and so on, but that can be hard to quantify. What to do?

Look at human drafts in the later packs. Look at deck frequency and match results.

As a first, level-one attempt, one can define each two-color pair as having different card values assuming that theme. We can get those values implicitly from the decisions players make in the third pack of the draft, for those drafts that are clearly in a given two-color pair (e.g. they have number of picks and assessed total value that is heavily concentrated in exactly two colors). We could add picks made earlier for drafts that have clearly committed, and which then stayed committed, and still likely have good data. Thus, if, given a commitment to white/blue, a given white card has an average pick number of 4.6, as opposed to 6.3 in white/black decks, that tells us it’s a better card in white/blue. So we set the conditional numbers accordingly. Ideally, we’d then figure out the value we have in each color pair, times the estimated probability we end up in that pair from this point in the draft. The resulting behavior would then tell us what adjustments we’d need to keep average pick numbers in line with the humans.
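
The data-gathering side of that is straightforward once the ‘clearly committed’ filter exists upstream; a sketch, with all names being placeholders:

```python
from collections import defaultdict

def conditional_pick_numbers(committed_picks):
    """
    committed_picks: iterable of (committed_pair, card_name, pick_number) drawn
    only from human drafts that were clearly locked into one two-color pair
    (the filtering is assumed to happen upstream). Returns
    pair -> card -> average pick number, readable as "how good is this card
    conditional on that archetype", e.g. the white card at 4.6 in WU versus
    6.3 in WB from the text.
    """
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for pair, card, pick_number in committed_picks:
        totals[pair][card] += pick_number
        counts[pair][card] += 1
    return {pair: {card: totals[pair][card] / counts[pair][card]
                   for card in totals[pair]}
            for pair in totals}
```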

For the quality of color pairs, we can look at frequency of them being drafted by humans that have strong overall records, and we can look at the records of drafts containing those colors. Results have statistical issues, since the decision to draft an unusual deck could tend to happen when conditions are unusually good (e.g. I only draft this color when I open a bomb) or unusually bad (nothing else was available), so we should expect some distortions that will be hard to properly correct for. Given that the system is trying to correct card valuations on a card by card basis, the system could also get caught in a loop where it can’t satisfy all its masters at once, because humans are not telling a consistent story. We’re like that. Thus, I’d be inclined to do some sanity checking of the computer’s tendencies here and do manual adjustments where needed.

That brings up another important point. Magic Online and Magic Arena draft rates for different color combinations should ideally look similar at similar drafter skill levels, and have similar win rates. This not being true would be a sign that the bots are doing something wrong, and this being close enough to true is a sign we’re doing at least some things right.

There are lots of checks like this, and lots of ways to forcibly adjust the system. We should focus on the ones that are primary to player experience, and ensure those work even if the bots have to do things in quirky ways in other places to make that happen. In particular, players should be drafting about the same color combinations with about the same win rates for them, and with about the same concentration of duplicate cards, and about the same frequency of being passed the best cards in the set. And the bots should pass rares about as often as humans do on Arena.

That doesn’t solve the problem of what happens if a color pair has two or more distinct strategies that it supports. If you knew this was the case, and could make a rule for recognizing which was which, you could have the computer treat them as distinct color pair options, and everything would work properly with enough work. But asking the computer to figure this out on its own seems like quite the rough ask, especially if some of them are rare, such as relying on getting particular uncommon cards like High Alert or Time of Heroes.

We could try to apply machine learning to the contents of successful decks to see if such patterns emerge on their own. And if we have enough resources to try, it’s worth trying. But I’m guessing that for practical purposes, when there are multiple major distinct things going on, a human is going to have to prompt the system to know this. The whole proposal here is about making something human-readable so it can be adjusted to serve our needs and create meaningful but realistic feeling variations and bot personalities, as opposed to a machine learning mess that might create better draft decks but which wouldn’t (I believe) serve our needs better.

Problem 4: Drafting a Cohesive Deck

Magic has many levels. If you draft the good cards in two colors, you’ll have a deck of playable cards most of the time. If you then take the best 23 cards and pair them with 17 lands (adjusting those counts a little when mana costs are especially low or high), with the lands chosen weighted by maximizing the total value of your spells after discounting, rather than current land defaults, those decks would likely play fine at Friday Night Magic.
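
A crude sketch of that deck builder, ignoring splashes, gold cards and curve, and reusing the same toy castability discount as the earlier bootstrap evaluator:

```python
from itertools import combinations

def build_deck(pool, spells=23, lands=17):
    """
    Take the highest-value spells in the best two colors, then split the lands
    between those colors in whatever proportion maximizes total discounted
    value. Cards are (value, color) with single colors only; everything here
    is an illustrative simplification.
    """
    def discounted(cards, land_split, total_lands):
        return sum(value * min(1.0, land_split.get(color, 0) / total_lands + 0.35)
                   for value, color in cards)

    best = None
    for pair in combinations("WUBRG", 2):
        on_color = sorted((card for card in pool if card[1] in pair), reverse=True)[:spells]
        for first in range(lands + 1):
            split = {pair[0]: first, pair[1]: lands - first}
            value = discounted(on_color, split, lands)
            if best is None or value > best[0]:
                best = (value, on_color, split)
    return best

# Tiny example pool; real use would pass the full draft pool with the 23/17 defaults.
pool = [(3.5, "R"), (3.0, "R"), (2.8, "G"), (2.5, "G"), (2.0, "R"), (1.8, "G"), (3.7, "U")]
print(build_deck(pool, spells=5, lands=8))
```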

We still have the capability to do a lot better. By doing so, we let subtle things in the draft matter in ways that will make drafting a better and richer experience, even if no one ever looks at the bot decks, or uses the bot decks for anything. We’d like to know that if we cut someone off of creatures, they’ll need to prioritize them later in the draft, or if they are glutting on four drops they’ll start passing us good ones, or if we give them a bunch of tribal lords they’ll focus more on tribe members.

What are some low-hanging fruit we can pick to 80/20 these improvements, and get the bots to where their decisions are at least reasonable?

I think we can safely think like humans, and create a few weighting categories to implement, and that will get us to the 80/20 spot. Beyond that, this is a spot where machine learning would be at its best, and be the least disruptive to making the overall package human readable, so I think that’s what you’d look to if you wanted to go beyond that.

We start with a model that estimates the quality of each deck by adding up the quality of its cards, discounted by how hard they are to cast, and compares that estimate to actual results. That’s our baseline, and we look for how much various factors cause humans to underperform or overperform, to figure out what weights to attach.

First, I’d assume pairwise synergies were real. So if there was an Elf lord, you’d expect players who had him to do better with more elves than you’d expect from their decks otherwise. Sometimes these would be subtle, including things humans might miss.

Second, I’d look at mana curve. We would assume that having a smooth curve has some value, both for creature-like things and for things in general. We would assume that one did not want to generally be too slow, although being super fast can go either way (sometimes it’s just you running out of power, sometimes you win a lot of your games on turn five).

Third, I’d look at counts for various types of effects, especially creatures and removal, but also card draw slash filtering, counters, and mana acceleration.

In some cases where the synergies were obvious (e.g. proliferate in War of the Spark) we’d check to see if the first check was good enough to cover this, or if we need to do something manually. My guess is that pairwise is mostly good enough to 80/20 this, but with exceptions, and with benefits for doing something more complex.
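
Putting the baseline and the three factors above together, a sketch of the deck quality model (with placeholder card fields, tags and weights that we would actually fit against human match results) might look like:

```python
def deck_quality(cards, synergy_pairs, weights):
    """
    Baseline plus adjustments. cards: list of dicts like
    {"name": ..., "value": 2.5, "cast_ease": 0.9, "cmc": 3,
     "is_creature": True, "is_removal": False, "tags": {"elf"}}.
    synergy_pairs: frozenset of two tags -> bonus, e.g. {"elf_lord", "elf"}.
    weights: factor weights fit from human results. All fields and numbers
    below are illustrative placeholders.
    """
    base = sum(c["value"] * c["cast_ease"] for c in cards)

    # (1) Pairwise synergies: credit every co-occurring tag pair.
    synergy = 0.0
    for i, a in enumerate(cards):
        for b in cards[i + 1:]:
            for tag_a in a["tags"]:
                for tag_b in b["tags"]:
                    synergy += synergy_pairs.get(frozenset((tag_a, tag_b)), 0.0)

    # (2) Mana curve: penalize piling up too many creatures at one cost.
    curve = {}
    for c in cards:
        if c["is_creature"]:
            curve[c["cmc"]] = curve.get(c["cmc"], 0) + 1
    curve_penalty = sum(max(0, n - 4) for n in curve.values())

    # (3) Effect counts: penalize being short on creatures or removal.
    creatures = sum(c["is_creature"] for c in cards)
    removal = sum(c["is_removal"] for c in cards)
    count_penalty = max(0, 14 - creatures) + max(0, 3 - removal)

    return (base
            + weights["synergy"] * synergy
            - weights["curve"] * curve_penalty
            - weights["counts"] * count_penalty)
```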

Problem 5: Mimic versus Optimize

There are two things one can seek to centrally optimize for at almost every step of this process.

One option is to mimic human behavior. We’d seek to make the same decisions and have the same statistical properties, on average, as humans of the right reference class making those decisions. We learn from humans what their revealed preferences are, and match them.

The other option is to maximize results (or some combination of results and rarity/​value). We form a system for predicting how decks will do based on the cards they contain, and we maximize the quality of the drafted deck as much as possible. We could look at overall draft deck results, or even drill in deep (in a way that I don’t suggest doing here, it seems far too complex and hard for a first step) and ask whether drawing a particular card in games is helpful. If humans fail to realize that a card is awesome, the bots should pick up on this and prioritize it. If humans consistently first pick a card, but the card is bad and doesn’t help you win, the bots should let the humans have it.

Balancing these two approaches will be one of the keys to success. If we focus too much on mimicry without a good reason why, the resulting decks won’t be good and we’ll be duplicating the biases and errors we see, often compounding them in order to make things come out even. If we focus too much on optimization, it risks being invisible to players, looking like it’s reaching dumb conclusions and making terrible decisions, or reaching actually dumb conclusions and making actually terrible decisions. It could backfire quite badly. Thus, in most places, I emphasize mimicry and trusting the humans, and only don’t do this in the synergy department because I couldn’t find a clean way to do so there that seemed like it would work and be reasonable to implement.

Conclusion

The problem of how to draft as a human is deep enough to keep us engaged for a long time. So is the problem of programming a bot to do so, no matter which goals one prioritizes. I hope this points us in the direction of how to get a reasonable first attempt, so we can start incorporating some of the most important principles of good drafting, allowing reasonable self-play and a better drafting experience for humans. Doubtless better coders than I would be required to make it happen, and would point to many of these things and say ‘that’s way harder than you think it is’ or ‘that won’t work.’ Others will prove difficult or unworkable in practice. But the first step is making a real effort, iterating until we can get something that is at least trying, and which is designed to allow growth as it learns more.

From there, the sky’s the limit.