An analysis of the Less Wrong D&D.Sci 4th Edition game

This is an analysis of the game described in this post, so you might want to read that first. Or if you don’t want to, here’s a quick summary: there’s a team-based competitive video game, the Green team has a game coming up against the Blue team, and we’re supposed to advise the Green team on which team composition to use. There are 19 characters, and each team is composed of 5 of them. We know which composition the Blue team is going to use, so we want to find the Green team best suited to this known opponent. To this end we have data on 65K historical games. The data is just the character picks for each team and the game result.

I trained a predictor on most of the games and used the rest to validate. The predictor has an out-of-sample AUC of 0.79, which is not fantastic, but if these are really supposed to be games played by people, we can hardly hope that team composition alone would predict the winner with anything close to perfect accuracy. The post describing the game gives a single Blue team composition to optimize against. Once I had the predictor, I generated all possible Green teams and evaluated how each would do against that composition. That gave me a win probability for every Green team against that one Blue composition.
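
If you just want the gist of the pipeline, here’s a simplified sketch in R. The column names (green1..green5, blue1..blue5, green_won), the 38-feature one-hot encoding, the hyperparameters, and the placeholder character IDs are all illustrative, not necessarily what my actual script does:

    library(xgboost)

    n_chars <- 19

    # One row per game: 38 binary features -- "is character c on the Green team?"
    # followed by "is character c on the Blue team?"
    encode_matchup <- function(green, blue) {
      x <- numeric(2 * n_chars)
      x[green] <- 1
      x[n_chars + blue] <- 1
      x
    }

    # `games` is assumed to hold character IDs 1-19 in green1..green5 and
    # blue1..blue5, plus a 0/1 outcome column green_won
    X <- t(apply(games, 1, function(g)
      encode_matchup(g[paste0("green", 1:5)], g[paste0("blue", 1:5)])))
    y <- games$green_won

    # Hold out some games for validation, fit a boosted classifier on the rest
    set.seed(1)
    val   <- sample(nrow(X), 10000)
    model <- xgboost(data = X[-val, ], label = y[-val],
                     nrounds = 200, max_depth = 6, eta = 0.1,
                     objective = "binary:logistic", verbose = 0)

    pred_val <- predict(model, X[val, ])
    pROC::auc(y[val], pred_val)        # out-of-sample AUC

    # Score every possible Green comp against the known Blue comp
    blue_comp <- c(4, 7, 13, 14, 16)   # placeholder IDs for Dire Druid, Greenery Giant,
                                       # Phoenix Paladin, Quartz Questant, Tidehollow Tyrant
    all_green <- t(combn(n_chars, 5))  # 11,628 x 5 matrix of candidate Green comps
    X_green   <- t(apply(all_green, 1, encode_matchup, blue = blue_comp))
    green_win_prob <- predict(model, X_green)

    head(all_green[order(-green_win_prob), ])   # the comps the model likes best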

The predictor has some calibration problems:

Although the model has OK calibration, it’s pretty underconfident at the low end, and pretty overconfident at the high end. But I’m sick of dinking with the model, so let’s push on, pretend it’s perfectly calibrated, and talk about its predictions as win probabilities.
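
(For reference, the calibration check amounts to binning the held-out predictions and comparing each bin’s average prediction to its empirical win rate. A sketch, reusing pred_val, y, and val from the snippet above:)

    # Bin validation predictions; compare mean prediction vs. observed win rate per bin
    bins  <- cut(pred_val, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
    calib <- data.frame(predicted = tapply(pred_val, bins, mean),
                        observed  = tapply(y[val],   bins, mean),
                        n_games   = as.vector(table(bins)))
    calib

    plot(calib$predicted, calib$observed, xlim = c(0, 1), ylim = c(0, 1),
         xlab = "mean predicted win probability", ylab = "observed win rate")
    abline(0, 1, lty = 2)   # where a perfectly calibrated model would sit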

Blue has a good team

So, first, Blue has quite the team. There are 11,628 (19-choose-5) possible team comps for Green, but against the Blue comp from the original post, fewer than 700 of these have a better-than-even win probability:

This is really surprising. More on this later.
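
(Counting the favorable comps is just a threshold on the scores from the sketch earlier:)

    # How many of the 11,628 candidate Green comps does the model favor against this Blue comp?
    sum(green_win_prob > 0.5)
    mean(green_win_prob > 0.5)   # the same count as a fraction of all comps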

How Green can beat Blue anyway

So here’s my recommendation to the Green team: The team with the highest probability of winning (against the Blue team Dire Druid, Greenery Giant, Phoenix Paladin, Quartz Questant, Tidehollow Tyrant) is Arch-Alligator, Greenery Giant, Landslide Lord, Nullifying Nightmare, and Phoenix Paladin. The predicted win probability for that Green team is around 75%. (Considering the calibration plot above, maybe we should move this to 78%.)

Let’s see what some winning Green compositions look like. There are 132 teams with a predicted win probability of 60% or greater (60% chosen arbitrarily). Here’s how often each character shows up in these top teams:

Since there are 19 characters and 5 per team, each character shows up in 5/19 ≈ 26% of all possible compositions. But Arch-Alligator shows up in 87% of these winning compositions, over 3x as often as if we picked teams randomly. So Arch-Alligator and Landslide Lord are very good against this Blue composition for some reason, Volcano Villain and onward are bad against it, and the other characters are of varying effectiveness.
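
(For reference, this tally falls straight out of the earlier sketch; characters are indexed by ID rather than name here:)

    # Appearance rate of each character among comps with >= 60% predicted win probability
    top       <- all_green[green_win_prob >= 0.60, , drop = FALSE]
    char_freq <- tabulate(as.vector(top), nbins = n_chars) / nrow(top)
    baseline  <- 5 / n_chars             # ~26%: the rate under random team picks
    round(rbind(freq = char_freq, vs_random = char_freq / baseline), 2)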

How well do individual heroes do against each other?

Up till now we’ve been looking at how well different compositions do against the one specific Blue composition that we think the team we’re playing will choose. But now forget them for a moment, and instead, let’s look at this video game as a whole: how well do the different heroes do against each other?

I looked at the win probabilities for all kinds of different matchups, pitting 1,400 random Green teams against 1,400 different random Blue teams, for a total of 1,400^2 ≈ 2 million games. Then, for each character on the Green side versus each character on the Blue side, I took the average win probability over the games in which that pairing appeared. For example, to evaluate the matchup of Nullifying Nightmare versus Tidehollow Tyrant, I took the mean win probability of the ~140,000 games between them in my generated set of 2 million; in those 140K games, the rest of each team was random.
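
Sketched with the model and encoding helper from earlier (the averaging scheme is my reconstruction of the procedure just described, and the loop is unoptimized, so shrink n_sample if it’s too slow):

    set.seed(2)
    n_sample     <- 1400
    green_sample <- t(replicate(n_sample, sample(n_chars, 5)))
    blue_sample  <- t(replicate(n_sample, sample(n_chars, 5)))

    win_sum <- matrix(0, n_chars, n_chars)   # indexed [Green character, Blue character]
    n_games <- matrix(0, n_chars, n_chars)

    for (i in seq_len(n_sample)) {
      g  <- green_sample[i, ]
      Xb <- t(apply(blue_sample, 1, function(b) encode_matchup(g, b)))
      p  <- predict(model, Xb)               # Green's win prob vs. each sampled Blue team
      for (j in seq_len(n_sample)) {
        b <- blue_sample[j, ]
        win_sum[g, b] <- win_sum[g, b] + p[j]
        n_games[g, b] <- n_games[g, b] + 1
      }
    }

    matchup <- win_sum / n_games   # matchup[a, b] = mean P(Green wins) with a on Green, b on Blue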

Nullifying Nightmare is super OP. Some other characters are overpowered too: Blaze Boy, Greenery Giant, Tidehollow Tyrant, and (kinda) Warrior of Winter all have favorable matchups against most other heroes. (For everyone except Nightmare, it can be easier to see this by looking at the rows instead of the columns: Blaze Boy’s row [near the bottom] is mostly blue, which says that Green’s win probability is below 50% for most of those matchups; in other words, Blaze Boy on the Blue side beats most Green characters.)

The opponent Blue team has two of the four overpowered characters

In light of the heatmap, maybe we can start to explain why the Blue composition from the previous section was favorable against most teams: it has Greenery Giant and Tidehollow Tyrant, two of the four overpowered individual picks. What if we made teams from just the four most overpowered characters, plus whoever else in the 5th position? Turns out most of these teams would rock:

The Blue team from the post has only two of these OP (overpowered) characters, yet about the same predicted win percentage as the comps stacked with all four, which I don’t really understand. Still, a good strategy is stuffing your team with heroes from these top four, and Blue is halfway there.
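
One way to sanity-check the stuffing idea with the surrogate model is to take the four overpowered characters plus each possible fifth pick and score each of those 15 comps by its mean predicted win probability against random opponents. That metric is my choice, and the IDs are placeholders again; a sketch reusing the earlier helpers:

    op    <- c(2, 7, 11, 16)                # placeholder IDs: Blaze Boy, Greenery Giant,
                                            # Nullifying Nightmare, Tidehollow Tyrant
    fifth <- setdiff(seq_len(n_chars), op)  # the 15 possible fifth picks

    set.seed(3)
    opponents <- t(replicate(500, sample(n_chars, 5)))   # random opposing comps

    mean_win <- sapply(fifth, function(f) {
      comp <- c(op, f)
      Xo   <- t(apply(opponents, 1, function(b) encode_matchup(comp, b)))
      mean(predict(model, Xo))              # scored from the Green side
    })

    data.frame(fifth_pick = fifth, mean_win_prob = round(mean_win, 3))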

Arch-Alligator’s effectiveness against the opponent’s Blue team is confusing

From the dominance of Arch-Alligator in the “How Green can beat Blue anyway” section, I thought I’d find that he was OP, but he’s actually right near 50% for most matchups. Here’s something else weird: Alligator is not very good against any of the Blue team’s characters individually! To see this, look at Arch-Alligator’s column in the heatmap and find the rows for the Blue team from the post [reminder: that Blue team is Dire, Greenery, Phoenix, Quartz, and Tidehollow]. Most of those five tiles are light blue, indicating a less-than-50% win rate in the individual matchups, and Alligator is an especially bad pick against Greenery Giant. Being a bad or neutral matchup against each individual Blue team member, yet great against the overall composition, suggests there are some team synergies going on that the individual-matchups view is missing (or that I’ve made some error). I can think of a few things that might be going on:

  • Maybe Alligator and one or more of the other members of the team (Greenery Giant, Landslide Lord, Nullifying Nightmare, Phoenix Paladin) work well together in general.

  • Maybe Alligator and one or more other team members work well together against this particular Blue composition.

  • Maybe Alligator doesn’t work particularly well with other members of the Green team, but is a counter for some combination of the Blue team members. For example, maybe two Blue team members can pair up and charge for 5 seconds in order to unleash some powerful ability, and Alligator is good at interrupting this pairing.

I have some ideas on how to look into this, but it feels hard, I’ve been writing this post all afternoon, and aphyer is posting the answers two days from now, so I probably won’t get to it!
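
For what it’s worth, one cheap probe (a sketch only, reusing the earlier helpers and placeholder IDs; I haven’t run it) would be to hold the other four members of the recommended team fixed, swap Arch-Alligator for every alternative fifth pick, and compare the predicted win probability against this specific Blue comp with the average against random Blue comps. If Alligator stands out in the first column but not the second, his value is opponent-specific rather than a general synergy with his teammates:

    core  <- c(7, 10, 11, 13)                 # placeholder IDs: Greenery Giant, Landslide Lord,
                                              # Nullifying Nightmare, Phoenix Paladin
    picks <- setdiff(seq_len(n_chars), core)  # candidate fifth picks (Arch-Alligator is the baseline)

    # Each variant team's predicted win probability against the post's Blue comp...
    vs_this_blue <- sapply(picks, function(s)
      predict(model, t(encode_matchup(c(core, s), blue_comp))))

    # ...and on average against random Blue comps
    set.seed(4)
    random_blues <- t(replicate(500, sample(n_chars, 5)))
    vs_random <- sapply(picks, function(s)
      mean(predict(model, t(apply(random_blues, 1, function(b)
        encode_matchup(c(core, s), b))))))

    round(data.frame(fifth_pick = picks, vs_this_blue = vs_this_blue, vs_random = vs_random), 3)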

Notes on this analysis

I used a style of analysis called surrogate analysis here. Instead of interrogating the data directly, I trained a model on the data and interrogated the model instead. The downside is that if the model is bad, the analysis will lead you into strange places that don’t correspond to reality. The upside is that you can ask the model questions you couldn’t ask the data directly. For example, there are 65K matchups in the dataset here, but the total number of possible matchups is 19-choose-5 squared, which is about 135 million. The surrogate model can give answers for all of those games! If the team compositions not present in the data behave totally differently from the ones that are, the model will give bad answers for them, but if the interactions and patterns in the data generalize, the surrogate can be really helpful. I think this book covers surrogate modeling well, though I’ve only read a few chapters.
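
(Those counts are just:)

    choose(19, 5)     # 11,628 possible comps per side
    choose(19, 5)^2   # 135,210,384 possible matchups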

I used the gradient-boosting library XGBoost because it runs fast and I’ve had good experiences with it. Any booster would probably be fine, as would a random forest, but random forests in R are comparatively slow, and I didn’t want to wait around too long while re-training the model as I iterated.

Here’s my (messy) code.