The unexpected difficulty of comparing AlphaStar to humans


This is crossposted from the AI Impacts blog.

Artificial intelligence defeated a pair of professional Starcraft II players for the first time in December 2018. Although this was generally regarded as an impressive achievement, it quickly became clear that not everybody was satisfied with how the AI agent, called AlphaStar, interacted with the game, or how its creator, DeepMind, presented it. Many observers complained that, in spite of DeepMind's claims that it performed at similar speeds to humans, AlphaStar was able to control the game with greater speed and accuracy than any human, and that this was the reason why it prevailed.

Although I think this story is mostly correct, it is harder than it looks to compare AlphaStar's interaction with the game to that of humans, and to determine to what extent this mattered for the outcome of the matches. Merely comparing raw numbers for actions taken per minute (the usual metric for a player's speed) does not tell the whole story, and appropriately taking into account mouse accuracy, the differences between combat actions and non-combat actions, and the control of the game's "camera" turns out to be quite difficult.

Here, I begin with an overview of Starcraft II as a platform for AI research, a timeline of events leading up to AlphaStar's success, and a brief description of how AlphaStar works. Next, I explain why measuring performance in Starcraft II is hard, show some analysis on the speed of both human and AI players, and offer some preliminary conclusions on how AlphaStar's speed compares to humans. After this, I discuss the differences in how humans and AlphaStar "see" the game and the impact this has on performance. Finally, I give an update on DeepMind's current experiments with Starcraft II and explain why I expect we will encounter similar difficulties when comparing human and AI performance in the future.

Why Starcraft is a Target for AI Research

Starcraft II has been a target for AI for several years, and some readers will recall that Starcraft II appeared on our 2016 expert survey. But there are many games and many AIs that play them, so it may not be obvious why Starcraft II is a target for research or why it is of interest to those of us who are trying to understand what is happening with AI.

For the most part, Starcraft II was chosen because it is popular and it is difficult for AI. Starcraft II is a real-time strategy game, and like similar games, it requires a variety of tasks: harvesting resources, constructing bases, researching technology, building armies, and attempting to destroy the opponent's base are all part of the game. Playing it well requires balancing attention between many things at once: planning ahead, ensuring that one's units1 are good counters for the enemy's units, predicting opponents' moves, and changing plans in response to new information. There are other aspects that make it difficult for AI in particular: it has imperfect information2, an extremely large action space, and takes place in real time. When humans play, they engage in long-term planning, make the best use of their limited capacity for attention, and craft ploys to deceive the other players.

The game’s pop­u­lar­ity is im­por­tant be­cause it makes it a good source of ex­tremely high hu­man tal­ent and in­creases the num­ber of peo­ple that will in­tu­itively un­der­stand how difficult the task is for a com­puter. Ad­di­tion­ally, as a game that is de­signed to be suit­able for high-level com­pe­ti­tion, the game is care­fully bal­anced so that com­pe­ti­tion is fair, does not fa­vor just one strat­egy3, and does not rely too heav­ily on luck.

Timeline of Events

To put AlphaStar's performance in context, it helps to understand the timeline of events over the past few years:

November 2016: Blizzard and DeepMind announce they are launching a new project in Starcraft II AI

August 2017: DeepMind releases the Starcraft II API, a set of tools for interfacing AI with the game

March 2018: Oriol Vinyals gives an update, saying they're making progress, but he doesn't know if their agent will be able to beat the best human players

November 3, 2018: Oriol Vinyals gives another update at a Blizzcon panel, and shares a sequence of videos demonstrating AlphaStar's progress in learning the game, including learning to win against the hardest built-in AI. When asked if they could play against it that day, he says "For us, it's still a bit early in the research."

December 12, 2018: AlphaStar wins five straight matches against TLO, a professional Starcraft II player, who was playing as Protoss4, which is off-race for him. DeepMind keeps the matches secret.

December 19, 2018: AlphaStar, given an additional week of training time5, wins five consecutive Protoss vs Protoss matches against MaNa, a pro Starcraft II player who is higher ranked than TLO and specializes in Protoss. DeepMind continues to keep the victories a secret.

January 24, 2019: DeepMind announces the successful test matches vs TLO and MaNa in a live video feed. MaNa plays a live match against a version of AlphaStar which had more constraints on how it "saw" the map, forcing it to interact with the game in a way more similar to humans6. AlphaStar loses when MaNa finds a way to exploit a blatant failure of the AI to manage its units sensibly. The replays of all the matches are released, and people start arguing7 about how (un)fair the matches were, whether AlphaStar is any good at making decisions, and how honest DeepMind was in presenting the results of the matches.

July 10, 2019: DeepMind and Blizzard announce that they will allow an experimental version of AlphaStar to play on the European ladder8, for players who opt in. The agent will play anonymously, so that most players will not know that they are playing against a computer. Over the following weeks, players attempt to discern whether they played against the agent, and some post replays of matches in which they believe they were matched with the agent.

How AlphaStar works

The best place to learn about AlphaStar is from DeepMind's page about it. There are a few particular aspects of the AI that are worth keeping in mind:

It does not interact with the game like a human does: Humans interact with the game by looking at a screen, listening through headphones or speakers, and giving commands through a mouse and keyboard. AlphaStar is given a list of units or buildings and their attributes, which includes things like their location, how much damage they've taken, and which actions they're able to take, and it gives commands directly, using coordinates and unit identifiers. For most of the matches, it had access to information about anything that wouldn't normally be hidden from a human player, without needing to control a "camera" that focuses on only one part of the map at a time. For the final match, it had a camera restriction similar to humans, though it still was not given screen pixels as input. Because it gives commands directly through the game, it does not need to use a mouse accurately or worry about tapping the wrong key by accident. (A rough sketch of what this kind of structured interface might look like follows this list.)

It is trained first by watching human matches, and then through self-play: The neural network is trained first on a large database of matches between humans, and then by playing against versions of itself.

It is a set of agents selected from a tournament: Hundreds of versions of the AI play against each other, and the ones that perform best are selected to play against human players. Each one has its own set of units that it is incentivized to use via reinforcement learning, so that they each play with different strategies. TLO and MaNa played against a total of 11 agents, all of which were selected from the same tournament, except the last one, which had been substantially modified. The agents that defeated MaNa had each played for hundreds of years in the virtual tournament9.
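
To make the first point more concrete, here is a rough sketch, in Python, of the difference between "screen pixels plus a mouse" and the kind of structured observations and commands described above. The class, field names, and functions are invented for illustration; they are not DeepMind's code or the actual Starcraft II API.

```python
# A hypothetical illustration of a structured observation/command interface.
# Names and fields are made up for clarity.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class UnitObservation:
    unit_id: int                    # a stable identifier the agent can target directly
    unit_type: str                  # e.g. "Probe" or "Nexus"
    position: Tuple[float, float]   # map coordinates, not screen pixels
    health: float                   # how much damage it has taken
    available_actions: List[str]    # which commands it can currently accept

def issue_command(unit_ids: List[int], action: str,
                  target: Tuple[float, float]) -> None:
    """Give an order directly by unit id and map coordinate.

    There is no mouse to mis-click and no hotkey to fat-finger: the command
    reaches exactly the units and location specified.
    """
    pass

# e.g. one observed unit, and an order issued to it by id rather than by clicking:
probe = UnitObservation(unit_id=101, unit_type="Probe", position=(45.0, 62.5),
                        health=20.0, available_actions=["move", "gather", "attack"])
issue_command([probe.unit_id], "attack", (50.0, 60.0))
```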

January/February Impressions Survey

Before deciding to focus my investigation on a comparison between human and AI performance in Starcraft II, I conducted an informal survey with my Facebook friends, my colleagues at AI Impacts, and a few people from an effective altruism Facebook group. I wanted to know what they were thinking about the matches in general, with an emphasis on which factors most contributed to the outcome of the matches. I've put details about my analysis and the full results of the survey in the appendix at the end of this article, but I'll summarize a few major results here.

Forecasts

The timing and nature of AlphaStar's success seems to have been mostly in line with people's expectations, at least at the time of the announcement. Some respondents did not expect to see it for a year or two, but on average, AlphaStar arrived less than a year earlier than expected. It is probable that some respondents had been expecting it to take longer, but updated their predictions in 2016 after finding out that DeepMind was working on it. As for future expectations, a majority of respondents expect to see an agent (not necessarily AlphaStar) that can beat the best humans without any of the current caveats within two years. In general, I do not think that I worded the forecasting questions carefully enough to infer very much from the answers given by survey respondents.

Some readers may be wondering how these survey results compare to those of our more careful 2016 survey, or how we should view the earlier survey results in light of MaNa and TLO's defeat at the hands of AlphaStar. The 2016 survey specified an agent that only receives a video of the screen, so that prediction has not yet resolved. But the median respondent assigned a 50% probability to seeing such an agent that can defeat the top human players at least 50% of the time by 202110. I don't personally know how hard it is to add in that capability, but my impression from speaking to people with greater machine learning expertise than mine is that this is not out of reach, so these predictions still seem reasonable, and are not generally in disagreement with the results from my informal survey.

Speed

Nearly everyone thought that AlphaStar was able to give commands faster and more accurately than humans, and that this advantage was an important factor in the outcome of the matches. I looked into this in more detail, and wrote about it in the next section.

Camera

As I mentioned in the description of AlphaStar, it does not see the game the same way that humans do. Its visual field covered the entire map, though its vision was still affected by the usual fog of war11. Survey respondents ranked this as an important factor in the outcome of the matches.

Given these results, I decided to look into the speed and camera issues in more detail.

The Speed Controversy

Starcraft is a game that rewards the ability to micromanage many things at once and give many commands in a short period of time. Players must simultaneously build their bases, manage resource collection, scout the map, research better technology, build individual units to create an army, and fight battles against other players. The combat is sufficiently fine grained that a player who is outnumbered or outgunned can often come out ahead by exerting better control over the units that make up their military forces, both on a group level and an individual level. For years, there have been simple Starcraft II bots that, although they cannot win a match against a highly-skilled human player, can do amazing things that humans can't do, by controlling dozens of units individually during combat. In practice, human players are limited by how many actions they can take in a given amount of time, usually measured in actions per minute (APM). Although DeepMind imposed restrictions on how quickly AlphaStar could react to the game and how many actions it could take in a given amount of time, many people believe that the agent was sometimes able to act with superhuman speed and precision.
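
Since the rest of this section leans on APM measured in 5-second bins, here is a minimal sketch of how that number can be computed from a list of action timestamps pulled out of a replay. The timestamps below are made up for illustration; a real analysis would extract them with a replay-parsing tool.

```python
# Minimal sketch: compute APM per 5-second bin from action timestamps (in seconds).
from collections import Counter

def apm_in_bins(action_times, bin_seconds=5):
    """Return {bin_start_second: APM}, where APM = actions in bin * (60 / bin size)."""
    counts = Counter(int(t // bin_seconds) * bin_seconds for t in action_times)
    return {start: count * (60 / bin_seconds) for start, count in sorted(counts.items())}

# Toy example: 20 actions spread evenly over 10 seconds -> 120 APM in each bin.
example_times = [i * 0.5 for i in range(20)]
bins = apm_in_bins(example_times)
print(bins)                           # {0: 120.0, 5: 120.0}
print(max(bins.values()))             # peak 5-second APM
print(len(example_times) * 60 / 10)   # whole-window average APM: 120.0
```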

Here is a graph12 of the APM for MaNa (red) and AlphaStar (blue), through the second match, with five-second bins:

Actions per minute for MaNa (red) and AlphaStar (blue) in their second game. The horizontal axis is time, and the vertical axis is 5-second average APM.

At first glance, this looks reasonably even. AlphaStar has both a lower average APM (180 vs MaNa's 270) for the whole match, and a lower peak 5-second APM (495 vs MaNa's 615). This seems consistent with DeepMind's claim that AlphaStar was restricted to human-level speed. But a more detailed look at which actions are actually taken during these peaks reveals some crucial differences. Here's a sample of actions taken by each player during their peaks:

Lists of commands for MaNa and AlphaStar during each player's peak APM for game 2

MaNa hit his APM peaks early in the game by using hotkeys to twitchily switch back and forth between control groups13 for his workers and the main building in his base. I don't know why he's doing this: maybe to warm up his fingers (which apparently is a thing), as a way to watch two things at once, to keep himself occupied during the slow parts of the early game, or some other reason understood only by the kinds of people that can produce Starcraft commands faster than I can type. But it drives up his peak APM, and probably is not very important to how the game unfolds14. Here's what MaNa's peak APM looked like at the beginning of Game 2 (if you look at the bottom of the screen, you can see that his selection switches back and forth between his workers and the building that he uses to make more workers):

MaNa’s play dur­ing his peak APM for match 2. Most of his ac­tions con­sist of switch­ing be­tween con­trol groups with­out giv­ing new com­mands to any units or build­ings

AlphaStar hit peak APM in combat. The agent seems to reserve a substantial portion of its limited actions budget until the critical moment when it can cash them in to eliminate enemy forces and gain an advantage. Here's what that looked like near the end of game 2, when it won the engagement that probably won it the match (while still taking a few actions back at its base to keep its production going):

AlphaStar's play during its peak APM in match 2. Most of its actions are related to combat, and require precise timing.

It may be hard to see what exactly is happening here for people who have not played the game. AlphaStar (blue) is using extremely fine-grained control of its units to defeat MaNa's army (red) in an efficient way. This involves several different actions: commanding units to move to different locations so they can make their way into MaNa's base while keeping them bunched up and avoiding spots that make them vulnerable, focusing fire on MaNa's units to eliminate the most vulnerable ones first, using special abilities to lift MaNa's units off the ground and disable them, and redirecting units to attack MaNa's workers once a majority of MaNa's military units are taken care of.

Given these differences between how MaNa and AlphaStar play, it seems clear that we can't just use raw match-wide APM to compare the two, which most people paying attention seem to have noticed fairly quickly after the matches. The more difficult question is whether AlphaStar won primarily by playing with a level of speed and accuracy that humans are incapable of, or by playing better in other ways. Though, based on the analysis I am about to present, I think the answer is probably that AlphaStar won through speed, I also think the question is harder to answer definitively than many critics of DeepMind are making it out to be.

A very fast human can average well over 300 APM for several minutes, with 5-second bursts at over 600 APM. Although these bursts are not always throwaway commands like those from the MaNa vs AlphaStar matches, they tend not to be commands that require highly accurate clicking, or rapid movement across the map. Take, for example, this 10-second, 600 APM peak from current top player Serral:

Serral's play during a 10-second, 600 APM peak

Here, Serral has just finished focusing on a pair of battles with the other player, and is taking care of business in his base, while still picking up some pieces on the battlefield. It might not be obvious why he is issuing so many commands during this time, so let's look at the list of commands:

The lines that say "Morph to Hydralisk" and "Morph to Roach" represent a series of repeats of that command. For a human player, this is a matter of pressing the same hotkey many times, or even just holding down the key to give the command very rapidly15. You can see this in the gif by looking at the bottom center of the screen, where he selects a bunch of worm-looking things and turns them all into a bunch of egg-looking things (it happens very quickly, so it can be easy to miss).

What Serral is doing here is difficult, and the ability to do it only comes with years of practice. But the raw numbers don't tell the whole story. Taking 100 actions in 10 seconds is much easier when a third of those actions come from holding down a key for a few hundred milliseconds than when they each require a press of a different key or a precise mouse click. And this is without all the extraneous actions that humans often take (as we saw with MaNa).

Because it seems to be the case that peak human APM happens outside of combat, while AlphaStar's wins happened during combat APM peaks, we need to do a more detailed analysis to determine the highest APM a human player can achieve during combat. To try to answer this question, I looked at approximately ten APM peaks for each of the 5 games between AlphaStar and MaNa, as well as for each of another 15 replays between professional Starcraft II players. The peaks were chosen so that roughly half were the largest peaks at any time during the match and the rest were strictly during combat. My methodology for this is given in the appendix. Here are the results for just the human vs human matches:

Histogram of 5-second APM peaks from analyzed matches between human professional players in a tournament setting. The blue bars are peaks achieved outside of combat, while the red bars are those achieved during combat.

Provisionally, it looks like pro players frequently hit approximately 550 to 600 APM outside of combat before the distribution starts to fall off, and they peak at around 200-350 during combat, with a long right tail. As I was doing this, however, I found that the highest APM peaks all had one thing in common that the lower APM peaks did not: it was difficult to tell when a player's actions were primarily combat-oriented commands, and when they were mixed in with bursts of commands for things like training units. In particular, I found that the combat situations with high APM tended to be similar to the Serral gif above, in that they involved spam clicking and actions related to the player's economy and production, which was probably driving up the numbers. I give more details in the appendix, but I don't think I can say with confidence that any players were achieving greater than 400-450 APM in combat, in the absence of spurious actions or macromanagement commands.

The more pertinent question might be what the lowest APM is that a player can have while still succeeding at the highest level. Since we know that humans can succeed without exceeding this APM, it is not an unreasonable limitation to put on AlphaStar. The lowest peak APM in combat I saw for a winning player in my analysis was 215, though it could be that I missed a higher peak during combat in that same match.

Here is a histogram of AlphaStar's combat APM:

The smallest 5-second APM that AlphaStar needed to win a match against MaNa was just shy of 500. I found 14 cases in which the agent was able to average over 400 APM for 5 seconds in combat, and six times when the agent averaged over 500 APM for more than 5 seconds. This was done with perfect accuracy and no spam clicking or control group switching, so I think we can safely say that its play was faster than is required for a human to win a match in a professional tournament. Given that I found no cases where a human was clearly achieving this speed in combat, I think I can comfortably say that AlphaStar had a large enough speed advantage over MaNa to have substantially influenced the match.
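
For readers who want to check claims like "averaged over 500 APM for more than 5 seconds" against a replay themselves, here is a sketch of one way to look for sustained bursts: slide a 5-second window over the action timestamps and report stretches where the windowed APM stays above a threshold. This is an illustration of the idea, not the exact procedure I used for the numbers above.

```python
# Sketch: find stretches where a sliding 5-second window exceeds a target APM.
from bisect import bisect_left, bisect_right

def sustained_high_apm(action_times, threshold_apm=500, window=5.0, step=0.5):
    """Yield (start, end) time spans where each window starting in the span
    contains enough actions to reach threshold_apm."""
    times = sorted(action_times)
    needed = threshold_apm * window / 60   # actions required in one window
    span_start = None
    t = times[0]
    while t <= times[-1]:
        count = bisect_right(times, t + window) - bisect_left(times, t)
        if count >= needed:
            if span_start is None:
                span_start = t
        elif span_start is not None:
            yield (span_start, t + window)
            span_start = None
        t += step
    if span_start is not None:
        yield (span_start, times[-1] + window)

# Toy example: 60 actions in 6 seconds is roughly 600 APM, well above the threshold.
burst = [i * 0.1 for i in range(60)]
print(list(sustained_high_apm(burst)))   # prints one span covering the burst
```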

It’s easy to get lost in num­bers, so it’s good to take a step back and re­mind our­selves of the in­sane level of skill re­quired to play Star­craft II pro­fes­sion­ally. The top pro­fes­sional play­ers already play with what looks to me like su­per­hu­man speed, pre­ci­sion, and mul­ti­task­ing, so it is not sur­pris­ing that the agent that can beat them is so fast. Some ob­servers, es­pe­cially those in the Star­craft com­mu­nity, have in­di­cated that they will not be im­pressed un­til AI can beat hu­mans at Star­craft II at sub-hu­man APM. There is some ex­tent to which speed can make up for poor strat­egy and good strat­egy can make up for a lack of speed, but it is not clear what the limits are on this trade-off. It may be very difficult to make an agent that can beat pro­fes­sional Star­craft II play­ers while re­strict­ing its speed to an undis­put­edly hu­man or sub-hu­man level, or it may sim­ply be a mat­ter of a cou­ple more weeks of train­ing time.

The Camera

As I explained earlier, the agent interacts with the game differently than humans do. As with other games, humans look at a screen to know what's happening, use a mouse and keyboard to give commands, and need to move the game's 'camera' to see different parts of the play area. With the exception of the final exhibition match against MaNa, AlphaStar was able to see the entire map at once (though much of it is concealed by the fog of war most of the time), and had no need to select units to get information about them. It's unclear just how much of an advantage this was for the agent, but it seems likely that it was significant, if nothing else because it did not pay the APM overhead of looking around just to get information from the game. Furthermore, seeing the entire map makes it easier to simultaneously control units across the map, which AlphaStar used to great effect in the first five matches against MaNa.

For the exhibition match in January, DeepMind trained a version of AlphaStar that had similar camera control to human players. Although the agent still saw the game in a way that was abstracted from the screen pixels that humans see, it only had access to about one screen's worth of information at a time, and it needed to spend actions to look at different parts of the map. A further disadvantage was that this version of the agent only had half as much training time as the agents that beat MaNa.

Here are three factors that may have contributed to AlphaStar's loss:

  1. The agent was unable to deal effectively with the added complication of controlling the camera

  2. The agent had insufficient training time

  3. The agent had easily exploitable flaws the whole time, and MaNa figured out how to use them in match 6

By the third factor, I mean that the agent had sufficiently many exploitable flaws, obvious enough to human players, that any skilled human player could find at least one during a small number of games. The best humans do not have a sufficient number of such flaws to influence the game with any regularity. Matches in professional tournaments are not won by causing the other player to make the same obvious-to-humans mistake over and over again.

I suspect that AlphaStar's loss in January is mainly due to the first two factors. In support of 1, AlphaStar seemed less able to simultaneously deal with things happening on opposite sides of the map, and less willing to split its forces, which could plausibly be related to an inability to simultaneously look at distant parts of the map. It's not just that the agent had to move the camera to give commands on other parts of the map. The agent had to remember what was going on globally, rather than being able to see it all the time. In support of 2, the agent that MaNa defeated had only as much training time as the agents that went up against TLO, and those agents lost to the agents that defeated MaNa 94% of the time during training16.

Still, it is hard to dismiss the third factor. One way in which an agent can improve through training is to encounter tactics that it has not seen before, so that it can react well if it sees them in the future. But the tactics that it encounters are only those that another agent employed, and without seeing the agents during training, it is hard to know if any of them learned the harassment tactics that MaNa used in game 6. So it is hard to know if the agents that defeated MaNa were susceptible to the exploit that he used to defeat the last agent. So far, the evidence from DeepMind's more recent experiment pitting AlphaStar against the broader Starcraft community (which I will go into in the next section) suggests that the agents do not tend to learn defenses to these types of exploits, though it is hard to say if this is a general problem or just one associated with low training time or particular kinds of training data.

AlphaStar on the Ladder

For the past couple of months, as of this writing, skilled European players have had the opportunity to play against AlphaStar as part of the usual system for matching players with those of similar skill. For the version of AlphaStar that plays on the European ladder, DeepMind claims to have made changes that address the camera and action speed complaints from the January matches. The agent needs to control the camera, and they say they have placed restrictions on AlphaStar's performance in consultation with pro players, particularly on the maximum number of actions per minute and per second that the agent can take. I will be curious to see what numbers they arrive at for this. If this was done in an iterative way, such that pro players were allowed to see the agent play or to play against it, I expect they were able to arrive at a good constraint. Given the difficulty that I had with arriving at a good value for a combat APM restriction, I'm less confident that they would get a good value just by thinking about it, though if they were sufficiently conservative, they probably did alright.

Another reason to expect a realistic APM constraint is that DeepMind wanted to run the European ladder matches as a blind study, in which the human players did not know they were playing against an AI. If the agent were to play with the superhuman speed and accuracy that AlphaStar did in January, it would likely give it away and spoil the experiment.

Although it is unclear whether any players were able to tell they were playing against an AI during their match, it does seem that some were able to figure it out after the fact. One example comes from Lowko, a Dutch player who streams and does commentary for games. During a stream of a ladder match in Starcraft II, he noticed the player was doing some strange things near the end of the match, like lifting their buildings17 when the match had clearly been lost, and air-dropping workers into Lowko's base to kill units. Lowko did eventually win the match. Afterward, he was able to view the replay from the match and see that the player he had defeated did some very strange things throughout the entire match, the most notable of which was how the player controlled their units. The player used no control groups at all, which is, as far as I know, not something anybody does at high-level play18. There were many other quirks, which he describes in his entertaining video, which I highly recommend to anyone who is interested.

Other players have released replay files from matches against players they believed were AlphaStar, and they show the same lack of control groups. This is great, because it means we can get a sense of what the new APM restriction is on AlphaStar. There are now dozens of replay files from players who claim to have played against the AI. Although I have not done the level of analysis that I did with the matches in the APM section, it seems clear that they have drastically lowered the APM cap, with the matches I have looked at topping out at 380 APM peaks, which did not even occur in combat.

It seems to be the case that DeepMind has brought the agent's interaction with the game more in line with human capability, but we will probably need to wait until they release the details of the experiment before we can say for sure.

Another notable aspect of the matches that people are sharing is that their opponent will do strange things that human players, especially skilled human players, almost never do, most of which are detrimental to their success. For example, they will construct buildings that block them into their own base, crowd their units into a dangerous bottleneck to get to a cleverly-placed enemy unit, and fail to change tactics when their current strategy is not working. These are all the types of flaws that are well known to exist in game-playing AI going back to much older games, including the original Starcraft, and they are similar to the flaw that MaNa exploited to defeat AlphaStar in game 6.

All in all, the agents that humans are uncovering seem to be capable, but not superhuman. Early on, the accounts that were identified as likely candidates for being AlphaStar were winning about 90-95% of their matches on the ladder, achieving Grandmaster rank, which is reserved for only the top 200 players in each region. I have not been able to conduct a careful investigation to determine the win rate or Elo rating for the agents. However, based on the videos and replays that have been released, plausible claims from reddit users, and my own recollection of the records for the players that seemed likely to be AlphaStar19, a good estimate is that they were winning a majority of matches among Grandmaster players, but did not achieve an Elo rating that would suggest a favorable outcome in a rematch vs TLO20.
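
As a rough guide to what a "favorable outcome" means in rating terms, the standard Elo formula below gives the expected score for a given rating gap. The ratings in the example are invented, and the ladder's matchmaking rating is not exactly Elo, so this is only illustrative.

```python
# Standard Elo expected-score formula; the ratings below are made up for illustration.
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (roughly, win probability) for player A against player B."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

print(round(elo_expected_score(6400, 6400), 2))  # 0.5: equal ratings, an even rematch
print(round(elo_expected_score(6200, 6400), 2))  # 0.24: a 200-point deficit is unfavorable
```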

As with AlphaStar's January loss, it is hard to say if this is the result of insufficient training time, additional restrictions on camera control and APM, or if the flaws are a deeper, harder-to-solve problem for AI. It may seem unreasonable to chalk this up to insufficient training time given that it has been several months since the matches in December and January, but it helps to keep in mind that we do not yet know what DeepMind's research goals are. It is not hard to imagine that their goals are based around sample efficiency or some other aspect of AI research that requires such restrictions. As with the APM restrictions, we should learn more when we get results published by DeepMind.

Discussion

I have been focusing on what many onlookers have been calling a lack of "fairness" of the matches, which seems to come from a sentiment that the AI did not defeat the best humans on human terms. I think this is a reasonable concern; if we're trying to understand how AI is progressing, one of our main interests is when it will catch up with us, so we want to compare its performance to ours. Since we already know that computers can do the things they're able to do faster than we can do them, we should be less interested in artificial intelligence that can do things better than we can by being faster or by keeping track of more things at once. We are more interested in AI that can make better decisions than we can.

Going into this project, I thought that the disagreements surrounding the fairness of the matches were due to a lack of careful analysis, and I expected it to be very easy to evaluate AlphaStar's performance in comparison to human-level performance. After all, the replay files are just lists of commands, and when we run them through the game engine, we can easily see the outcome of those commands. But it turned out to be harder than I had expected. Separating careful, necessary combat actions (like targeting a particular enemy unit) from important but less precise actions (like training new units) from extraneous, unnecessary actions (like spam clicks) turned out to be surprisingly difficult. I expect if I were to spend a few months learning a lot more about how the game is played and writing my own software tools to analyze replay files, I could get closer to a definitive answer, but I still expect there would be some uncertainty surrounding what actually constitutes human performance.

It is unclear to me where this leaves us. AlphaStar is an impressive achievement, even with the speed and camera advantages. I am excited to see the results of DeepMind's latest experiment on the ladder, and I expect they will have satisfied most critics, at least in terms of the agent's speed. But I do not expect it to become any easier to compare humans to AI in the future. If this sort of analysis is hard in the context of a game where we have access to all the inputs and outputs, we should expect it to be even harder once we're looking at tasks for which success is less clear cut or for which the AI's output is harder to objectively compare to that of humans. This includes some of the major targets for AI research in the near future. Driving a car does not have a simple win-loss condition, and writing novels does not have clear metrics for what good performance looks like.

The answer may be that, if we want to learn things from future successes or failures of AI, we need to worry less about making direct comparisons between human performance and AI performance, and keep watching the broad strokes of what's going on. From AlphaStar, we've learned that one of two things is true: either AI can do long-term planning, solve basic game theory problems, balance different priorities against each other, and develop tactics that work, or there are tasks which seem at first to require all of these things but do not, at least not at a high level.

Acknowledgements

Thanks to Gillian Ring for lending her expertise in e-sports and for helping me understand some of the nuances of the game. Thanks to users of the Starcraft subreddit for helping me track down some of the fastest players in the world. And thanks to Blizzard and DeepMind for making the AlphaStar match replays available to the public.

All mistakes are my own, and should be pointed out to me via email at rick@aiimpacts.org.

Appendix I: Survey Results in Detail

I received a total of 22 submissions, which wasn't bad, given the survey's length. Two respondents failed to correctly answer the question designed to filter out people who are goofing off or not paying attention, leaving 20 useful responses. Five people who filled out the survey were affiliated in some way with AI Impacts. Here are the responses for respondents' self-reported level of expertise in Starcraft II and artificial intelligence:

Survey respondents' mean expertise rating was 4.6/10 for Starcraft II and 4.9/10 for AI.

Questions About AlphaStar's Performance

How fair were the AlphaStar matches?

For this one, it seems easiest to show a screenshot from the survey:

The results from this indicated that people thought the matches were unfair and favored AlphaStar:

I asked respondents to rate AlphaStar's overall performance, as well as its "micro" and "macro". The term "micro" is used to refer to a player's ability to control units in combat, and is greatly improved by speed. There seems to have been some misunderstanding about how to use the word "macro". Based on comments from respondents and looking around to see how people use the term on the Internet, it seems that there are at least three somewhat distinct ways that people use it, and I did not clarify which I meant, so I've discarded the results from that question.

For the next two questions, the scale ranges from 0 to 10, with 0 labeled "AlphaStar is much worse" and 10 labeled "AlphaStar is much better".

Overall, how do you think AlphaStar's performance compares to the best humans?

I found these results interesting: even though AlphaStar was able to consistently defeat professional players, some survey respondents apparently felt the outcome alone was not enough to rate it as at least as good as the best humans.

How do you think AlphaStar's micro compares to the best humans?

Survey respondents unanimously reported that they thought AlphaStar's combat micromanagement was an important factor in the outcome of the matches.

Forecasting Questions

Respondents were split on whether they expected to see AlphaStar's level of Starcraft II performance by this time:

Did you expect to see AlphaStar's level of performance in a Starcraft II agent:

Before now: 1 respondent
Around this time: 8 respondents
Later than now: 7 respondents
I had no expectation either way: 4 respondents

Respondents who indicated that they expected it sooner or later than now were also asked by how many years their expectation differed from reality. If we assign negative numbers to "before now", positive numbers to "later than now", and zero to "around this time", ignore those with no expectation, and weight responses by level of expertise, we find respondents' mean expectation was just 9 months later than the announcement, and the median respondent expected to see it around this time. Here is a histogram of these results, without expertise weighting:

These results do not generally indicate too much surprise about seeing a Starcraft II agent of AlphaStar's ability now.
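
For concreteness, here is a minimal sketch of the expertise-weighted averaging used above, with made-up responses. Each answer is coded in years relative to now (negative for "before now", zero for "around this time", positive for "later than now"), and each is weighted by that respondent's self-reported expertise.

```python
# Expertise-weighted mean of (years, weight) pairs; the responses are invented.
responses = [(-1, 7), (0, 5), (0, 3), (2, 6), (1, 2)]  # (years relative to now, expertise weight)

weighted_mean = (sum(years * weight for years, weight in responses)
                 / sum(weight for _, weight in responses))
print(round(weighted_mean, 2))  # 0.3 years later than now, with these made-up numbers
```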

How many years do you think it will be until we see (in public) an agent which only gets screen pixels as input, has human-level APM and reaction speed, and is very clearly better than the best humans?

This question was intended to outline an AI that would satisfy almost anybody that Starcraft II is a solved game, such that AI is clearly better than humans, and not for "boring" reasons like superior speed. Most survey respondents expected to see such an agent in two-ish years, with a few expecting it to take a little longer, and two expecting it to take much longer. Respondents had a median prediction of two years and an expertise-weighted mean prediction of a little less than four years.

Questions About Relevant Considerations

How important do you think the following were in determining the outcome of the AlphaStar vs MaNa matches?

I listed 12 possible considerations to be rated in importance, from 1 to 5, with 1 being "not at all important" and 5 being "extremely important". The expertise-weighted mean for each question is given below:

Respondents rated AlphaStar's peak APM and camera control as the two most important factors in determining the outcome of the matches, and the particular choice of map and professional player as the two least important considerations.

When thinking about AlphaStar as a benchmark for AI progress in general, how important do you think the following considerations are?

Again, respondents rated a series of considerations by importance, this time for thinking about AlphaStar in a broader context. This included all of the considerations from the previous question, plus several others. Here are the results, again with expertise-weighted averaging.

For these two sets of questions, there was almost no difference between the mean scores if I used only Starcraft II expertise weighting, only AI expertise weighting, or ignored expertise weighting entirely.

Further questions

The rest of the questions were free-form to give respondents a chance to tell me anything else that they thought was important. Although these answers were thoughtful and shaped my thinking about AlphaStar, especially early on in the project, I won't summarize them here.

Appendix II: APM Measurement Methodology

I created a list of professional players by asking users of the Starcraft subreddit which players they thought were exceptionally fast. Replays including these players were found by searching Spawning Tool for replays from tournament matches which included at least one player from the list of fast players. This resulted in 51 replay files.

Several of the replay files were too old to be opened by the current version of Starcraft II, and I ignored them. Others were ignored because they included players, race matchups, or maps that were already represented in other matches. Some were ignored because we did not get to them before we had collected what seemed to be enough data. This left 15 replays that made it into the analysis.

I opened each file using Scelight, and the time and APM values were recorded for the top three peaks on the graph of that player's APM, using 5-second bins. Next, I opened the replay file in Starcraft II, and for each peak recorded earlier, we wrote down whether that player was primarily engaging in combat at the time or not. Additionally, I recorded the time and APM for each player for 2-4 5-second durations of the game in which the players were primarily engaged in combat.

All of the APM values which came from combat and from outside of combat were aggregated into the histogram shown in the 'Speed Controversy' section of this article.
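
The aggregation step itself is simple; the sketch below shows the idea with made-up values, where each recorded peak is stored as an (APM, in-combat) pair and the two groups are plotted as separate histogram series. The real values came from the replays described above.

```python
# Sketch of the aggregation: split recorded peaks by combat label and plot histograms.
import matplotlib.pyplot as plt

peaks = [(620, False), (540, False), (580, False), (310, True), (260, True), (410, True)]  # invented

non_combat = [apm for apm, in_combat in peaks if not in_combat]
combat = [apm for apm, in_combat in peaks if in_combat]

bins = range(200, 701, 50)
plt.hist(non_combat, bins=bins, alpha=0.6, label="outside combat")
plt.hist(combat, bins=bins, alpha=0.6, label="during combat")
plt.xlabel("Peak 5-second APM")
plt.ylabel("Count")
plt.legend()
plt.show()
```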

There are several potential sources of bias or error in this:

  1. Our method for choosing players and matches may be biased. We were seeking examples of humans playing with speed and precision, but it's possible that by relying on input from a relatively small number of Reddit users (as well as some personal friends), we missed something.

  2. This measurement relies entirely on my subjective evaluation of whether the players are mostly engaged in combat. I am not an expert on the game, and it seems likely that I missed some things, at least some of the time.

  3. The tool I used for this seems to mismatch events in the game by a few seconds. Since I was using 5-second bins, and sometimes a player's APM will change greatly between 5-second bins, it's possible that this introduced a significant error.

  4. The choice of 5-second bins (as opposed to something shorter or longer) is somewhat arbitrary, but it is what some people in the Starcraft community were using, so I'm using it here.

  5. Some actions are excluded from the analysis automatically. These include camera updates, and this is probably a good thing, but I did not look carefully at the source code for the tool, so it may be doing something I don't know about.