AlphaStar: Impressive for RL progress, not for AGI progress

DeepMind released their AlphaStar paper a few days ago, having reached Grandmaster level at the partial-information real-time strategy game StarCraft II over the summer.

This is very impressive, and yet less impressive than it sounds. I used to watch a lot of StarCraft II (I stopped interacting with Blizzard recently because of how they rolled over for China), and over the summer there were many breakdowns of AlphaStar games once players figured out how to identify the accounts.

The impressive part is getting reinforcement learning to work at all in such a vast state space; that took breakthroughs beyond what was necessary to solve Go and beat Atari games. AlphaStar had to have a rich enough set of potential concepts (in the sense that e.g. a convolutional net ends up having concepts of different textures) that it could learn a concept like "construct building P" or "attack unit Q" or "stay out of the range of unit R" rather than just "select spot S and enter key T". This is new and worth celebrating.
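
As a rough illustration (hypothetical names, not DeepMind's actual interface), the gap is between the raw actions the game engine ultimately accepts and the kind of concepts a policy has to converge on internally before it can reuse what it has learned:

```python
# Toy illustration only; all names here are hypothetical.
from dataclasses import dataclass
from enum import Enum, auto

@dataclass
class RawAction:
    """What the game ultimately receives: a screen coordinate plus a keypress."""
    x: int
    y: int
    key: str  # e.g. the hotkey sequence for placing a particular building

class Concept(Enum):
    """The kind of abstraction the policy has to discover on its own."""
    CONSTRUCT_BUILDING = auto()   # "construct building P"
    ATTACK_UNIT = auto()          # "attack unit Q"
    STAY_OUT_OF_RANGE = auto()    # "stay out of the range of unit R"

# A policy that only ever sees (x, y, key) has to learn that whole families
# of raw actions are instances of a single concept before it can transfer
# anything it learned about them to new situations.
```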

The overhyped part is that AlphaStar doesn't really do the "strategy" part of real-time strategy. Each race has a few solid builds that it executes at GM level, and the unit control is fantastic, but the replays don't look creative or even especially reactive to opponent strategies.

That's because there's no representation of causal thinking: "if I did X then they could do Y, so I'd better do X' instead". Instead there are many agents evolving together, and if there's an agent evolving to try Y, then the agents doing X will be replaced with agents that do X'. But to explore the game tree of viable strategies as thoroughly as humans do, this approach could take more computing resources than even today's DeepMind could afford.
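
Here is that dynamic in miniature, as a toy sketch of my own (not AlphaStar's actual league setup, and with made-up strategy names and payoffs). Nothing below anticipates the counter Y; X only gets displaced after Y has actually appeared in the population and punished it, and the adjustment X' only appears after that.

```python
import random

# Toy payoff matrix: PAYOFF[a][b] is the score strategy a gets against b.
PAYOFF = {
    "X":  {"X": 0, "Y": -1, "X'": 0},   # X is fine until the counter Y shows up
    "Y":  {"X": 1, "Y": 0,  "X'": -1},  # Y beats X but loses to the adjustment X'
    "X'": {"X": 0, "Y": 1,  "X'": 0},
}

population = ["X"] * 12  # everyone starts out playing X

for generation in range(60):
    # Score each strategy currently present against the whole population.
    scores = {s: sum(PAYOFF[s][opp] for opp in population) for s in set(population)}
    best = max(scores, key=scores.get)
    # Dominated strategies are replaced by copies of the current best performer,
    # with occasional random mutations introducing strategies nobody plays yet.
    population = [
        random.choice(list(PAYOFF)) if random.random() < 0.1 else best
        for _ in population
    ]

print(population)  # typically mostly "X'" by the end of the run
```

Each level of move and countermove has to be discovered by actually playing it out many times, which is why covering a deep tree of responses this way multiplies the compute cost at every step.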

(This lack of causal reasoning especially shows up in building placement, where the consequences of locating any one building here or there are minor, but the consequences of your overall SimCity are major for how your units and your opponents' units would fare if they attacked you. In one comical case, AlphaStar had surrounded the units it was building with its own factories so that they couldn't get out to reach the rest of the map. Rather than lifting the buildings to let the units out, which is possible for Terran, it destroyed one building and then immediately began rebuilding it before it could move the units out!)

This means that, first, AlphaStar just doesn't have a decent response to strategies that it didn't evolve, and second, it doesn't do very well at building up a reactive decision tree of strategies (if I scout this, I do that). The latter kind of play is unfortunately very necessary for playing Zerg at a high level, so the internal meta has just collapsed into one where its Zerg agents predictably rush out early attacks that are easy to defend if expected. This has the flow-through effect that its Terran and Protoss are weaker against human Zerg than against other races, because they've never practiced against a solid Zerg that plays for the late game.
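
For concreteness, the kind of play meant by a "reactive decision tree" is an explicit scout-and-respond branch, sketched below with invented scouting labels and responses; the point is only that the choice of plan is conditioned on what the opponent is seen doing.

```python
# Toy sketch of scout-and-respond play; the labels and plans are invented.
def choose_plan(scouted: str) -> str:
    """Pick a plan from scouting information (hypothetical categories)."""
    if scouted == "early_aggression":
        return "defend, then counter-attack"
    if scouted == "fast_expansion":
        return "punish the greed with an early push"
    if scouted == "tech_rush":
        return "expand and play for the late game"
    return "default macro build"

print(choose_plan("fast_expansion"))
```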

The end result cleaned up against weak players, performed well against good players, but practically never took a game against the top few players. I think that DeepMind realized they'd need another breakthrough to do what they did to Go, and decided to throw in the towel while making it look like they were claiming victory. (Key quote: "Prof Silver said the lab 'may rest at this point', rather than try to get AlphaStar to the level of the very elite players.")

Finally, RL practitioners have known that genuine causal reasoning could never be achieved via known RL architectures; you'd only ever get something that could execute the same policy as an agent that had reasoned that way, via a very expensive process of evolving away from dominated strategies at each step down the tree of move and countermove. It's the biggest known unknown on the way to AGI.