I still tentatively think the lexical update works, but it’s been a while and I might be missing something.
I’ll follow your convention that our universe is U-simple, our universal prior is U’, and so the aliens’ universe is U’-simple (I think—sorry if I got confused and that’s not what you mean).
If we sample from a universe that is U’-simple, then:
Assume the aliens care about U’-simplicity. They will preferentially sample from U’, and so have U’(our world) mass on our world. Within that, they will correctly guess that the machine they are supposed to control is using U’ as its prior. That is, they basically pay U’(our world) * P(us|someone using U’ to predict).
But our universal prior was also U’, wasn’t it? So we are also paying U’(our world) to pick out our world. I.e. we pay U’(our world) * P(someone making important predictions | our world) * P(someone using U’ to predict | someone making important predictions) * P(us|someone using U’ to predict).
I don’t see any program whose behavior depends on U(world) for the “real” simplicity prior U according to which our world is simple (and that concept seems slippery).
Does that seem right?
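A sketch of the comparison above in symbols. The shorthand is introduced here and is not notation from the original discussion: w stands for "our world", pred for "someone making important predictions", and M for the mass each party ends up with.

```latex
\begin{align*}
M_{\text{aliens}} &= U'(w)\cdot P(\text{us} \mid \text{someone using } U' \text{ to predict}) \\
M_{\text{us}} &= U'(w)\cdot P(\text{pred} \mid w)\cdot P(\text{someone using } U' \text{ to predict} \mid \text{pred})\cdot P(\text{us} \mid \text{someone using } U' \text{ to predict}) \\
\frac{M_{\text{us}}}{M_{\text{aliens}}} &= P(\text{pred} \mid w)\cdot P(\text{someone using } U' \text{ to predict} \mid \text{pred}) \;\le\; 1
\end{align*}
```

The U'(w) factor cancels in the ratio, and U itself never appears in either expression.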
(Note: this is a post from 2014 that I recently added to ai-alignment.com. I still think it’s a very interesting scheme and I’m excited about people exploring better mechanisms for resolving arguments.)
I think the resulting odds won’t reflect the probability of anything, because they depend a lot on whether Alice or Bob is more risk-tolerant (=rich).
If one of them is willing to tolerate risk equal to the value of Judy’s time to hear out the argument, then you are fine. If the total willingness to risk of people who believe “Judy will believe X on reflection” is lower than the value of Judy’s time, then I think you are basically inevitably stuck unless Judy is willing to risk her own attention. If she is willing to risk her own attention, then she can just give people a budget of “minutes” to spend making wagers, as discussed in the post, and as long as the budget is large enough relative to the size of the disagreement it seems like you are OK.
Also, it seems to me that your scheme works best for yes/no questions. For anything more complicated, Alice and Bob can cooperate to mislead Judy, which is especially scary in case of AIs. I’m not sure how to fix that problem: it seems to require a way for a non-expert to check the work of a malicious expert, not just adjudicate between two experts.
The scheme works if one of the experts advocates for the truth. If there are two options, and both players want to manipulate Judy into believing “yes,” then you are similarly in trouble. I agree that if there are more options than experts then it becomes less likely that “by chance” someone wants to advocate for the right answer. But I think in general you are banking on there being some density of experts who want to argue for the truth because it is the truth.
For context, here’s the one time in the interview I mention “AI risk” (quoting 2 earlier paragraphs for context):
Paul Christiano: I don’t know, the future is 10% worse than it would otherwise be in expectation by virtue of our failure to align AI. I made up 10%, it’s kind of a random number. I don’t know, it’s less than 50%. It’s more than 10% conditioned on AI soon I think.
Asya Bergal: I think my impression is that that 10% is lower than some large set of people. I don’t know if other people agree with that.
Paul Christiano: Certainly, 10% is lower than lots of people who care about AI risk. I mean it’s worth saying, that I have this slightly narrow conception of what is the alignment problem. I’m not including all AI risk in the 10%. I’m not including in some sense most of the things people normally worry about and just including the like ‘we tried to build an AI that was doing what we want but then it wasn’t even trying to do what we want’. I think it’s lower now or even after that caveat, than pessimistic people. It’s going to be lower than all the MIRI folks, it’s going to be higher than almost everyone in the world at large, especially after specializing in this problem, which is a problem almost no one cares about, which is precisely how a thousand full time people for 20 years can reduce the whole risk by half or something.
(But it’s still the case that when asked “Can you explain why it’s valuable to work on AI risk?” I responded by almost entirely talking about AI alignment, since that’s what I work on and the kind of work where I have a strong view about cost-effectiveness.)
E.g. if you have a broad distribution over possible worlds, some of which are “fragile” and have 100 things that cut value down by 10%, and some of which are “robust” and don’t, then you get 10,000x more value from the robust worlds. So unless you are a priori pretty confident that you are in a fragile world (or they are 10,000x more valuable, or whatever), the robust worlds will tend to dominate.
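A rough check of the arithmetic, taking the example literally: 100 independent factors, each cutting value by 10%, with a robust world normalized to value 1. The function name and the priors tried are illustrative assumptions, not part of the original argument. Under this reading the ratio comes out around 3.8e4, somewhat above the 10,000x quoted but making the same point.

```python
# Literal reading of the example: these numbers come from the paragraph above,
# but the multiplicative model and the priors tried below are assumptions.
n_factors = 100          # things that can each cut value in a "fragile" world
loss_per_factor = 0.10   # each one removes 10% of the remaining value

robust_value = 1.0
fragile_value = (1 - loss_per_factor) ** n_factors  # value left after all cuts

ratio = robust_value / fragile_value  # how much more a robust world is worth


def robust_share(prior_fragile):
    """Fraction of expected value contributed by robust worlds,
    given a prior probability of being in a fragile world."""
    ev_robust = (1 - prior_fragile) * robust_value
    ev_fragile = prior_fragile * fragile_value
    return ev_robust / (ev_robust + ev_fragile)


print(f"robust / fragile value ratio: {ratio:.0f}")
for prior in (0.5, 0.9, 0.99, 0.999):
    print(f"P(fragile)={prior}: robust share of EV = {robust_share(prior):.4f}")
```

Even with a 99% prior on the fragile world, almost all of the expected value in this toy model still comes from the robust worlds.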
Similar arguments work if we aggregate across possible paths to achieving value within a fixed, known world—if there are several ways things can go well, some of which are more robust, those will drive almost all of the EV. And similarly for moral uncertainty (if there are several plausible views, the ones that consider this world a lost cause will instead spend their influence on other worlds) and so forth. I think it’s a reasonably robust conclusion across many different frameworks: your decision shouldn’t end up being dominated by some hugely conjunctive event.
In the case of something like amplification or debate, I think the bet that you’re making is that language modeling alone is sufficient to get you everything you need in a competitive way.
I’m skeptical of language modeling being enough to be competitive, in the sense of maximizing “log prob of some naturally occurring data or human demonstrations.” I don’t have a strong view about whether you can get away using only language data rather than e.g. taking images as input and producing motor torques as output.
I’m also not convinced that amplification or debate need to make this bet though. If we can do joint training / fine-tuning of a language model using whatever other objectives we need, then it seems like we could just as well do joint training / fine-tuning for a different kind of model. What’s so bad if we use non-language data?
We could also ask: “Would AlphaStar remain as good as it is, if fine-tuned to answer questions?”
In either case it’s an empirical question. I think the answer is probably yes if you do it carefully.
You could imagine separating this into two questions:
Is there a policy that plays Starcraft and answers questions, that is only slightly larger than a policy for playing Starcraft alone? This is a key premise for the whole project. I think it’s reasonably likely; the goal is only to answer questions the model “already knows,” so it seems realistic to hope for only a constant amount of extra work to be able to use that knowledge to answer questions. I think most of the uncertainty here is about details of “know” and question-answering and so on.
Can you use joint optimization to find that policy with only slightly more training time? I think probably yes.
I normally imagine using joint training in these cases, rather than pre-training + fine-tuning. e.g., at every point in time we maintain an agent and a question-answerer, where the question-answerer “knows everything the agent knows.” They get better together, with each gradient update affecting both of them, rather than first training a good agent and then adding a good question-answerer.
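A toy sketch of what "each gradient update affecting both of them" could look like, assuming a single shared weight feeding two heads. Everything here (the scalar model, the targets, the learning rate, the names `joint_loss`, `joint_gradients`, `train`) is a hypothetical illustration and has nothing to do with how any real system is trained.

```python
# Toy joint training: a shared "trunk" weight w feeds an agent head a and a
# question-answering head q. All names and numbers are hypothetical.

xs = [0.5, 1.0, 1.5, 2.0]
agent_targets = [3.0 * x for x in xs]     # stand-in for the "main" objective
answer_targets = [-1.0 * x for x in xs]   # stand-in for question-answering


def joint_loss(w, a, q):
    """Mean of the summed squared errors of both heads."""
    total = 0.0
    for x, ya, yq in zip(xs, agent_targets, answer_targets):
        h = w * x  # shared representation used by both heads
        total += (a * h - ya) ** 2 + (q * h - yq) ** 2
    return total / len(xs)


def joint_gradients(w, a, q):
    """Gradients of joint_loss with respect to w, a and q."""
    gw = ga = gq = 0.0
    for x, ya, yq in zip(xs, agent_targets, answer_targets):
        h = w * x
        err_a = a * h - ya
        err_q = q * h - yq
        ga += 2 * err_a * h
        gq += 2 * err_q * h
        # The shared weight receives gradient from BOTH objectives.
        gw += 2 * err_a * a * x + 2 * err_q * q * x
    n = len(xs)
    return gw / n, ga / n, gq / n


def train(steps=2000, lr=0.01):
    w, a, q = 1.0, 0.5, 0.5
    initial = joint_loss(w, a, q)
    for _ in range(steps):
        gw, ga, gq = joint_gradients(w, a, q)
        # One update step moves the trunk and both heads together.
        w, a, q = w - lr * gw, a - lr * ga, q - lr * gq
    return initial, joint_loss(w, a, q), (w, a, q)


initial_loss, final_loss, params = train()
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.6f}")
```

The contrast with pre-training plus fine-tuning would be to first fit `w` and `a` alone, then freeze or only lightly adjust `w` while fitting `q`. In the joint version the shared weight is shaped by both objectives from the start.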
(Independently of concerns about mesa-optimization, I think the fine-tuning approach would have trouble because you couldn’t use statistical regularities from the “main” objective to inform your answers to questions, and therefore your question answers will be dumber than the policy and so you couldn’t get a good reward function or specification of catastrophically bad behavior.)
I don’t have a big difference in my model of mid vs. final: they have very similar MMR, the difference between them is pretty small in the scheme of things (e.g. probably smaller than the impact of doubling model size), and my picture isn’t refined enough to appreciate those differences. For any particular dumb mistake, I’d be surprised if the line between not making it and making it was in that particular doubling.
AGZ is only trained on the situations that actually arise in games it plays.
I agree with the point that “imitation learning from human games” will only make you play well on kinds of situations that arise in human games, and that self-play can do better by making you play well on a broader set of situations. You could also train on all the situations that arise in a bigger tree search (though AGZ did not) or against somewhat-random moves (which AGZ probably did).
(Though I don’t see this as affecting the basic point.)
Thanks! That’s only marginally less surprising than the final RL policy, and I suspect the final RL policy will make the same kind of mistake. Seems like the OP’s example was legit and I overestimated the RL agent.
But to be clear, (i) it would then also be learned by imitating a large enough dataset from human players who did something like tree search internally while playing, (ii) I think the tree search makes a quantitative not qualitative change, and it’s not that big (mostly improves stability, and *maybe* a 10x speedup, over self-play).
In one comical case, AlphaStar had surrounded the units it was building with its own factories so that they couldn’t get out to reach the rest of the map. Rather than lifting the buildings to let the units out, which is possible for Terran, it destroyed one building and then immediately began rebuilding it before it could move the units out!
It seems like AlphaStar played 90 ladder matches as Terran:
30 with the initial policy trained with SL
30 with a policy from the middle of training
30 with the final RL policy.
This sounds like the kind of mistake that the SL policy would definitely make (no reason it should be able to recover), whereas it’s not clear whether RL would learn how to recover (I would expect it to, but not too strongly).
If it’s easy for anyone to check and they care, it might be worth looking quickly through the replays and seeing whether this particular game was from the SL or RL policies. This is something I’ve been curious about since seeing the behavior posted on Reddit, and it would have a moderate impact on my understanding of AlphaStar’s skill.
It looks like they released 90 replays and played 90 ladder games so it should be possible to check.
The replays are here, hosted on the DM site, sorted into three folders based on the policy. If it’s one of the SL matches, it’s either AlphaStarSupervised_013_TvT.SC2Replay or one of _017_, _019_, or _022_ (based on being TvT and being on Kairos Junction). The video in question is here. I’d check if I had SC2 installed.
(Of course better still would be to find a discussion of the 30 RL replays, from someone who understands the game. Maybe that’s been posted somewhere, I haven’t looked and it’s hard to know who to trust.)
I meant to ask about the policy network in AlphaZero directly. It plays at the professional level (the Nature paper puts it at a comparable Elo to Fan Hui) with no tree search, using a standard neural network architecture trained by supervised learning. It performs fine on parts of the search tree that never appeared during training. What distinguishes this kind of reasoning from “if I see X, I do Y”?
(ETA clarification, because I think this was probably the misunderstanding: the policy network plays Go with no tree search, tree search is only used to generate training data. That suggests the AlphaStar algorithm would produce similar behavior without using tree search ever, probably using at most 100x the compute of AlphaZero and I’d be willing to bet on <10x.)
From the outside, it looks like human-level play at Starcraft is more complicated (in a sense) than human-level play at Go, and so it’s going to take bigger models in order to reach a similar level of performance. I don’t see a plausible-looking distinction-in-principle that separates the strategy in Starcraft from strategy in Go.
Finally, RL practitioners have known that genuine causal reasoning could never be achieved via known RL architectures: you’d only ever get something that could execute the same policy as an agent that had reasoned that way, via a very expensive process of evolving away from dominated strategies at each step down the tree of move and countermove. It’s the biggest known unknown on the way to AGI.
What’s the argument here? Do you think that the AGZ policy (which is extremely good at Go or Chess even without any tree search) doesn’t do any causal reasoning? That it only ever learns to play parts of the game tree it’s seen during training? What does “genuine causal reasoning” even mean?
It looks to me like causal reasoning is just another type of computation, and that you could eventually find that computation by local search. If you need to use RL to guide that search then it’s going to take a long time—AlphaStar was very expensive, and still only trained a policy with ~80M parameters.
From my perspective it seems like the big questions are just how large a policy you would need to train using existing methods in order to be competitive with a human (my best guess would be a ~trillion to a ~quadrillion), and whether you can train it by copying rather than needing to use RL.
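For a sense of scale, here is the gap between the ~80M-parameter policy mentioned above and the guessed ~trillion to ~quadrillion range, expressed in doublings of model size. The variable names are illustrative, and the parameter counts are just the rough figures from the text.

```python
import math

# Rough figures from the text; treat them as order-of-magnitude guesses.
alphastar_params = 80e6   # "~80M parameters"
low_guess = 1e12          # "~trillion"
high_guess = 1e15         # "~quadrillion"

# Number of times the model would have to double in size to close the gap.
doublings_low = math.log2(low_guess / alphastar_params)
doublings_high = math.log2(high_guess / alphastar_params)

print(f"doublings to reach ~1e12 parameters: {doublings_low:.1f}")
print(f"doublings to reach ~1e15 parameters: {doublings_high:.1f}")
```

On these numbers the guess corresponds to somewhere between about 14 and 24 doublings.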
In what sense is this information warfare or even misleading? The second sentence of the blog post says: “AlphaStar was ranked above 99.8% of active players,” which seems quite clear. They seem to have done a pretty good job of making that comparison as fair as you could expect. What do they say or even imply which is highly misleading?
Perhaps they say “Grandmaster level,” and it’s possible that this gives a misleading impression to people who don’t know what that term means in Starcraft? Though I think chess grandmaster also means roughly “better than 99.8% of ladder players,” and the competitive player pools have similar size. So while it might be misleading in the sense that chess has a larger player pool, a smaller fraction of whom are competitive, it seems fairly straightforward.
(edited post to include)