I think you’re moving the goal-posts, since before you mentioned “without external calculators”. I think external tools are likely to be critical to doing this, and I’m much more optimistic about that path to doing this kind of robust generalization. I don’t think that necessarily addresses concerns about how the system reasons internally, though, which still seems likely to be critical for alignment.
I disagree; I think we have intuitive theories of causality (like intuitive physics) that are very helpful for human learning and intelligence.
RE GPT-3, etc. doing well on math problems: the key word in my response was “robustly”. I think there is a big qualitative difference between “doing a good job on a certain distribution of math problems” and “doing math (robustly)”. This could be obscured by the fact that people also make mathematical errors sometimes, but I think the types of errors are importantly different from those made by DNNs.
Are you aware of any examples of the opposite happening? I guess it should for some tasks.
I can interpret your argument as being only about the behavior of the system, in which case:
- I agree that models are likely to learn to imitate human dialogue about causality, and this will require some amount of some form of causal reasoning.
- I’m somewhat skeptical that models will actually be able to robustly learn these kinds of abstractions with a reasonable amount of scaling, but it certainly seems highly plausible.
I can also interpret your argument as being about the internal reasoning of the system, in which case:
- I put this in the “deep learning is magic” bucket of arguments; it’s much better articulated than what we said though, I think...
- I am quite skeptical of these arguments, but still find them plausible. I think it would be fascinating to see some proof of concept for this sort of thing (basically addressing the question ‘when can/do foundation models internalize explicitly stated knowledge’)
I basically agree.
I am arguing against extreme levels of pessimism (~>99% doom).
While I share a large degree of pessimism for similar reasons, I am somewhat more optimistic overall.
Most of this comes from generic uncertainty and epistemic humility; I’m a big fan of the inside view, but it’s worth noting that this can (roughly) be read as a set of 42 statements that need to be true for us to in fact be doomed, and statistically speaking it seems unlikely that all of these statements are true.
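As a purely illustrative calculation (the 95% figure is a made-up number of mine, and the statements are of course not independent in reality), the point is just that conjunctions of many individually-plausible claims are fragile:

```latex
\Pr[\text{all 42 statements hold}] \approx 0.95^{42} \approx 0.12
```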
However, there are some more specific points I can point to where I think you are overconfident, or at least not providing good reasons for such a high level of confidence (and to my knowledge nobody has). I’ll focus on two disagreements which I think are closest to my true disagreements.
1) I think safe pivotal “weak” acts likely do exist. It seems likely that we can access vastly superhuman capabilities without inducing huge x-risk by using a variety of capability control methods. If we could build something that was only N<<infinity times smarter than us, then intuitively it seems unlikely that it would be able to reverse-engineer the details of the outside world or of other AI systems’ source code (cf. 35) that would be necessary to break out of the box or start cooperating with its AI overseers. If I am right, then the reason nobody has come up with such an act is that they aren’t smart enough (in some, possibly quite narrow, sense of smart); that’s why we need the superhuman AI! Of course, it could also be that someone has such an idea, but isn’t sharing it publicly / with Eliezer.
2) I am not convinced that any superhuman AGI we are likely to have the technical means to build in the near future is going to be highly consequentialist (although this does seem likely). I think that humans aren’t actually that consequentialist, current AI systems even less so, and it seems entirely plausible that you don’t just automatically get super consequentialist things no matter what you are doing or how you are training them… if you train something to follow commands in a bounded way using something like supervised learning, maybe you actually end up with something that does something reasonably close to that. My main reason for expecting consequentialist systems at superhuman-but-not-superintelligent-level AGI is that people will build them that way because of competitive pressures, not because systems that people are trying to make non-consequentialist end up being consequentialist.
These two points are related: If we think (2), then we should be more skeptical of (1), although we could still hope to use capability control and incentive schemes to harness a superhuman-but-not-superintelligent consequentialist AGI to devise and help execute “weak” pivotal acts.
3) Maybe one more point worth mentioning is the “alien concepts” bit: I also suspect AIs will have alien concepts and thus generalize in weird ways. Adversarial examples and other robustness issues are evidence in favor of this, but we are also seeing that scaling makes models more robust, so it seems plausible that AGI will actually end up using concepts similar to humans’, making it natural for AGI systems to generalize in the ways we intend/expect.
---------------------------------------------------------------------
The rest of my post is sort of just picking particular places where I think the argumentation is weak, in order to illustrate why I currently think you are, on net, overconfident.

7. The reason why nobody in this community has successfully named a ‘pivotal weak act’ where you do something weak enough with an AGI to be passively safe, but powerful enough to prevent any other AGI from destroying the world a year later—and yet also we can’t just go do that right now and need to wait on AI—is that nothing like that exists.
This contains a dubious implicit assumption, namely that we cannot build safe superhuman intelligence, even if it is only slightly superhuman, or superhuman only in various narrow-but-strategically-relevant areas.
19. More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment—to point to latent events and objects and properties in the environment, rather than relatively shallow functions of the sense data and reward.
This is basically what CIRL aims to do. We can train for this sort of thing and study such methods of training empirically in synthetic settings.
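To give a sense of what I mean by studying this empirically in synthetic settings, here is a minimal toy sketch in the spirit of CIRL-style assistance games (my own made-up setup and numbers, not the actual CIRL formalism): a simulated “human” acts Boltzmann-rationally with respect to a latent reward vector, and the learner is optimized to infer that latent quantity rather than some shallow function of the observations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy assistance-game-style setting: the "human" has a latent reward vector
# theta, chooses among actions Boltzmann-rationally, and the learner keeps a
# posterior over theta -- i.e. it is trained to point at a latent property of
# the environment rather than at a shallow function of the sense data.

n_features, n_actions, beta = 3, 5, 5.0
true_theta = rng.normal(size=n_features)               # latent human reward
action_features = rng.normal(size=(n_actions, n_features))

def human_choice():
    """Human picks action a with probability proportional to exp(beta * r(a))."""
    logits = beta * action_features @ true_theta
    p = np.exp(logits - logits.max())
    return rng.choice(n_actions, p=p / p.sum())

# Simple particle approximation to the learner's posterior over theta.
particles = rng.normal(size=(2000, n_features))
log_w = np.zeros(len(particles))

for _ in range(50):                                     # observe 50 human choices
    a = human_choice()
    logits = beta * particles @ action_features.T       # per-particle action logits
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_w += log_p[:, a]                                # Bayesian update on the choice

w = np.exp(log_w - log_w.max())
theta_hat = (w / w.sum()) @ particles                   # posterior mean estimate
print("true theta    :", np.round(true_theta, 2))
print("posterior mean:", np.round(theta_hat, 2))
```

Obviously this says nothing about whether such pointing scales to rich real-world latents, but it is the kind of setup where the question can be studied empirically.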
23. Corrigibility is anti-natural to consequentialist reasoning
Maybe I missed it, but I didn’t see any argument for why we end up with consequentialist reasoning.
30. [...] There is no pivotal output of an AGI that is humanly checkable and can be used to safely save the world but only after checking it; this is another form of pivotal weak act which does not exist.
It seems like such things are likely to exist by analogy with complexity theory (checking a solution is typically much easier than proposing one).
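To spell out the analogy (this is just the textbook SAT example, with made-up clauses): verifying a proposed solution takes a single linear pass, while finding one may require exponential search.

```python
from itertools import product

# 3-CNF formula: each clause is a list of (variable_index, is_positive) literals.
clauses = [[(0, True), (1, False), (2, True)],
           [(0, False), (2, False), (3, True)],
           [(1, True), (3, False), (2, True)]]

def check(assignment):
    """Verifying a proposed solution: one linear pass over the clauses."""
    return all(any(assignment[v] == pos for v, pos in clause) for clause in clauses)

def propose(n_vars):
    """Finding a solution: brute force over all 2^n assignments in the worst case."""
    for bits in product([False, True], repeat=n_vars):
        if check(bits):
            return bits
    return None

solution = propose(4)
print("found:", solution)              # exponential-time search in the worst case...
print("checks out:", check(solution))  # ...but verifying it is a single cheap pass
```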
36. AI-boxing can only work on relatively weak AGIs; the human operators are not secure systems.
I figured it was worth noting that this part doesn’t explicitly say that relatively weak AGIs can’t perform pivotal acts.
What about graphics? e.g. https://twitter.com/DavidSKrueger/status/1520782213175992320
This is Eliezer’s description of the core insight behind Paul’s imitative amplification proposal. I find this somewhat compelling, but less so than I used to, since I’ve realized that the line between imitation learning and reinforcement learning is blurrier than I used to think (e.g. see this or this).
I didn’t understand what you meant by the line being blurrier… Is this a comment about what works in practice for imitation learning? Does a similar objection apply if we replace imitation learning with behavioral cloning?
Weight-sharing makes deception much harder.
Can I read about that somewhere? Or could you briefly elaborate?
I’ll be at EAG this weekend.
It’s possible that a lot of our disagreement is due to different definitions of “research on alignment”, where you would only count things that (e.g.) 1) are specifically about alignment approaches likely to scale to superintelligent systems, or 2) are motivated by x-safety.
To push back on that a little bit...
RE (1): It’s not obvious what will scale, and I think historically this community has been too pessimistic (i.e. almost completely dismissive) about approaches that seem hacky or heuristic.
RE (2): This is basically circular.
In particular, in a fast takeoff world, AI takeover risk never looks much more obvious than it does now, and so x-risk-motivated people should be assumed to cause the majority of the research on alignment that happens.
I strongly disagree with that and I don’t think it follows from the premise. I think by most reasonable definitions of alignment it is already the case that most of the research is not done by x-risk motivated people.
Furthermore, I think it reflects poorly on this community that this sort of sentiment seems to be common.
like “a normal distribution”, thx!
quick take: Roughly speaking, adversarial examples are the Modern Reformulation you’re asking about.
In my mind the main issue here is that we probably need extreme levels of robustness / OOD-catching. And these probably only come much too late, after less-cautious actors have deployed AI systems that induce lots of x-risk.
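As a concrete picture of the robustness problem adversarial examples point at, here is a minimal FGSM-style sketch on a toy logistic-regression model (the weights, input, and epsilon are all made up for illustration): a small, bounded perturbation along the sign of the input gradient flips the prediction.

```python
import numpy as np

# Toy logistic-regression "classifier" with fixed (made-up) weights.
w, b = np.array([2.0, -1.5, 0.5]), 0.1

def prob_class1(x):
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

x = np.array([0.1, 0.3, -0.2])             # correctly classified as class 0
print("clean input, p(class 1):", round(prob_class1(x), 3))        # ~0.44 -> class 0

# FGSM-style attack: nudge each input dimension by eps in the direction that
# increases the loss for the true label (here label 0, loss = -log(1 - p)),
# i.e. along sign(d loss / d x) = sign(p * w).
eps = 0.25
grad_x = prob_class1(x) * w                # analytic input gradient of the loss
x_adv = x + eps * np.sign(grad_x)          # small, norm-bounded perturbation

print("adversarial input, p(class 1):", round(prob_class1(x_adv), 3))  # ~0.68 -> class 1
```

For deep networks the same phenomenon is much harder to patch, which is the “extreme levels of robustness” worry above.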
I see an implicit premise I disagree with about the value of improving communication within the rationalist community vs. between rationalists and outsiders; it seems that I place relatively more importance on the latter than you do.
I’m familiar with these claims, and (I believe) the principal supporting arguments that have been made publicly. I think I understand them reasonably well.
I don’t find them decisive. Some aren’t even particularly convincing. A few points:
- EY sets up a false dichotomy between “train in safe regimes” and “train in dangerous regimes”. In the approaches under discussion there is an ongoing effort (e.g. involving some form(s) of training) to align the system, and the proposal is to keep this effort ahead of advances in capability (in some sense).
- The first 2 claims for why corrigibility wouldn’t generalize seem to prove too much—why would intelligence generalize to “qualitatively new thought processes, things being way out of training distribution”, but corrigibility would not?
- I think the last claim (that corrigibility is “anti-natural”) is more compelling. However, we don’t actually understand the space of possible utility functions and agent designs well enough for that claim to be very strong. We know that any behavior is compatible with some utility function (a standard construction is sketched after this list), so I would interpret Eliezer’s claim as relating to the complexity or description length of utility functions that encode corrigible behavior. Work on incentives suggests that removing manipulation incentives might add very little complexity to the description of the utility function, for an AI system that already understands the world well. Humans also seem to find it simple enough to add the “without manipulation” qualifier to an objective.
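For concreteness, the construction I have in mind for “any behavior is compatible with a utility function” is the standard indicator-utility trick, sketched below for a deterministic policy (my notation, not Eliezer’s).

```latex
% For any deterministic policy \pi, define a utility function over
% trajectories \tau = (s_0, a_0, s_1, a_1, \dots) by
U_\pi(\tau) =
\begin{cases}
  1 & \text{if } a_t = \pi(s_t) \text{ for all } t,\\
  0 & \text{otherwise.}
\end{cases}
% \pi is trivially optimal for U_\pi, so bare "compatibility with a utility
% function" is vacuous; the substantive question is how complex a utility
% function has to be in order to encode corrigible behavior.
```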
I guess actually the goal is just to get something aligned enough to do a pivotal act. I don’t see, though, why an approach that tries to maintain a sufficient level of alignment (relative to current capabilities) as capabilities scale couldn’t work for that.
Different views about the fundamental difficulty of inner alignment seem to be a (the?) major driver of differences in views about how likely AI X risk is overall.
I strongly disagree with inner alignment being the correct crux. It does seem to be true that this is in fact a crux for many people, but I think this is a mistake. It is certainly significant.
But I think optimism about outer alignment and global coordination (“Catch-22 vs. Saving Private Ryan”) is a much bigger factor, and optimists are badly wrong on both points here.
I don’t think that’s true.