Yep! I addressed this point in footnote .
I just want to share another reason I find this n=1 anecdote so interesting—I have a highly speculative inside view that the abstract concept of self provides a cognitive affordance for intertemporal coordination, resulting in a phase transition in agentiness only known to be accessible to humans.
Hmm, I’m not sure I understand what point you think I was trying to make. The only case I was trying to make here was that much of our subjective experience which may appear uniquely human might stem from our langauge abilites, which seems consistent with Helen Keller undergoing a phase transition in her subjective experience upon learning a single abstract concept. I’m not getting what age has to do with this.
Questions #2 and #3 seem positively correlated – if the thing that humans have is important, it’s evidence that architectural changes matter a lot.
Not necessarily. For example, it may be that language ability is very important, but that most of the heavy lifting in our language ability comes from general learning abilities + having a culture that gives us good training data for learning language, rather than from architectural changes.
I remembered reading about this a while back and updating on it, but I’d forgotten about it. I definitely think this is relevant, so I’m glad you mentioned it—thanks!
I think this explanation makes sense, but it raises the further question of why we don’t see other animal species with partial language competency. There may be an anthropic explanation here—i.e. that once one species gets a small amount of language ability, they always quickly master language and become the dominant species. But this seems unlikely: e.g. most birds have such severe brain size limitations that, while they could probably have 1% of human language, I doubt they could become dominant in anywhere near the same way we did.
Can you elaborate more on what partial language competency would look like to you? (FWIW, my current best guess is on “once one species gets a small amount of language ability, they always quickly master language and become the dominant species”, but I have a lot of uncertainty. I suppose this also depends a lot on what exactly what’s meant by “language ability”.)
This seems like a false dichotomy. We shouldn’t think of scaling up as “free” from a complexity perspective—usually when scaling up, you need to make quite a few changes just to keep individual components working. This happens in software all the time: in general it’s nontrivial to roll out the same service to 1000x users.
I agree. But I also think there’s an important sense in which this additional complexity is mundane—if the only sorts of differences between a mouse brain and a human brain were the sorts of differences involved in scaling up a software service to 1000x users, I think it would be fair (although somewhat glib) to call a human brain a scaled-up mouse brain. I don’t think this comparison would be fair if the sorts of differences were more like the sorts of differences involved in creating 1000 new software services.
That’s one of the “unique intellectual superpowers” that I think language confers us:
On a species level, our mastery of language enables intricate insights to accumulate over generations with high fidelity. Our ability to stand on the shoulders of giants is unique among animals, which is why our culture is unrivaled in its richness in sophistication.
(I do think it helps to explicitly name our ability to learn culture as something that sets us apart, and wish I’d made that more front-and-center.)
I’m still confused about how each of the approaches would prevent us from eventually creating agents that spend 99% of their cognition acting corrigibly, while spending a well-hidden 1% of its cognition trying to sniff out whether it’s in the test distribution, and executing a treacherous turn if so. The way I understand your summaries:
Adversarial training makes it harder for the malignant 1% to do anything dangerous, since it makes it harder to tell whether it’s in the test distribution. But it does little to prevent a malignant 1% from very conservatively trying (and usually failing) to ascertain whether it’s in the test distribution.
Interpretability makes it harder for the malignant 1% to be hidden, but doesn’t prevent malignant cognition it can’t detect. (My reading of “Gradient descent will be pushing against whatever kinds of treacherous turns that can be detected by our machinery. But it will only push for things that improve behavior on the training distribution, so a treacherous turn can only appear where it is coupled with the machinery that is producing good predictions, giving us more traction for interpretability.” is completely consistent with this.)
I didn’t understand what your wrote about verification well enough to have anything to say.
It does prima facie seem that an agent spending 100% of its cognition being competent and corrigible achieves higher reward than an agent that only spends 99% of its cognition being competent and corrigible, and 1% of its cognition trying (and almost always failing) to see if it’s in the test distribution. Is your model that gradient descent will favor the former agent over the latter agent, making the 99%/1% agent unlikely to arise (and perhaps quantifiably so)?
The inner process may nevertheless use TDT if TDT doesn’t diverge from CDT on the training distribution, or it might learn to use TDT but “look nice” so that it doesn’t get selected against.
This was what I was intending to convey in assumption 3.
I’m currently intuiting that there’s a broad basin of “seeming corrigible until you can perform a treacherous turn”, but not a broad basin of true corrigibility. If the overseer can robustly detect incorrigible cognition in the distilled agent, I think things may be fine. But if e.g. the overseer is only inspecting the distilled agent’s justifications for its behavior, and something like its verbal loop, I don’t see how things can work out.
Here are my assumptions underlying this intuition:
1. Past a certain capabilities level, distilled agents can meaningfully be modeled as expected utility maximizers for short-term rewards from its overseer. (This is because expected utility maximization is instrumentally convergent, the same way that believing 51 to be composite is instrumentally convergent, as per Eliezer’s comment.)
2. If the agents reason using TDT, the effective utility functions of these act-based agents will optimize for long-term instrumental preferences of their users, because in expectation, adopting this policy raises whatever short-term instrumental reward they’ll get.
3. Expected utility maximizers for discounted future rewards of reward signals are hopelessly incorrigible.
4. Agents produced through the distillation process, misaligned in this way, could enter a regime of devoting 99% of their cognition to doing exactly what a corrigible agent would do, and 1% of their cognition to finding strategies for determining how to / whether to execute a treacherous turn.
5. Because of how many ways there are to hide this 1% of cognition, a complete white-box understanding of the agent’s cognition is necessary to spot this sort of malign reasoning.
6. We can’t achieve this level of understanding via anything like current ML transparency techniques.
Which of these assumptions do you disagree with most? Or do you agree with these assumptions, but feel optimistic that we can get good worst-case guarantees anyway?
2. How easy is it to learn to be corrigible? I’d think of this as: if we impose the extra constraint that our model behave corrigibly on all inputs, in addition to solving the object-level task well, how much bigger do we need to make the model?
My current intuition: for the same reasons it seems really hard to train models that believe 51 is prime while also performing well on object-level tasks, learning to be corrigible seems very difficult.
This does seem like a double crux; my sense is that correctly reasoning about self-modification requires a potentially complicated theory that I don’t expect a general reasoning to realize it needs as soon as it becomes capable of self-modification (or creating successor agents, which I think is a subproblem of self-modification).
I share this intuition, for sufficiently complex self-modifications, with massive error bounds around what constitutes “sufficiently complex”. I’m not sure if humans perform sufficiently complex self-modifications, I think our first AGis might perform sufficiently complex self-modifications, and I think AGIs undergoing a fast takeoff are most likely performing sufficiently complex self-modifications.
is simply not able to foresee the impacts of its changes and so makes them ‘recklessly’ (in the sense that every particular change seems worth it, even if the policy of making changes at that threshold of certainty seems likely to lead to disaster).
+100. This is why I feel queasy about “OK, I judge this self-modification to be fine” when the self-modifications are sufficiently complex, if this judgment isn’t based off something like zero-shot reasoning (in which case we’d have strong reason to think that an agent following a policy of making every change it determines to be good will actually avoid disasters).
If we view the US government as a single entity, it’s not clear that it would make sense to describe it as aligned with itself, under your notion of alignment. If we consider an extremely akrasiatic human, it’s not clear that it would make sense to describe him as aligned with himself. The more agenty a human is, the more it seems to make sense to describe him as being aligned with himself.
If an AI assistant has a perfect model of what its operator approves of and only acts according to that model, it seems like it should qualify as aligned. But if the operator is very akrasiatic, should this AI still qualify as being aligned with the operator?
It seems to me that clear conceptual understandings of alignment, corrigibility, and benignity depend critically on a clear conceptual understanding of agency, which suggests a few things:
Significant conceptual understanding of corrigibility is at least partially blocked on conceptual progess on HRAD. (Unless you think the relevant notions of agency can mostly be formalized with ideas outside of HRAD? Or that conceptual understandings of agency are mostly irrelevant for conceptual understandings of corrigibility?)
Unless we have strong reasons to think we can impart the relevant notions of agency via labeled training data, we shouldn’t expect to be able to adequately impart corrigibility via labeled training data.
Without a clear conceptual notion of agency, we won’t have a clear enough concept of alignment or corrigibility we can use to make worst-case bounds.
I think a lot of folks who are confused about your claims about corrigibility share my intuitions around the nature of corrigibility / the difficulty of learning corrigibility from labeled data, and I think it would shed a lot of light if you shared more of your own views on this.
I should clarify a few more background beliefs:
I think zero-shot reasoning is probably not very helpful for the first AGI, and will probably not help much with daemons in our first AGI.
I agree that right now, nobody is trying to (or should be trying to) build an AGI that’s competently optimizing for our values for 1,000,000,000 years. (I’d want an aligned, foomed AGI to be doing that.)
I agree that if we’re not doing anything as ambitious as that, it’s probably fine to rely on human input.
I agree that if humanity builds a non-fooming AGI, they could coordinate around solving zero-shot reasoning before building a fooming AGI in a small fraction of the first 10,000 years (perhaps with the help of the first AGI), in which case we don’t have to worry about zero-shot reasoning today.
Conditioning on reasonable international coordination around AGI at all, I give 50% to coordination around intelligence explosions. I think the likelihood of this outcome rises with the amount of legitimacy zero-shot shot reasoning has at coordination time, which is my main reason for wanting to work on it today. (If takeoff is much slower I’d give something more like 80% to coordination around intelligence explosions, conditional on international coordination around AGIs.)
Let me now clarify what I mean by “foomed AGI”:
A rough summary is included in my footnote:  By “recursively self-improving AGI”, I’m specifically referring to an AGI that can complete an intelligence explosion within a year [or hours], at the end of which it will have found something like the optimal algorithms for intelligence per relevant unit of computation. (“Optimally optimized optimizer” is another way of putting it.)
You could imagine analogizing the first AGI we build to the first dynamite we ever build. You could analogize a foomed AGI to a really big dynamite, but I think it’s more accurate to analogize it to a nuclear bomb, given the positive feedback loops involved.
I expect the intelligence differential between our first AGI and a foomed AGI to be numerous orders of magnitude larger than the intelligence differential between a chimp and a human.
In this “nuclear explosion” of intelligence, I expect the equivalent of millions of years of human cognitive labor to elapse, if not many more.
In this comment thread, I was referring primarily to foomed AGIs, not the first AGIs we build. I imagine you either having a different picture of takeoff, or thinking something like “Just don’t build a foomed AGI. Just like it’s way too hard to build AGIs that competently optimize for our values for 1,000,000,000 years, it’s way too hard to build a safe foomed AGI, so let’s just not do it”. And my position is something like “It’s probably inevitable, and I think it will turn out well if we make a lot of intellectual progress (probably involving solutions to metaphilosophy and zero-shot reasoning, which I think are deeply related). In the meantime, let’s do what we can to ensure that nation-states and individual actors will understand this point well enough to coordinate around not doing it until the time is right.”
I’m happy to delve into your individual points, but before I do so, I’d like to get your sense of what you think our remaining disagreements are, and where you think we might still be talking about different things.
Corrigibility. Without corrigibility I would be just as scared of Goodhart.