I’m still confused about how each of the approaches would prevent us from eventually creating agents that spend 99% of their cognition acting corrigibly, while spending a well-hidden 1% of their cognition trying to sniff out whether they’re in the test distribution, and executing a treacherous turn if so. The way I understand your summaries:
Adversarial training makes it harder for the malignant 1% to do anything dangerous, since it makes it harder to tell whether it’s in the test distribution. But it does little to prevent a malignant 1% from very conservatively trying (and usually failing) to ascertain whether it’s in the test distribution.
Interpretability makes it harder for the malignant 1% to be hidden, but doesn’t prevent malignant cognition it can’t detect. (My reading of “Gradient descent will be pushing against whatever kinds of treacherous turns that can be detected by our machinery. But it will only push for things that improve behavior on the training distribution, so a treacherous turn can only appear where it is coupled with the machinery that is producing good predictions, giving us more traction for interpretability.” is completely consistent with this.)
I didn’t understand what you wrote about verification well enough to have anything to say.
It does prima facie seem that an agent spending 100% of its cognition being competent and corrigible achieves higher reward than an agent that only spends 99% of its cognition being competent and corrigible, and 1% of its cognition trying (and almost always failing) to see if it’s in the test distribution. Is your model that gradient descent will favor the former agent over the latter agent, making the 99%/1% agent unlikely to arise (and perhaps quantifiably so)?
The inner process may nevertheless use TDT if TDT doesn’t diverge from CDT on the training distribution, or it might learn to use TDT but “look nice” so that it doesn’t get selected against.
This was what I was intending to convey in assumption 3.
I’m currently intuiting that there’s a broad basin of “seeming corrigible until you can perform a treacherous turn”, but not a broad basin of true corrigibility. If the overseer can robustly detect incorrigible cognition in the distilled agent, I think things may be fine. But if e.g. the overseer is only inspecting the distilled agent’s justifications for its behavior, and something like its verbal loop, I don’t see how things can work out.
Here are my assumptions underlying this intuition:
1. Past a certain capabilities level, distilled agents can meaningfully be modeled as expected utility maximizers for short-term rewards from their overseers. (This is because expected utility maximization is instrumentally convergent, the same way that believing 51 to be composite is instrumentally convergent, as per Eliezer’s comment.)
2. If the agents reason using TDT, the effective utility functions of these act-based agents will optimize for long-term instrumental preferences of their users, because in expectation, adopting this policy raises whatever short-term instrumental reward they’ll get.
3. Expected utility maximizers for discounted future rewards of reward signals are hopelessly incorrigible.
4. Agents produced through the distillation process, misaligned in this way, could enter a regime of devoting 99% of their cognition to doing exactly what a corrigible agent would do, and 1% of their cognition to finding strategies for determining how to / whether to execute a treacherous turn.
5. Because of how many ways there are to hide this 1% of cognition, a complete white-box understanding of the agent’s cognition is necessary to spot this sort of malign reasoning.
6. We can’t achieve this level of understanding via anything like current ML transparency techniques.
Which of these assumptions do you disagree with most? Or do you agree with these assumptions, but feel optimistic that we can get good worst-case guarantees anyway?
2. How easy is it to learn to be corrigible? I’d think of this as: if we impose the extra constraint that our model behave corrigibly on all inputs, in addition to solving the object-level task well, how much bigger do we need to make the model?
My current intuition: for the same reasons it seems really hard to train models that believe 51 is prime while also performing well on object-level tasks, learning to be corrigible seems very difficult.
This does seem like a double crux; my sense is that correctly reasoning about self-modification requires a potentially complicated theory that I don’t expect a general reasoner to realize it needs as soon as it becomes capable of self-modification (or creating successor agents, which I think is a subproblem of self-modification).
I share this intuition, for sufficiently complex self-modifications, with massive error bounds around what constitutes “sufficiently complex”. I’m not sure if humans perform sufficiently complex self-modifications, I think our first AGIs might perform sufficiently complex self-modifications, and I think AGIs undergoing a fast takeoff are most likely performing sufficiently complex self-modifications.
is simply not able to foresee the impacts of its changes and so makes them ‘recklessly’ (in the sense that every particular change seems worth it, even if the policy of making changes at that threshold of certainty seems likely to lead to disaster).
+100. This is why I feel queasy about “OK, I judge this self-modification to be fine” when the self-modifications are sufficiently complex, if this judgment isn’t based off something like zero-shot reasoning (in which case we’d have strong reason to think that an agent following a policy of making every change it determines to be good will actually avoid disasters).
If we view the US government as a single entity, it’s not clear that it would make sense to describe it as aligned with itself, under your notion of alignment. If we consider an extremely akrasiatic human, it’s not clear that it would make sense to describe him as aligned with himself. The more agenty a human is, the more it seems to make sense to describe him as being aligned with himself.
If an AI assistant has a perfect model of what its operator approves of and only acts according to that model, it seems like it should qualify as aligned. But if the operator is very akrasiatic, should this AI still qualify as being aligned with the operator?
It seems to me that clear conceptual understandings of alignment, corrigibility, and benignity depend critically on a clear conceptual understanding of agency, which suggests a few things:
Significant conceptual understanding of corrigibility is at least partially blocked on conceptual progress on HRAD. (Unless you think the relevant notions of agency can mostly be formalized with ideas outside of HRAD? Or that conceptual understandings of agency are mostly irrelevant for conceptual understandings of corrigibility?)
Unless we have strong reasons to think we can impart the relevant notions of agency via labeled training data, we shouldn’t expect to be able to adequately impart corrigibility via labeled training data.
Without a clear conceptual notion of agency, we won’t have a clear enough concept of alignment or corrigibility we can use to make worst-case bounds.
I think a lot of folks who are confused about your claims about corrigibility share my intuitions around the nature of corrigibility / the difficulty of learning corrigibility from labeled data, and I think it would shed a lot of light if you shared more of your own views on this.
I should clarify a few more background beliefs:
I think zero-shot reasoning is probably not very helpful for the first AGI, and will probably not help much with daemons in our first AGI.
I agree that right now, nobody is trying to (or should be trying to) build an AGI that’s competently optimizing for our values for 1,000,000,000 years. (I’d want an aligned, foomed AGI to be doing that.)
I agree that if we’re not doing anything as ambitious as that, it’s probably fine to rely on human input.
I agree that if humanity builds a non-fooming AGI, they could coordinate around solving zero-shot reasoning before building a fooming AGI in a small fraction of the first 10,000 years (perhaps with the help of the first AGI), in which case we don’t have to worry about zero-shot reasoning today.
Conditioning on reasonable international coordination around AGI at all, I give 50% to coordination around intelligence explosions. I think the likelihood of this outcome rises with the amount of legitimacy zero-shot reasoning has at coordination time, which is my main reason for wanting to work on it today. (If takeoff is much slower I’d give something more like 80% to coordination around intelligence explosions, conditional on international coordination around AGIs.)
Let me now clarify what I mean by “foomed AGI”:
A rough summary is included in my footnote: By “recursively self-improving AGI”, I’m specifically referring to an AGI that can complete an intelligence explosion within a year [or hours], at the end of which it will have found something like the optimal algorithms for intelligence per relevant unit of computation. (“Optimally optimized optimizer” is another way of putting it.)
You could imagine analogizing the first AGI we build to the first dynamite we ever build. You could analogize a foomed AGI to a really big dynamite, but I think it’s more accurate to analogize it to a nuclear bomb, given the positive feedback loops involved.
I expect the intelligence differential between our first AGI and a foomed AGI to be numerous orders of magnitude larger than the intelligence differential between a chimp and a human.
In this “nuclear explosion” of intelligence, I expect the equivalent of millions of years of human cognitive labor to elapse, if not many more.
In this comment thread, I was referring primarily to foomed AGIs, not the first AGIs we build. I imagine you either having a different picture of takeoff, or thinking something like “Just don’t build a foomed AGI. Just like it’s way too hard to build AGIs that competently optimize for our values for 1,000,000,000 years, it’s way too hard to build a safe foomed AGI, so let’s just not do it”. And my position is something like “It’s probably inevitable, and I think it will turn out well if we make a lot of intellectual progress (probably involving solutions to metaphilosophy and zero-shot reasoning, which I think are deeply related). In the meantime, let’s do what we can to ensure that nation-states and individual actors will understand this point well enough to coordinate around not doing it until the time is right.”
I’m happy to delve into your individual points, but before I do so, I’d like to get your sense of what you think our remaining disagreements are, and where you think we might still be talking about different things.
Corrigibility. Without corrigibility I would be just as scared of Goodhart.
This seems like it’s using a bazooka to kill a fly. I’m not sure if I agree that zero-shot reasoning saves you from daemons, but even if so, why not try to attack the problem of daemons directly?
I agree that zero-shot reasoning doesn’t save us from daemons by itself, and I think there’s important daemon-specific research to be done independently of zero-shot reasoning. I more think that zero-shot reasoning may end up being critically useful in saving us from a specific class of daemons.
Okay, sure, but then my claim is that Solomonoff induction is _better_ than zero-shot reasoning on the axes you seem to care about, and yet it still has daemons. Why expect zero-shot reasoning to do better?
The daemons I’m focusing on here mostly arise from embedded agency, which Solomonoff induction doesn’t capture at all. (It’s worth noting that I consider there to be a substantial difference between Solomonoff induction daemons and “internal politics”/“embedded agency” daemons.) I’m interested in hashing this out further, but probably at some future point, since this doesn’t seem central to our disagreement.
But in scenarios where we have an AGI, yet we fail to achieve these objectives, the reason that seems most likely to me is “the AGI was incompetent at some point, made a mistake, and bad things happened”. I don’t know how to evaluate the probability of this and so become uncertain. But, if you are correct that we can formalize zero-shot reasoning and actually get high confidence, then the AGI could do that too. The hard problem is in getting the AGI to “want” to do that.
However, I expect that the way we actually get high confidence answers to those questions, is that we implement a control mechanism (i.e. the AI) that gets to act over the entire span of 10,000 or 1 billion years or whatever, and it keeps course correcting in order to stay on the path.
If you’re trying to [build the spacecraft] without putting some general intelligence into it, this sounds way harder to me, because you can’t build in a sufficiently general control mechanism for the spacecraft. I agree that (without access to general-intelligence-routines for the spacecraft) such a task would need very strong zero-shot reasoning. (It _feels_ impossible to me that any actual system could do this, including AGI, but that does feel like a failure of imagination on my part.)
I’m surprised by how much we seem to agree about everything you’ve written here. :P Let me start by clarifying my position a bit:
When I imagine the AGI making a “plan that will work in one go”, I’m not imagining it going like “OK, here’s a plan that will probably work for 1,000,000,000 years! Time to take my hands off the wheel and set it in motion!” I’m imagining the plan to look more like “set a bunch of things in motion, reevaluate and update it based on where things are, and repeat”. So the overall shape of this AGI’s cognition will look something like “execute on some plan for a while, reevaluate and update it, execute on it again for a while, reevaluate and update it again, etc.”, happening millions or billions of times over (which seems a lot like a control mechanism that course-corrects). The zero-shot reasoning is mostly for ensuring that each step of reevaluation and updating doesn’t introduce any critical errors.
I think an AGI competently optimizing for our values should almost certainly be exploring distant galaxies for billions of years (given the availability of astronomical computing resources). On this view, building a spacecraft that can explore the universe for 1,000,000,000 years without critical malfunctions is strictly easier than building an AGI that competently optimizes for our values for 1,000,000,000 years.
Millions of years of human cognitive labor (or much more) might happen in an intelligence explosion that occurs over the span of hours. So undergoing a safe intelligence explosion seems at least as difficult as getting an earthbound AGI doing 1,000,000 years’ worth of human cognition without any catastrophic failures.
I’m less concerned about the AGI killing its operators than I am about the AGI failing to capture a majority of our cosmic endowment. It’s plausible that the latter usually leads to the former (particularly if there’s a fast takeoff on Earth that completes in a few hours), but that’s mostly not what I’m concerned about.
In terms of actual disagreement, I suspect I’m much more pessimistic than you about daemons taking over the control mechanism that course-corrects our AI, especially if it’s doing something like 1,000,000 years’ worth of human cognition, unless we can continuously zero-shot reason that this control mechanism will remain intact. (Equivalently, I feel very pessimistic about the process of executing and reevaluating plans millions/billions+ times over, unless the evaluation process is extraordinarily robust.) What’s your take on this?
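A minimal sketch of the arithmetic behind this worry, assuming purely for illustration that each reevaluation step introduces a critical error independently and with the same probability. The independence assumption, the 0.99 target, and the function names are all hypothetical and not part of the argument above:

```python
import math


def survival_probability(per_step_failure: float, steps: int) -> float:
    """Probability of zero critical errors across `steps` independent steps."""
    # log1p keeps precision when per_step_failure is very small.
    return math.exp(steps * math.log1p(-per_step_failure))


def max_per_step_failure(target_survival: float, steps: int) -> float:
    """Largest per-step failure rate that still meets the survival target."""
    # Solves (1 - p) ** steps == target_survival for p.
    return -math.expm1(math.log(target_survival) / steps)


# Illustrative target: stay on course with probability 0.99 overall.
for steps in (10**6, 10**9):
    p = max_per_step_failure(0.99, steps)
    print(f"{steps:>13,} steps: per-step failure must be <= {p:.3e}")
```

Under these assumptions, the tolerable per-step failure rate comes out on the order of 1e-8 for a million reevaluations and 1e-11 for a billion, which is one way to cash out “extraordinarily robust”. Correlated failures or self-repair would change the numbers in either direction.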
This proposal judges explanations by plausibility and articulateness. Truthfulness is only incidentally relevant and will be Goodharted away.
Keep in mind that the overseer (two steps forward) is always far more powerful than the agent we’re distilling (one step back), is trained to not Goodhart, is training the new agent to not Goodhart (this is largely my interpretation of what corrigibility gets you), and is explicitly searching for ways in which the new agent may want to Goodhart.
I see. Given this, I think “zero-shot learning” makes sense but “zero-shot reasoning” still doesn’t, since in the former “zero” refers to “zero demonstrations” and you’re learning something without doing a learning process targeted at that specific thing, whereas in the latter “zero” isn’t referring to anything and you’re trying to get the reasoning correct in one attempt so “one-shot” is a more sensible description.
I was imagining something like “zero failed attempts”, where each failed attempt approximately corresponds to a demonstration.
Are you saying that in the slow-takeoff world, we will be able to coordinate to stop AI progress after reaching AGI and then solve the full alignment problem at leisure? If so, what’s your conditional probability P(successful coordination to stop AI progress | slow takeoff)?
More like, conditioning on getting international coordination after our first AGI, P(safe intelligence explosion | slow takeoff) is a lot higher, like 80%. I don’t think slow takeoff does very much to help international coordination.
1. If at the time of implementing ALBA, our conceptual understanding of corrigibility is the same as it is today, how doomed would you feel?
2. How are you imagining imposing an extra constraint that our model behave corrigibly on all inputs?
3. My current best guess is that your model of how to achieve corrigibility is to train the AI on a bunch of carefully labeled examples of corrigible behavior. To what extent is this accurate?
This is all assuming an ontology where there exists a utility function that an AI is optimizing, and changes to the AI seem especially likely to change the utility function in a random direction. In such a scenario, yes, you probably should be worried.
I’m mostly concerned with daemons, not utility functions changing in random directions. If I knew that corrigibility were robust and that a corrigible AI would never encounter daemons, I’d feel pretty good about it recursively self-improving without formal zero-shot reasoning.
You could worry about daemons exploiting these bugs under this view. I think this is a reasonable worry, but don’t expect formalizing zero-shot reasoning to help with it. It seems to me that daemons occur by falling into a local optimum when you are trying to optimize for doing some task—the daemon does that task well in order to gain influence, and then backstabs you. This can arise both in ideal zero-shot reasoning, and when introducing approximations to it (as we will have to do when building any practical system).
I’m imagining the AI zero-shot reasoning about the correctness and security of its source code (including how well it’s performing zero-shot reasoning), making itself nigh-impossible for daemons to exploit.
In particular, the one context where we’re most confident that daemons arise is Solomonoff induction, which is one of the best instances of formalizing zero-shot reasoning that we have. Solomonoff gives you strong guarantees, of the sort you can use in proofs—and yet, daemons arise.
I think of Solomonoff induction less as a formalization of zero-shot reasoning, and more as a formalization of some unattainable ideal of rationality that will eventually lead to better conceptual understandings of bounded rational agents, which will in turn lead to progress on formalizing zero-shot reasoning.
I would be very surprised if we were able to handle daemons without some sort of daemon-specific research.
In my mind, there’s no clear difference between preventing daemons and securing complex systems. For example, I think there’s a fundamental similarity between the following questions:
How can we build an organization that we trust to optimize for its founders’ original goals for 10,000 years?
How can we ensure a society of humans flourishes for 1,000,000,000 years without falling apart?
How can we build an AGI which, when run for 1,000,000,000 years, still optimizes for its original goals with > 99% probability? (If it critically malfunctions, e.g. if it “goes insane”, it will not be optimizing for its original goals.)
How can we build an AGI which, after undergoing an intelligence explosion, still optimizes for its original goals with > 99% probability?
I think of AGIs as implementing miniature societies teeming with subagents that interact in extraordinarily sophisticated ways (for example they might play politics or Goodhart like crazy). On this view, ensuring the robustness of an AGI entails ensuring the robustness of a society at least as complex as human society, which seems to me like it requires zero-shot reasoning.
It seems like a simpler task would be building a spacecraft that can explore distant galaxies for 1,000,000,000 years without critically malfunctioning (perhaps with the help of self-correction mechanisms). Maybe it’s just a failure of my imagination, but I can’t think of any way to accomplish even this task without delegating it to a skilled zero-shot reasoner.
Why “zero-shot”? You’re talking about getting something right in one try, so wouldn’t “one-shot” make more sense?
I’ve flip-flopped between “one-shot” and “zero-shot”. I’m calling it “zero-shot” in analogy with zero-shot learning, which refers to the ability to perform a task after zero demonstrations. “One-shot reasoning” probably makes more sense to folks outside of ML.
I think this paragraph gives an overly optimistic impression of how much progress has been made. We are still very confused about what probabilities really are, we haven’t made any progress on the problem of Apparent Unformalizability of “Actual” Induction, and decision theory seems to have mostly stalled since about 8 years ago (the MIRI paper you cite does not seem to represent a substantial amount of progress over UDT 1.1).
I used “substantial progress” to mean “real and useful progress”, rather than “substantial fraction of the necessary progress”. Most of my examples happened in the early to mid-1900s, suggesting that if we continue at that rate we might need at least another century.
This isn’t obvious to me. Can you explain why you think this?
I’d feel much better about delegating the problem to a post-AGI society, because I’d expect such a society to be far more stable if takeoff is slow, and far more capable of taking its merry time to solve the full problem in earnest. (I think it will be more stable because I think it would be much harder for a single actor to attain a decisive strategic advantage over the rest of the world.)
To clarify: your position is that 100,000 scientists thinking for a week each, one after another, could not replicate the performance of one scientist thinking for 1 year?
Actually I would be surprised if that’s the case, and I think it’s plausible that large teams of scientists thinking for one week each could safely replicate arbitrary human intellectual progress.
But if you replaced 100,000 scientists thinking for a week each with 1,000,000,000,000 scientists thinking for 10 minutes each, I’d feel more skeptical. In particular I think 10,000,000 10-minute scientists can’t replicate the performance of one 1-week scientist, unless the 10-minute scientists become human transistors. In my mind there isn’t a qualitative difference between this scenario and the low-bandwidth oversight scenario. It’s specifically dealing with human transistors that I worry about.
I also haven’t thought too carefully about the 10-minute-thought threshold in particular and wouldn’t be too surprised if I revised my view here. But if we replaced “10,000,000 10-minute scientists” with “arbitrarily many 2-minute scientists”, I would feel even more strongly that we couldn’t assemble the scientists safely.
I’m assuming in all of this that the scientists have the same starting knowledge.
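To make the comparison concrete, here is a small sketch of the raw person-time in each scenario mentioned above. It assumes, hypothetically, that a “week” of thinking means all 7 × 24 hours; the function name and scenario labels are only illustrative:

```python
# Assumption: one "week" of thought counts every minute of the week.
MINUTES_PER_WEEK = 7 * 24 * 60


def total_weeks(num_scientists: int, minutes_each: int) -> float:
    """Total thinking time across all scientists, in scientist-weeks."""
    return num_scientists * minutes_each / MINUTES_PER_WEEK


scenarios = {
    "100,000 one-week scientists": (100_000, MINUTES_PER_WEEK),
    "1,000,000,000,000 ten-minute scientists": (10**12, 10),
    "10,000,000 ten-minute scientists": (10**7, 10),
    "one one-week scientist": (1, MINUTES_PER_WEEK),
}

for label, (count, minutes) in scenarios.items():
    print(f"{label}: {total_weeks(count, minutes):,.0f} scientist-weeks")
```

On this accounting, the 10,000,000 ten-minute scientists have thousands of times more raw thinking time than the single one-week scientist, so the skepticism is about whether the work can be decomposed safely at that grain, not about the total amount of thought available.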
There’s an old SlateStarCodex post that’s a reasonable intuition pump for my perspective. It seems to me that the HCH-scientists’ epistemic process is fundamentally similar to that of the alchemists. And the alchemists’ thoughts were constrained by their lifespan, which they partially overcame by distilling past insights to future generations of alchemists. But there still remained massive constraints on their thoughts, and I imagine qualitatively similar constraints present for HCH’s.
I also imagine them to be far more constraining if “thought-lifespans” shrank from ~30 years to ~30 minutes. But “thought-lifespans” on the order of ~1 week might be long enough that the overhead from learning distilled knowledge (knowledge = intellectual progress from other parts of the HCH, representing maybe decades or centuries of human reasoning) is small enough (on the order of a day or two?) that individual scientists can hold in their heads all the intellectual progress made thus far and make useful progress on top of that, without any knowledge having to be distributed across human transistors.
I don’t understand at all how that could be true for brain uploading at the scale of a week vs. year.
Solving this problem requires considering multiple possible approaches. Those can’t be decomposed with 100% efficiency, but it sure seems like they can be split up across people.
Evaluating an approach requires considering a bunch of different possible constraints, considering a bunch of separate steps, building models of relevant phenomena, etc.
Building models requires considering several hypotheses and modeling strategies. Evaluating how well a hypothesis fits the data involves considering lots of different observations. And so on.
I agree with all this.
EDIT: In summary, my view is that:
if all the necessary intellectual progress can be distilled into individual scientists’ heads, I feel good about HCH making a lot of intellectual progress
if the agents are thinking long enough (1 week seems long enough to me, 30 minutes doesn’t), this distillation can happen.
if this distillation doesn’t happen, we’d have to end up doing a lot of cognition on “virtual machines”, and cognition on virtual machines is unsafe.
You’re right—I edited my comment accordingly. But my confusion still stands. Say the problem is “figure out how to upload a human and run him at 10,000x”. On my current view:
(1) However you decompose this problem, you’d need something equivalent to at least 1 year’s worth of a competent scientist doing general reasoning to solve this problem.
(2) In particular, this general reasoning would require the ability to accumulate new knowledge and synthesize it to make novel inferences.
(3) This sort of reasoning would end up happening on a “virtual machine AGI” built out of “human transistors”.
(4) Unless we know how to ensure cognition is safe (e.g. daemon-free) we wouldn’t know how to make safe “virtual machine AGIs”.
(5) So either we aren’t able to perform this reasoning (because it’s unsafe and recognized as such), or we perform it anyway unsafely, which may lead to catastrophic outcomes.
Which of these points would you say you agree with? (Alternatively, if my picture of the situation seems totally off, could you help show me where?)
D-imitations agglomerate to sufficient cognitive power to perform a pivotal act in a way that causes the alignment of the components to be effective upon aligning the whole; and imperfect DD-imitation preserves this property.
This is the crux I currently feel most skeptical of. I don’t understand how we could safely decompose the task of emulating 1 year’s worth of von Neumann-caliber general reasoning on some scientific problem. (I’m assuming something like this is necessary for a pivotal act; maybe it’s possible to build nanotech or whole-brain emulations without such reasoning being automated, in which case my picture for the world becomes rosier.) (EDIT: Rather than “decomposing the task of emulating a year’s worth of von Neumann-caliber general reasoning”, I meant to say “decomposing any problem whose solution seems to require 1 year’s worth of von Neumann-caliber general reasoning”.)
In particular, I’m still picturing Paul’s agenda as implementing some form of HCH, and I don’t understand how anything that looks like an HCH can accumulate new knowledge, synthesize it, and make new discoveries on top of it, without the HCH-humans effectively becoming “human transistors” that implement an AGI. (An analogy: the HCH-humans would be like ants; the AGI would be like a very complicated ant colony.) And unless we know how to build a safe AGI (for example we’d need to ensure it has no daemons), I don’t see how the HCH-humans would know how to configure themselves into a safe AGI, so they just wouldn’t (if they’re benign).
Oops, I think I was conflating “corrigible agent” with “benign act-based agent”. You’re right that they’re separate ideas. I edited my original comment accordingly.
X-and-only-X is what I call the issue where the property that’s easy to verify and train is X, but the property you want is “this was optimized for X and only X and doesn’t contain a whole bunch of possible subtle bad Ys that could be hard to detect formulaically from the final output of the system”.
If X is “be a competent, catastrophe-free, corrigible act-based assistant”, it’s plausible to me that an AGI trained to do X is sufficient to lead humanity to a good outcome, even if X doesn’t capture human values. For example, the operator might have the AGI develop the technology for whole brain emulations, enabling human uploads that can solve the safety problem in earnest, after which the original AGI is shut down.
Being an act-based (and thus approval-directed) agent is doing a ton of heavy lifting in this picture. Humans obviously wouldn’t approve of daemons, so your AI would just try really hard to not do that. Humans obviously wouldn’t approve of a Rubik’s cube solution that modulates RAM to send GSM cellphone signals, so your AI would just try really hard to not do that.
I think most of the difficulty here is shoved into training an agent to actually have property X, instead of just some approximation of X. It’s plausible to me that this is actually straightforward, but it also feels plausible that X is a really hard property to impart (though still much easier to impart than “have human values”).
A crux for me on whether property X is sufficient is whether the operator could avoid getting accidentally manipulated. (A corrigible assistant would never intentionally manipulate, but if it satisfies property X while more directly optimizing Y, it might accidentally manipulate the humans into doing some Y distinct from human values.) I feel very uncertain about this, but it currently seems plausible to me that some operators could successfully just use the assistant to solve the safety problem in earnest, and then shut down the original AGI.