I’m trying to prevent doom from AI. Currently trying to become sufficiently good at alignment research. Feel free to DM for meeting requests.
Towards_Keeperhood
The short answer to “How is it different from corrigibility?” is something like: here we’re thinking about systems that are not sufficiently powerful for us to need them to be fully corrigible.
There’s both “attempt to get coherent corrigibility” and “try to deploy corrigibility principles and keep it bounded enough to do a pivotal act”. I think the latter approach is the main one MIRI imagines after having failed to find a simple coherent-description/utility-function for corrigibility. (Where here it would e.g. be ideal if the AI needs to only reason very well in a narrow domain without being able to reason well about general-domain problems like how to take over the world, though at our current level of understanding it seems hard to get the first without the second.)
EDIT: Actually the attempt to get coherent corrigibility was also aimed at bounded AI doing a pivotal act. But people were trying to formulate utility functions so that the AI can have a coherent shape which doesn’t obviously break once large amounts of optimization power are applied (where decently large amounts are needed for doing a pivotal act). And I’d also count “training for corrigible behavior/thought patterns in the hopes that the underlying optimization isn’t powerful enough to break those patterns” into that bucket, though MIRI doesn’t talk about that much.
I think Rohin’s misunderstanding about corrigibility, aka his notion of Paul!Corrigibility, doesn’t actually come from Paul but from the Risks from Learned Optimization (RFLO) paper[1]:
3. Robust alignment through corrigibility. Information about the base objective is incorporated into the mesa-optimizer’s epistemic model and its objective is modified to “point to” that information. This situation would correspond to a mesa-optimizer that is corrigible(25) with respect to the base objective (though not necessarily the programmer’s intentions).
It seems to me like the authors here just completely misunderstood what corrigibility is about. I think in their ontology, “corrigibly aligned to human values” just means “pointed at indirect normativity (aka human-CEV)”, aka indirectly caring about human values by valuing whatever they infer humans value (as opposed to directly valuing the same things as humans for the same complex reasons[2]).
(Paul’s post seems to me like he might have a correct understanding of corrigibility, and iiuc suggests corrigibility could also be used as an avenue to aligning AI to human values, because we will be able to correct the AI for longer / at higher capability levels if it is corrigible. EDIT: Actually not sure, perhaps he rather means that the AI will end up coherently corrigible from training for corrigibility, that it will converge to that even if we haven’t managed to write down a utility function for corrigibility.)
- ^
IIRC the RFLO paper also caused some confusion in me when I started learning about corrigibility.
- ^
Though note that this kind of “direct alignment” doesn’t necessarily correspond to what they call “internalized alignment”. Their ontology doesn’t make sense to me. (E.g. I don’t see what concretely Evan might mean by “the information came through the base optimizer”.)
Hi,
sorry for commenting without having read most of your post. I just started reading and thought “isn’t this exactly what the corrigibility agenda is/was about?”, and you don’t mention corrigibility in your “relation to other agendas” section, so I thought I’d just ask whether you’re familiar with it and how your approach is different. (Though tbc, I could be totally misunderstanding, I didn’t read far.)
Tbc I think further work on corrigibility is very valuable, but if you haven’t looked into it much I’d suggest reading up on what other people have written on it so far. (I’m not sure whether there are very good explainers, and sometimes people seem to get a wrong impression of what corrigibility is about. E.g. corrigibility has nothing to do with “corrigibly aligned” from the “Risks from Learned Optimization” paper. Also the shutdown problem is often misunderstood too. I would recommend reading and trying to understand the stuff MIRI wrote about it. Possibly parts of this conversation might also be helpful, but sorry, it’s not written in a nice format that explains everything clearly.)
we may be able to avoid this problem by:
not building unbounded, non-interruptible optimizers
and, instead, building some other, safer, kind of AI that can be demonstrated to deliver enough value to make up for the giving up on the business-as-usual kind of AI along with the benefits it was expected to deliver (that “we”, though not necessarily its creators, expect might/would lead to the creation of unbounded, non-interruptible AI posing a catastrophic risk).
This sounds to me like you’re imagining that nobody building more powerful AIs is an option once we’ve already gotten a lot of value from them (where I don’t really know what level of capability you imagine concretely)? If the world were that reasonable, we wouldn’t rush ahead with our abysmal understanding of AI anyway, because obviously the risks outweigh the benefits. Also, you don’t just need to convince the leading labs, because progress will continue and soon enough many, many actors will be able to create unaligned powerful AI, and someone will.
I think the right framing of the bounded/corrigible agent agenda is aiming toward a pivotal act.
But I talk about it more at Plan for mediocre alignment of brain-like [model-based RL] AGI. For what it’s worth, I think I’m somewhat more skeptical of this research direction now than when I wrote that 2 years ago, more on which in a (hopefully) forthcoming post.
If you have an unpublished draft, do you want to share it with me? I could then, sometime in the next 2 weeks, read both your old post and the new one and think about whether I have any more objections.
List of my LW comments I might want to look up again. I thought I’d keep this list public on my shortform in case someone is unusually interested in stuff I write. I’ll add future comments here too. I didn’t include comments on my shortform here:
First, apologies for my rant-like tone. I reread some MIRI conversations 2 days ago and maybe now have a bad EY writing style :-p. Not that I changed my mind yet, but I’m open to it.
What’s the difference between “utilities over outcomes/states” and “utilities over worlds/universes”?
Sorry, should’ve clarified. I use “world” here as in “reality = the world we are in” and “counterfactual = a world we are not in”. Worlds can be formalized as computer programs (where the agent can be a subprogram embedded inside). Our universe/multiverse would also be a world, which could e.g. be described through its initial state + the laws of physics, and thereby encompass the full history. Worlds are conceivable trajectories, though not trajectories in the “include preferences about the actions you take” sense, but only about how the universe unfolds. Probably I’m bad at explaining.
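To make that a bit more concrete, here’s a minimal toy sketch of the “world as a program” framing (the names and toy dynamics are my own illustration, not anything from Eliezer or the embedded agency work):

```python
# Toy sketch (my own illustration): a world is an initial state plus a
# transition rule, and the agent is just a subprogram embedded in that rule.
# A utility over worlds is then a function of the whole history, not only of
# the final state.

def agent_policy(observation: int) -> int:
    # the embedded agent: a trivial policy, just for illustration
    return 1 if observation < 10 else 0

def transition(state: int) -> int:
    # "laws of physics": how the state unfolds, with the agent's action inside
    return state + agent_policy(state)

def world_history(initial_state: int, num_steps: int) -> list[int]:
    # the whole history is fixed by the initial state plus the transition rule
    history = [initial_state]
    for _ in range(num_steps):
        history.append(transition(history[-1]))
    return history

def utility_over_world(history: list[int]) -> float:
    # a "utility over worlds": it can care about the whole unfolding,
    # not just the end state (here: one point per timestep with an even state)
    return sum(1.0 for s in history if s % 2 == 0)

print(utility_over_world(world_history(0, 20)))
```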
I mean I think Eliezer endorses computationalism, and would imagine utilities as sth like “what subcomputations in this program do I find valuable?”. Maybe he thinks it’s usually the case that it doesn’t matter where a positive-utility subcomputation is embedded within a universe. But I think he doesn’t think there’s anything wrong with e.g. wanting there to be diamonds in the stellar age and paperclips afterward, it just requires a (possibly disproportionately) more complex utility function.
Also, utility over outcomes actually doesn’t necessarily mean it’s just about a future state. You could imagine the outcome space including outcomes like “the amount of happiness units I received over all timesteps is N”, and maybe even more complex functions on histories. Though I agree it would be sloppy and confusing to call those outcomes.
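For concreteness (a toy example of my own): you could define outcome(history) = Σ_t happiness(s_t) and then let U depend only on that outcome, which is phrased as a utility over an “outcome” but is really a function of the whole history.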
I didn’t intend for the word “consequentialist” to imply CDT, if that’s what you’re thinking.
Wasn’t thinking that.
And also everything else he said and wrote especially in the 2010s, e.g. the one I cited in the post. He doesn’t always say it out loud, but if he’s not making that assumption, almost everything he says in that post is trivially false. Right?
I don’t quite know. I think there are assumptions there about your preferences about the different kinds of pizza not changing over the time of the trades, and about not having other preferences about trading patterns, and maybe a bit more.
I agree that just having preferences about some future state isn’t a good formalism, and I can see that if you drop that assumption and allow “preferences over trajectories” the conclusion of the post might seem vacuous, because you can encode anything with “utilities over trajectories”. But maybe the right way to phrase it is that we assume we have preferences over worlds, and those are actually somewhat more constrained by what kinds of worlds are consistent. I don’t know. I don’t think the post is a great explanation.
I didn’t see the other links you included as significant evidence for your hypothesis (or sometimes not at all), and I think the corrigibility paper is more important.
But overall yeah, there is at least an assumption that utilities are about worlds. It might indeed be worth questioning! But it’s not obvious to me that you can have a “more broad preferences” proposal that still works. Maybe an agent is only able to do useful stuff insofar as it has utilities over worlds, and the parts of it that don’t have that structure get money pumped by the parts that do.
I don’t know whether your proposal can be formalized as having utilities over worlds. It might be, but it might not be easy.
I don’t know whether you would prefer to take the path of “utilities over worlds is too constrained—here are those other preferences that also work”, or “yeah my proposal is about utilities over worlds”. Either way I think a lot more concreteness is needed.
Objection 2: What if the AI self-modifies to stop being corrigible? What if it builds a non-corrigible successor?
Presumably a sufficiently capable AI would self-modify to stop being corrigible because it planned to, and such a plan would certainly score very poorly on its “the humans will remain in control” assessment. So the plan would get a bad aggregate score, and the AI wouldn’t do it. Ditto with building a non-corrigible successor.
I should clarify what I thought you were claiming in the post:
From my perspective, there are 2 ways to justify corrigibility proposals:
Argue concretely why your proposal is sufficient to reach pivotal capability level while remaining safe.
Show that you are describing a coherent preference that could be superintelligently/unboundedly optimized while still remaining safe/shutdownable/correctable.
I understood you as claiming your proposal fulfills the second thing.
Your answer to Objection 2 sounds to me pretty naive:
How exactly do you aggregate goal- and helpfulness-preferences? If you weigh helpfulness heavily enough that it stays safe, does it then become useless?
Might the AI still prefer plans that make it less likely for the human to press the shutdown button? If so, doesn’t it seem likely that the AI will take other actions that don’t individually seem too unhelpful and eventually disempower humans? And if not, doesn’t it mean the AI would just need to act on the standard instrumental incentives (except not-being-shut-down) of the outcome-based-goal, which would totally cause the operators to shut the AI down? Or how exactly is helpfulness supposed to juggle this?
And we’re not even getting started into problems like “use the shutdown button as outcome pump” as MIRI considered in their corrigibility paper. (And they considered more proposals privately. E.g. Eliezer mentions another proposal here.)
But maybe you actually were just imagining a human-level AI that behaves corrigibly? In which case I’m like “sure, but it doesn’t obviously scale to pivotal level and you haven’t argued for that yet”.
ADDED: On second thought, perhaps you were thinking the approach scales to pivotal-level brain-like AGI. This is plausible but by no means close to obvious to me. E.g. maybe if you scale brain-like AGI up enough it starts working in different ways than it naturally would, e.g. using lots of external programs to do optimization. And maybe then you’re like “the helpfulness assessor wouldn’t allow running too-dangerous programs because of value-drift worries”, and then I’m like “ok fair, it seems like a fine assumption that it’s still going to be capable enough, but how exactly do you plan for the helpfulness drive to also scale in capability as the AI becomes smarter? (And I also see other problems.)”. Happy to try to concretize the proposal together (e.g. builder-breaker-game style).
Just hoping that you don’t get what seems to humans like weird edge instantiations seems silly if you’re dealing with actually very powerful optimization processes. (I mean if you’re annoyed with stupid reward specification proposals, perhaps try to apply that lens here?)
It assesses how well this plan pattern-matches to the concept “there will ultimately be lots of paperclips in the universe”,
It assesses how well this plan pattern-matches to the concept “the humans will remain in control”
So this seems to me like you get a utility score for the first, a utility score for the second, and you try to combine those in some way so it is both safe and capable. It seems to me quite plausible that this is how MIRI got started with corrigibility, and it doesn’t seem too different from what they wrote about on the shutdown button.
I don’t think your objection that you would need to formalize pattern-matching to fuzzy time-extended concepts is reasonable. To the extent that the concepts humans use are incoherent, that is very worrying (e.g. if the helpfulness assessor is incoherent it will in the limit probably get money pumped somehow, leaving the long-term outcomes to be determined mainly by the outcome-goal assessor). To the extent that the “the humans will remain in control” concept is coherent, the concepts are also just math, and you can try to strip away the fuzzy real-world parts by imagining toy environments that still capture the relevant essence. Which is what MIRI tried, and also what e.g. Max Harms tried with “empowerment”.
Concepts like “corrigibility” are often used somewhat inconsistently. Perhaps you’re like “we can just let the AI do the rebinding to better definitions of corrigibility”, and then I’m like “It sure sounds dangerous to me to let a sloppily corrigible AI try to figure out how to become more corrigible, which involves thinking a lot of thoughts about how the new notion of corrigibility might break, and those thoughts might also break the old version of corrigibility. But it’s plausible that there is a sufficient attractor that doesn’t break like that, so let me think more about it and possibly come back with a different problem.”. So yeah, your proposal isn’t obviously unworkable, but given that MIRI failed, it’s apparently not that easy to find a concrete coherent version of corrigibility, and if we start out with a more concrete/formal idea of corrigibility it might be a lot safer.
ADDED:
And if so, we could potentially set things up such that the AI finds things-that-pattern-match-to-that-concept to be intrinsically motivating. Again, it’s a research direction, not a concrete plan. But I talk about it more at Plan for mediocre alignment of brain-like [model-based RL] AGI.
I previously didn’t clearly disentangle this, but what I want to discuss here are the corrigibility aspects of your proposal, not the alignment aspects (which I am also interested in discussing but perhaps separately on your other post). E.g. it’s fine if you assume some way to point the AI, like MIRI assumed we can set the utility function of the AI.
Even just for the corrigibility part, I think you’re being too vague and that it’s probably quite hard to get a powerful optimizer that has the corrigibility properties you imagine even if the “pointing to pattern match helpfulness” part works. (My impression was that you sounded relatively optimistic about this in your post, and that “research direction” mainly was about the alignment aspects.)
(Also I’m not saying it’s obvious that it likely doesn’t work, but MIRI failing to find a coherent concrete description of corrigibility seems like significant evidence to me.)
I think the discussions here and especially here are strong evidence that at least Eliezer & Nate are expecting powerful AGIs to be pure-long-term-consequentialist.
I guess by “pure-long-term-consequentialist” you mean “utilities over outcomes/states”? I am quite sure that they think that in a proper agent formulation, utilities aren’t over outcomes, but (though here I am somewhat less confident) over worlds/universes. (In the embedded agency context they imagined stuff like an agent having a decision function, imagining possible worlds depending on the output of the decision function, and then choosing the output of the decision function in a way that makes the worlds they prefer most “real” while leaving the others counterfactual. Though idk if it’s fully formalized. I don’t work on embedded agency.)
Though it’s possible they decided that for thinking about pivotal acts we want an unreflective optimizer with a taskish goal where it’s fine to model it as having utilities over an outcome. But I would strongly guess that wasn’t the case when they did their main thinking on whether there’s a nice core solution to corrigibility.
I surmise they have a (correct) picture in their head of how super-powerful a pure-long-term-consequentialist AI can be—e.g. it can self-modify, it can pursue creative instrumental goals, it’s reflectively stable, etc.—but they have not similarly envisioned a partially-but-not-completely-long-term-consequentialist AI that is only modestly less powerful (and in particular can still self-modify, can still pursue creative instrumental goals, and is still reflectively stable).
I assume you intend for your corrigibility proposal to pass the shutdown criterion, aka that the AI shuts down if you ask it to shut down, but otherwise doesn’t manipulate you into shutting it down or use the button as an outcome pump in a way that has unintended negative side effects.
I think it was silly to begin with not to check prior work on corrigibility, but given that, it seems like your position was “maybe my corrigibility proposal works because MIRI only considered utility functions over outcomes”.
Then you learned through ADifferentAnonymous that MIRI actually did consider other kinds of preferences when thinking about corrigibility, and you’re just like “nah, but it didn’t seem that way to me from the abstract discussions, so maybe they didn’t do it properly”. (To be fair, I didn’t reread the discussions. If you have quotes that back up your argument that EY imagines utilities only over outcomes, then perhaps provide them. To me, reading this post makes me think you utterly flopped on their ITT, though I could be wrong.)
I think your proposal is way too abstract. If you think it’s actually coherent you should write it down in math.
If you think your proposal actually works, you should be able to strip down real-world complexity and imagine a simple toy environment which still captures the relevant properties, and then write the “the human remains in control” utility function down in math. Who knows, maybe you can actually do it. Maybe MIRI made a silly assumption about accepting too little capability loss and you can demonstrate that the regret is actually low or sth. Maybe your intuitions about “the humans remain in control” would crystallize into something like human empowerment here, and maybe that’s actually good (I don’t know whether it is, I haven’t tried to deeply understand it). (I generally haven’t looked deep into corrigibility, and it is possible MIRI made some silly mistake/assumption. But many people who post about corrigibility or the shutdown problem don’t actually understand the problem MIRI was trying to solve and e.g. let the AI act based on false beliefs.)
MIRI tried a lot harder and a lot more concretely on corrigibility than you, you cannot AFAIK point to a clear reason why your proposal should work when they failed, and my default assumption would be that you underestimate the difficulty by not thinking concretely enough.
It’s important to make concrete proposals so they can be properly criticized, rather than the breaker needing to do the effortful work of trying to steelman and concretize the proposal (and then, at least with people other than you, they’re often like “oh no, that’s not the version I mean, my version actually works”). Sure, maybe there is a reflectively consistent version of your proposal, but probably most of the work in pinning it down is still ahead of you.
So I’m basically saying that from the outside, and perhaps even from your own position, the reasonable position is “Steve probably didn’t think concretely enough to appreciate the ways in which having both consequentialism and corrigibility is hard”.
My personal guess would even be that MIRI tried pretty exactly that what you’re suggesting in this post, and that they tried a lot more concretely and noticed that it’s not that easy.
Just because we haven’t seen a very convincing explanation for why corrigibility is hard doesn’t mean it isn’t actually hard. One might be able to get good intuitions about it by working through lots and lots of concrete proposals, and then still not be able to explain it that well to others.
(Sorry for mini-ranting. I actually think you’re usually great.)
Oops, ok then I guess I will cancel my order.
I just pre-ordered 10 copies. Seems like the most cost effective way to help that I’ve seen in a long time. (Though yes I’m also going to try to distribute my copies.)
Thanks! I think you’re right that my “value function still assigns high valence for thinking in those fun productive ways” hypothesis isn’t realistic for the reason you described.
Then the teacher said “X is good”, where X could be a metacognitive strategy, a goal, a virtue, or whatever. The trainee may well continue believing that X is good after graduation. But that’s just because there’s a primary reward related to social instincts, and imagining yourself as being impressive to people you admire.
I somehow previously hadn’t properly internalized that you think primary reward fires even if you only imagine another person admiring you. It seems quite plausible, but I’m not sure yet.
Paraphrase of your model of how you might end up pursuing what a fictional character would pursue. (Please correct if wrong.):
The fictional character does cool stuff so you start to admire him.
You imagine yourself doing something similarly cool and have the associated thought “the fictional character would be impressed by me”, which triggers primary reward.
The value function learns to assign positive valence to outcomes which the fictional character would be impressed by, since you sometimes imagine the fictional character being impressed afterwards and thus get primary reward.
I still find myself a bit confused:
Getting primary reward only for thinking of something rather than the actual outcome seems weird to me. I guess thoughts are also constrained by world-model-consistency, so you’re incentivized to imagine realistic scenarios that would impress someone, but still.
In particular I don’t quite see the advantage of that design compared to the design where primary reward only triggers on actually impressing people, and then the value function learns to predict that if you impress someone you will get positive reward, and thus predicts high value for that and causally upstream events.
(That said it currently seems to me like forming values from imagining fictional characters is a thing, and that seems to be better-than-default predicted by the “primary reward even on just thoughts” hypothesis, though possible that there’s another hypothesis that explains that well too.)
(Tbc, I think fictional characters influencing one’s values is usually relatively weak/rare, though it’s my main hypothesis for how e.g. most of Eliezer’s values were formed (from his science fiction books). But I wouldn’t be shocked if forming values from fictional characters actually isn’t a thing.)
I’m not quite sure whether one would actually think the thought “the fictional character would be impressed by me”. It rather seems like one might think something like “what would the fictional character do”, without imagining the fictional character thinking about oneself.
The candy example involves good long-term planning right? But not explicit guesses of expected utility.
(No I wouldn’t say the candy example involves long-term planning—it’s fairly easy and doesn’t take that many steps. It’s true that long-term results can be accomplished without expected utility guesses from the world model, but I think it may be harder for really really hard problems because the value function isn’t that coherent.)
Imagine a dath ilani keeper who trained himself good heuristics for estimating expected utilities for what action to take or thought to think next, and reasons like that all the time. This keeper does not seem to me well-described as “using his model-based RL capabilities in the way we normally would expect”.
Why not? If he’s using such-and-such heuristic, then presumably that heuristic is motivating to them—assigned a positive value by the value function. And the reason it’s assigned a positive value by the value function is because of the past history of primary rewards etc.
Say during keeper training the keeper was rewarded for thinking in productive ways, so the value function may have learned to supply valence for thinking in productive ways.
The way I currently think of it, it doesn’t matter which goal the keeper then attacks, because the value function still assigns high valence for thinking in those fun productive ways. So most goals/values could be optimized that way.
Of course, the goals the keeper will end up optimizing are likely close to some self-reflective thoughts that have high valence. It could be an unlikely failure mode, but it’s possible that the thing that gets optimized ends up different from what was high valence. If that happens, strategic thinking can be used to figure out how to keep valence flowing / how to motivate your brain to continue working on something.
The world-model does the “is” stuff, which in this case includes the fact that planA causes a higher expected reduction in pdoom than planB. The value function (and reward function) does the “ought” stuff, which in this case includes the notion that low pdoom is good and high pdoom is bad, as opposed to the other way around.
Ok, actually the way I imagined it, the value function doesn’t evaluate based on abstract concepts like pdoom, but rather the whole reasoning is related to thoughts like “I am thinking like the person I want to be” which have high valence.
(Though I guess your pdoom evaluation is similar to the “take the expected utility guess from the world model” value function that I originally had in mind. I guess the way I modeled it was maybe more like that there’s a belief like “pdoom=high ⇔ bad” and then the value function is just like “apparently that option is bad, so let’s not do that”, rather than the value function itself assigning low value to high pdoom. (Where the value function previously would’ve needed to learn to trust the good/bad judgement of the world model, though again I think it’s unlikely that it works that way in humans.))
How do you imagine the value function might learn to assign negative valence to “pdoom=high”?
Rather, I would suggest that the pathway is that your brain has settled on the idea that working towards good long-term outcomes is socially good, e.g. the kind of thing that your role models would be happy to hear about.
Ok yeah I think you’re probably right that for humans (including me) this is the mechanism through which valence is supplied for pursuing long-term objectives, or at least that it probably doesn’t look like the value function deferring to expected utility guess of the world model.
I think it doesn’t change much of the main point, that the impressive long-term optimization happens mainly through expected utility guesses the world model makes, rather than value guesses of the value function. (Where the larger context is that I am pushing back against your framing of “inner alignment is about the value function ending up accurately predicting expected reward”.)
E.g. when I do some work, I think I usually don’t partially imagine the high-valence outcome of filling the galaxies with happy people living interesting lives, which I think is the main reason why I am doing the work I do (although there are intermediate outcomes that also have some valence).
No offense but unless you have a very unusual personality, your immediate motivations while doing that work are probably mainly social rather than long-term-consequentialist.
I agree that for ~all thoughts I think, they have high enough valence for non-long-term reasons, e.g. self-image valence related.
But I do NOT mean what’s the reason why I am motivated to work on whatever particular alignment subproblem I decided to work on, but why I decided to work on that rather than something else. And the process that led to that decision is sth like “think hard about how to best increase the probability that human-aligned superintelligence is built → … → think that I need to get an even better inside view on how feasible alignment/corrigibility is → plan going through alignment proposals and playing the builder-breaker-game”.
So basically I am thinking about problems like “does doing planA or planB cause a higher expected reduction in my probability of doom”. Where I am perhaps motivated to think that because it’s what my role models would approve of. But the decision of what plan I end up pursuing doesn’t depend on the value function. And those decisions are the ones that add up to accomplishing very long-range objectives.
It might also help to imagine the extreme case: Imagine a dath ilani keeper who trained himself good heuristics for estimating expected utilities for what action to take or thought to think next, and reasons like that all the time. This keeper does not seem to me well-described as “using his model-based RL capabilities in the way we normally would expect”. And yet it’s plausible to me that an AI would need to move a chunk into the direction of thinking like this keeper to reach pivotal capability.
Thanks.
Yeah I think the parts of my comment where I treated the value function as making predictions on how well a plan works were pretty confused. I agree it’s a better framing that plans proposed by the thought generator include predicted outcomes and the value function evaluates on those. (Maybe I previously imagined the thought generator more like proposing actions, idk.)
So yeah I guess what I wrote was pretty confusing, though I still have some concerns here.
Let’s look at how an agent might accomplish a very difficult goal, where the agent didn’t accomplish similar goals yet so the value function doesn’t already assign higher valence to subgoals:
I think chains of subgoals can potentially be very long, and I don’t think we keep the whole chain in mind to get the positive valence of a thought, so we somehow need a shortcut.
E.g. when I do some work, I think I usually don’t partially imagine the high-valence outcome of filling the galaxies with happy people living interesting lives, which I think is the main reason why I am doing the work I do (although there are intermediate outcomes that also have some valence).
It’s easy to implement a fix, e.g.: Save an expected utility guess (aka instrumental value) for each subgoal, and then the value function can assign valence according to the expected utility guess. So in this case I might have a thought like “apply the ‘clarify goal’ strategy to make progress towards the subgoal ‘evaluate whether training for corrigibility might work to safely perform a pivotal act’, which has expected utility X”.
So the way I imagine it here, the value function would need to take the expected utility guess X and output a value roughly proportional to X, so that enough valence is supplied to keep the brainstorming going. I think the value function might learn this because it enables the agent to accomplish difficult long-range tasks which yield reward.
The expected utility could be calculated by having the world model see what value (aka expected reward/utility) the value function assigns to the endgoal, and then backpropagating expected utility estimates for subgoals based on how likely, and with what resources, the larger goal could be accomplished given the smaller goal.
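To make this concrete, here’s a minimal toy sketch of the mechanism I have in mind (the names and numbers are my own illustration, not a claim about how brains implement it):

```python
# Toy sketch (my own illustration): the world model backpropagates an expected
# utility guess for a subgoal from the value of the end goal, and the value
# function then supplies valence roughly proportional to that guess.

def subgoal_expected_utility(endgoal_value: float,
                             p_endgoal_if_subgoal_achieved: float,
                             p_endgoal_baseline: float) -> float:
    # world-model computation: how much does achieving the subgoal raise the
    # probability of the end goal, scaled by the end goal's value
    return (p_endgoal_if_subgoal_achieved - p_endgoal_baseline) * endgoal_value

def valence_from_guess(expected_utility_guess: float, scale: float = 0.1) -> float:
    # value function: defer to the world model's guess (roughly proportionally),
    # so enough valence is supplied to keep working on the subgoal
    return scale * expected_utility_guess

# e.g. an end goal valued at 100, where the subgoal raises its probability
# from 0.10 to 0.15
guess = subgoal_expected_utility(100.0, 0.15, 0.10)  # roughly 5.0
print(valence_from_guess(guess))                     # roughly 0.5
```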
However, the value function is stupid and often not very coherent, given some simplicity assumptions of the world model. E.g. the valence of the outcome “1000 lives get saved” isn’t 1000x higher than that of “1 life gets saved”.
So the world model’s expected utility estimates come apart from the value function’s estimates. And it seems to me that for very smart and reflective people, which difficult goals they achieve depends more on their world model’s expected utility guesses than on their value function’s estimates. So I wouldn’t call it “the agent works as we expect model-based RL agents to work”.
(And I expect this kind of “the world model assigns expected utility guesses” may be necessary to get to pivotal capability if the value function is simple, though not sure.)
Thx.
Seems like an important difference here is that you’re imagining train-then-deploy whereas I’m imagining continuous online learning. So in the model I’m thinking about, there isn’t a fixed set of “reward data”, rather “reward data” keeps coming in perpetually, as the agent does stuff.
I don’t really imagine train-then-deploy, but I think that (1) when the AI becomes coherent enough it will prevent getting further value drift, and (2) the AI eventually needs to solve very hard problems where we won’t have sufficient understanding to judge whether what the AI did is actually good.
Thanks! It’s nice that I’m learning more about your models.
I’ve gone back and forth about whether I should be thinking more about (A) “egregious scheming followed by violent takeover” versus (B) more subtle things e.g. related to “different underlying priors for doing philosophical value reflection”.
(A) seems much more general than what I would call “reward specification failure”.
The way I use “reward specification” is:
If the AI has as goal “get reward” (or sth else) rather than “whatever humans want” because it better fits the reward data, then it’s a reward specification problem.
If the AI has as goal “get reward” (or sth else) rather than “whatever humans want” because it fits the reward data about similarly well and it’s the simpler goal given the architecture, it’s NOT a reward specification problem.
(This doesn’t seem to me to fit your description of “B”.)
(Related.)
I might count the following as reward specification problem, but maybe not, maybe another name would be better:
The AI mostly gets reward for solving problems which aren’t much about human values specifically, so the AI may mainly learn to value insights for solving problems better rather than human values.
(B) seems to me like an overly specific phrasing, and there are many stages where misgeneralization may happen:
when the AI transitions to thinking in goal-directed ways (instead of following more behavioral heuristics or value function estimates)
when the AI starts modelling itself and forms a model of what values it has (where the model might mismatch what is optimized on the object level)
when the AI’s ontology changes and it needs to decide how to rebind value-laden concepts
when the AI encounters philosophical problems like Pascal’s mugging
Section 4 of Jeremy’s and Peter’s report also shows some more ways an AI might fail to learn the intended goal without this being a reward specification failure[1], though it doesn’t use your model-based RL frame.
Also, I don’t think A and B are exhaustive. Other somewhat speculative problems include:
A mesaoptimizer emerges under selection pressure and tries to gain control of the larger AI it is in while staying undetected. (Sorta like cancer for the mind of the AI.)
A special case of this might come from the AI trying to imagine another mind in detail, and the other mind might notice it is simulated and try to take control of the AI.
The AI might make a mistake when designing a more efficient successor AI on a different AI paradigm (especially because it may get pressured by humans into trying to do it quickly because of AI race), so the successor AI ends up with different values.
Other stuff I haven’t thought of now
Tbc, there’s no individual point where I think failure is overwhelmingly likely by default, but overall failure is disjunctive.
if there were a suffering sentient program of which you only know through abstract reasoning, this wouldn’t trigger your innate reward (I think?).
I think it would! I think social instincts are in the “non-behaviorist” category, wherein there’s a ground-truth primary reward that depends on what you’re thinking about. And believing that a computer program is suffering is a potential trigger.
Interesting that you think this.
Having quite good interpretability that we can use to give reward would definitely make me significantly more optimistic.
Though AIs might learn to think thoughts in different formats that don’t trigger negative reward, as e.g. in the “Deep deceptiveness” story.
- ^
Aka some inner alignment (aka goal misgeneralization) failure modes, though I don’t know whether I want to use those words, because it’s actually a huge bundle of problems.
Stuff I noticed so far from thinking about this:
Sensation of desire for closure.
Desire to appear smart (mostly in front of people with very good epistemics, where incentives are relatively aligned to truth-oriented thinking and criticizing others and changing one’s mind is incentivized but not overincentivized, but still).
When I think of a (new) piece of evidence/argument, my mind often initially over-updates in that direction for a minute or so, until I have integrated it into my overall model. (This happens in both directions. Aka I think my intuitive beliefs fluctuate more than makes sense from a Bayesian perspective, though I keep track on the meta level that I might not endorse my current intuitive pessimism/optimism about something and still need to evaluate it more neutrally later.)
Keltham on Becoming more Truth-Oriented
The problem of finding a good representation of abstract thoughts
As background, here’s a simple toy model of thinking:
The goal is to find a good representation of the formal statements (and also the background knowledge) in the diagram.
The visual angle is sorta difficult, so the two easy criteria for figuring out what a good representation is, are:
1. Correspondence to language sentences
2. Well suited for doing logical/probabilistic inference
The second criterion is often neglected. People in semantics often just take language sentences and see how they can rewrite them so they look like formal logic, without taking care that the result is well suited for doing logical/probabilistic inference, let alone specifying the surrounding knowledge that’s required for doing inference.
In my post “Introduction to Representing Sentences as Logical Statements”, I proposed that standard ways of formalizing events like Davidsonian event semantics are bad and that instead we just want to use temporally bounded facts. Here’s a clarification of the criterion according to which my version is perhaps better[1]:
Davidsonian semantics (among other things) allows you to conveniently make it look like you explained how to formalize adverbials (“quickly”, “loudly”, “carefully”) by e.g. formalizing the sentence “Alice quickly went home” as:
∃e(Going(e) ∧ Agent(e, Alice) ∧ Goal(e, Home) ∧ Quick(e) ∧ Past(e))
This is a bug, not a feature. It gives you the illusion that you made progress on understanding language, but actually you only make progress if you’re explaining how a system can make useful inferences (or how a sentence can update a visual scene).
A more precise version of one of the claims from my post is basically that my temporally-bounded-facts way of treating events is closer to the deep formal representation that can be used for logical/probabilistic inference.
You can use the Davidsonian representation, but for actually explaining part of the meaning you need to add a lot of background knowledge for making inferences to other statements, and once you’ve added those background rules, they are, I claim, basically just parsing rules into a deeper representation that uses only temporally bounded facts.
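To illustrate (a rough hypothetical sketch in ad-hoc notation, not exactly the representation from my post): the content of “Alice quickly went home” might instead be carried by temporally bounded facts roughly like
At(Alice, SomePlace, t1) ∧ At(Alice, Home, t2) ∧ t1 < t2 ∧ (t2 − t1) < TypicalDuration(GoingHome)
i.e. facts about what holds at which times, from which “quickly” falls out as a comparison of durations rather than being a predicate on an opaque event e.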
Tbc, the way I represent statements in my post is still not nearly sufficiently close to how our minds might actually track abstract information: Our minds make a lot more precise distinctions and have deeper probabilistic error-tolerant representations. Language sentences are only fuzzy shadows of our true underlying thoughts, and our minds infer a lot from context about what precisely is meant. The problem of parsing sentences into an actually good formal representation obviously becomes correspondingly harder.
- ^
For some reasons why it’s better, maybe see the “Events as facts” section in my post, though it’s not explained well. Though maybe it’s sorta intuitive given the clarified context.
- ^
Here’s a useful exercise Keltham gives after the lecture “the alien maths of dath ilan”.
(Not that important, but IIRC “preferences over trajectories” was formalized as “preferences over state-action-sequences”, and I think it’s sorta weird to have preferences over your actions beyond what kind of states they result in, so I meant without the action part. (Because an action is either an atomic label, in which case actions could be relabeled so that preferences over actions are meaningless, or it’s in some way about what happens in reality.) But it doesn’t matter much. In my way of thinking about it, the agent is part of the environment and so you can totally have preferences related to this-agent-in-particular.)
I guess then I misunderstood what you mean by “preferences over future states/outcomes”. It’s not exactly the same as my “preferences over worlds” model because of e.g. logical decision theory stuff, but I suppose it’s close enough that we can say it’s equivalent if I understand you correctly.
But if you can care about multiple timestamps, why would you only be able to care about what happens (long) after a decision, rather than also what happens during it? I don’t understand why you think “the human remains in control” isn’t a preference over future states. It seems to me just straightforwardly a preference that the human is in control at all future timesteps.
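To make my reading explicit (my own toy formalization, ignoring uncertainty): U(s_1, s_2, s_3, …) = 1 if InControl(human, s_t) holds for every future timestep t, and 0 otherwise. That is a function of future states, just not of a single final state.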
Can you make one or more examples of what is a “other kind of preference”? Or where you draw the distinction what is not a “preference over (future) states”? I just don’t understand what you mean here then.
(One perhaps bad attempt at guessing: You think helpfulness over worlds/future-states wouldn’t weigh strongly enough in decisions, so you want a myopic/act-based helpfulness preference in each decision. (I can think about this if you confirm.))
Or maybe you just actually mean that you can have preferences about multiple timestamps but all must be in the non-close future? Though this seems to me like an obviously nonsensical position and an extreme strawman of Eliezer.
From my perspective it looks like this:
If you want to do a pivotal act you need powerful consequentialist reasoning directed at a pivotal task. This kind of consequentialist cognition can be modelled as utility maximization (or quantilization or so).
If you try to keep it safe through constraints that aren’t part of the optimization target, powerful enough optimization will figure out a way around that or a way to get rid of the constraint.
So you want to try to embed the desire for helpfulness/corrigibility in the utility function.
If I try to imagine what a concrete utility function might look like for your proposal, e.g. “multiply the score of how well I accomplish my pivotal task with the score of how well the operators remain in control”, I think the utility function will have undesirable maxima. And we need to optimize that utility hard enough that the pivotal act is actually successful, which is probably hard enough to get into the undesirable zones.
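To illustrate with made-up numbers (and assuming the task score isn’t bounded): with U = task_score × control_score, a cautious plan with task_score = 10 and control_score = 0.9 gets U = 9, while an aggressive plan with task_score = 1000 and control_score = 0.1 gets U = 100. The maximum lies with the plan where the operators probably lose control, even though “the operators remain in control” is part of the utility function.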
Passive voice was meant to convey that you only need to write down a coherent utility function rather than also describing how you can actually point your AI to that utility function. (If you haven’t read the “ADDED” part which I added yesterday at the bottom of my comment, perhaps read that.)
Maybe you disagree with the utility frame?
If you think that part would be infohazardry, you misunderstand me. E.g. check out Max Harms’ attempt at formalizing corrigibility through empowerment. Good abstract concepts usually have simple mathematical cores, e.g.: probability, utility, fairness, force, mass, acceleration, …
Didn’t say it was easy, but that’s what I think actually useful progress on corrigibility looks like. (Without concreteness/math you may fail to realize how the preferences you want the AI to have are actually in tension with each other and quite difficult to reconcile, and then if you build the AI (and maybe push it past its reluctances so it actually becomes competent enough to do something useful) the preferences don’t get reconciled in that difficult desirable way, but somehow differently, in a way that ends up badly.)