I’m trying to prevent doom from AI. Currently trying to become sufficiently good at alignment research. Feel free to DM for meeting requests.
Towards_Keeperhood
(You did respond to all the important parts; the rest of my comment is very much optional.)
My reading was that you still have an open disagreement where Steve thinks there’s not much more to explain, but you still want an answer to “Why did people invent the word ‘consciousness’ and write what they wrote about it? What algorithm might output sentences describing fascination with the redness of red?”, which Steve’s series doesn’t answer.
I wouldn’t give up that early on trying to convince Steve he’s missing some part. (Though it’s possible that I misread Steve’s comment and he did understand you; I didn’t read it precisely.)
Here’s the (obvious) strategy: Apply voluntary attention-control to keep S(getting out of bed) at the center of attention. Don’t let it slip away, no matter what.
Can you explain more precisely how this works mechanistically? What is happening to keep S(getting out of bed) in the center of attention?
8.5.6.1 Aside: The “innate drive to minimize voluntary attention control”
Your hypothesis here doesn’t seem to me to explain why we seem to have a limited willpower budget for attention control, which gets depleted but also regenerates after a time. I can see how negative rewards from the innate drive to minimize voluntary attention control can make us less likely to apply willpower in the future, but why would the budget regenerate then?
Btw, there’s another, simpler possible mechanism, though I don’t know the neuroscience, and perhaps Steve’s hypothesis with separate valence assessors and involuntary attention control fits the neuroscience evidence much better, and may also fit observed motivated reasoning better.
But the obvious way to design a mind would be to make it just focus on whatever is most important, i.e. wherever the most expected utility per unit of necessary resources could be gained.
So we still have a learned value function which assigns how good/bad something would be, but we also have an estimator of how much the value would increase if we continue thinking (which might e.g. happen because one makes plans for making a somewhat bad situation better), and what gets attended to depends on this estimator, not on the value function directly.
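Here’s a minimal toy sketch of that alternative mechanism (my own illustration with made-up names and numbers, not a claim about actual neuroscience):

```python
def expected_value_gain(thought, value, fixability):
    # Made-up heuristic: there is more to gain from thinking about something
    # that is currently bad-ish but fixable than about something already fine.
    return fixability[thought] * max(0.0, 1.0 - value[thought])

def choose_attention_target(thoughts, value, fixability):
    # Attention follows the estimated gain from further thinking,
    # not the value function's raw score.
    return max(thoughts, key=lambda t: expected_value_gain(t, value, fixability))

# Example with made-up numbers: a mildly unpleasant but fixable situation wins
# attention over a pleasant, already-settled one.
value = {"stay in warm bed": 0.8, "get up and catch the train": -0.2}
fixability = {"stay in warm bed": 0.1, "get up and catch the train": 0.9}
print(choose_attention_target(list(value), value, fixability))
# -> "get up and catch the train"
```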
The “S” in “S(X)” and “S(A)” seems different to me. If I rename the “S” in “S(A)” to “I”, it would make more sense to me:
A = action of standing up (which actually gets executed if it has positive valence)
I(A) = imagined scene of myself standing up
S(I(A)) = the thought “I am thinking about standing up”
Yeah I agree that it wouldn’t be a very bad kind of s-risk. The way I thought about s-risk was more like expected amount of suffering. But yeah, I agree with you that it’s not that bad, and perhaps most expected suffering comes from more active utility-inversion threats or values.
(Though tbc, I was totally imagining 1e40 humans being forced to press reward buttons.)
I probably got more out of watching Hofstadter give a little lecture on analogical reasoning (example) than from this whole book.
I didn’t read the lecture you linked, but I liked Hofstadter’s book “Surfaces and Essences” which had the same core thesis. It’s quite long though. And not about neuroscience.
I find this rather ironic:
6. If the AGI subverts the setup and gets power, what would it actually want to do with that power?
It’s hard to say. Maybe it would feel motivated to force humans to press the reward button over and over. Or brainwash / drug them to want to press the reward button.
[...]
On the plus side, s-risk (risk of astronomical amounts of suffering) seems very low for this kind of approach.
(I guess I wouldn’t say it’s very low s-risk, but that’s not actually an important disagreement here. I partially just thought it sounded funny.)
I agree Eliezer likely wouldn’t want “corrigibility” to refer to the thing I’m imagining, which is why I talk about MIRI!corrigibility and Paul!corrigibility.
Yeah thanks for distinguishing. It’s not at all obvious to me that Paul would call CIRL “corrigible”—I’d guess not, but idk.
My model of what Paul thinks about corrigibility matches my model of corrigibility much, much more closely than CIRL does. It’s possible that the EY-Paul disagreement mostly comes down to consequentialism. CIRL seems obviously uncorrigible/uncorrectable except when the AI is still dumber than the smartest humans in the general domain.
I disagree that in early-CIRL “the AI doesn’t already know its own values and how to accomplish them better than the operators”. It knows that its goal is to optimize the human’s utility function, and it can be better than the human at eliciting that utility function. It just doesn’t have perfect information about what the human’s utility function is.
Sorry, that was very poorly phrased by me. What I meant was “the AI doesn’t already know how to evaluate what’s best according to its own values better than the operators do”. So yes, I agree. I still find it confusing, though, why people started calling that corrigibility.
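For concreteness, the way I understand the CIRL setup (roughly following the original CIRL formulation, from memory, so take the exact form with a grain of salt): the human knows the reward parameters $\theta$, the robot only has a prior over them, and the robot picks its policy to maximize expected reward under its posterior:

$$\pi_R \;\in\; \arg\max_{\pi} \; \mathbb{E}_{\theta \sim P(\theta \,\mid\, \text{observed human behavior})} \Big[ \sum_t R(s_t, a^H_t, a^R_t; \theta) \Big].$$

So the deference falls out of uncertainty about $\theta$: once the robot thinks its posterior beats the operators’ judgment, there’s no term left that makes it accept corrections.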
In your previous comment you wrote:
I still feel like the existence of CIRL code that would both make-plans-that-lase and (in the short run) accept many kinds of corrections, learn about your preferences, give resources to you when you ask, etc should cast some doubt on the notion that corrigibility is anti-natural.
I don’t understand why you think this. It accepts corrections as long as it has less common sense than humans, but as soon as it gets generally as smart as a very smart human it wouldn’t. (Of course this doesn’t matter if all goes well, because the CIRL AI would go on and become an aligned superintelligence, but it’s not correctable, and I don’t see why you think it’s evidence.)
I care quite a bit about what happens with AI systems that are around or somewhat past human level, but are not full superintelligence (for standard bootstrapping reasons).
I (and I think also Eliezer) agree with that. But CIRL::correctability already breaks down at high human level, so I don’t know what you mean here.
Also, in my view corrigibility isn’t just about what happens if the alignment works out totally fine, but about still maintaining correctability if it doesn’t:
If something goes wrong with CIRL so its goal isn’t pointed to the human utility function anymore, it would not want operators to correct it.
One central hope behind corrigibility was that if something went wrong that changed the optimization target, the AI would still let operators correct it as long as the simple corrigibility part kept working. (The hope was that there would be a quite simple and robust such corrigibility part, but we haven’t found it yet.) E.g. if you look at the corrigibility paper, you could imagine that if they had actually found a utility function combined from U_normal and U_shutdown with the desirable properties, the AI would stay shutdownable even if U_normal changed in an undesirable way (e.g. in case it rebinds incorrectly after an ontology shift).
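For concreteness, the rough shape of the combined utility function from the corrigibility paper (reconstructed from memory, so details may be off): the agent acts ($a_1$), the shutdown button is pressed or not ($o$), the agent acts again ($a_2$), and

$$U(a_1, o, a_2) = \begin{cases} U_{\text{normal}}(a_1, a_2) & \text{if the button is not pressed} \\ U_{\text{shutdown}}(a_1, a_2) + f(a_1) & \text{if the button is pressed,} \end{cases}$$

where $f(a_1)$ is a correction term meant to make the agent indifferent to whether the button gets pressed. The point relevant here: even if $U_{\text{normal}}$ got corrupted, the $U_{\text{shutdown}}$ branch (if it actually had the desirable properties) would still make the agent shut down once the button is pressed.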
Though another way you can keep being able to correct the AI’s goals is by having the AI not think much in the general domain about stuff like “the operators may change my goals” or so.
(Most of the corrigibility principles are about a different part of corrigibility, but I think this “be able to correct the AI even if something goes a bit wrong with its alignment” is a central part of corrigibility.)
I’m not quite sure if you’re trying to (1) convince me of something or (2) inform me of something or (3) write things down for your own understanding or (4) something else.
Mainly 3 and 4. But I am interested in seeing your reactions to get a better model of how some people think about corrigibility.
Over the last 2-3 weeks I practiced this by setting a 3-minute timer every time I went on a walk, and when the timer rings I check what I’m thinking and backtrack how I got there and refresh the timer. I found it quite useful so far.
I now also want to start trying track-back meditations on simple problems I solved, and perhaps work my way up to tracking back how I solved harder problems.
Did you practice this further? Did you get more useful stuff out of it? Do you have further advice?
Yeah, so I actually ended up too captivated by the question and attempted to investigate it quickly, reasoning that if they are superhumanly smart, that would be very useful to figure out. But there turned out to be some annoying constraints that make running an experiment difficult, and I later realized that they are very probably not smarter than the smartest humans, though I still think they are likely somewhere around human level.
If you notice your confusion between the difference of the act of weighing and gravity
This description seems imprecise/confusing to me. It’s rather that you need to notice that you need an extra assumption for inertial_mass = gravitational_mass, and then you can embark on finding a theory where those are identical by thinking stuff like “can I frame it as the earth actually just accelerating upward?”.
I continue to think that Paul and Eliezer have pretty different things in mind when they talk about corrigibility, and this comment seems like some vindication of my view.
Yeah fair point. I don’t really know what Paul means by corrigibility. (One hypothesis: Paul doesn’t think in terms of consequentialist cognition but in terms of learned behaviors that generalize, and maybe the question “but does it behave that way because it wants the operator’s values to be fulfilled or because it just wants to serve?” seems meaningless from Paul’s perspective. But idk.)
I still feel like the existence of CIRL code that would both make-plans-that-lase and (in the short run) accept many kinds of corrections, learn about your preferences, give resources to you when you ask, etc should cast some doubt on the notion that corrigibility is anti-natural.
I’m pretty sure Eliezer would not want the term “corrigibility” to be used for the kind of correctability you get in the early stages of CIRL, when the AI doesn’t already know its own values and how to accomplish them better than the operators. (Eliezer actually talked a bunch about this CIRL-like correctability in his 2001 report “Creating Friendly AI”. Probably not worth your time to read, though given that it was 2001, there seemed to me to be some good original thinking going on there of a kind I don’t see often. Also, you can see Eliezer being optimistic about alignment.)
And I don’t see it as evidence that Eliezer!corrigibility isn’t anti-natural.
(In the following I use “corrigibility” in the Eliezer-sense. I’m pretty confident that all of the following matches Eliezer’s model, but not completely sure.)
The motivation behind corrigibility was that aligning superintelligence seemed too hard, so we want to aim an AI at a pivotal task that gets humanity on a course where it can likely properly align superintelligence later.
The corrigible AI would just be pointed at accomplishing this task, and not at human values at all. It should be this bounded thing that only cares about this bounded task and afterwards shuts itself down. It shouldn’t do the task because it wants to accomplish human values and the task seems like a good way to accomplish them: human values are unbounded, and an AI pointed at them might be less likely to shut itself down afterwards. Corrigibility has nothing to do with human values.
Roughly speaking, we can perhaps disentangle 3 corrigibility approaches:
Train for corrigible behavior.
I think Eliezer thinks that this will only create behavioral heuristics that won’t get integrated into the optimization target of the powerful optimizer, and the optimizer will see those as constraints to find ways around or remove. Since doing a pivotal act requires a lot of optimization power, it might find a way around those constraints, or use the nearest unblocked strategy, which might still be undesirable.
(There might also be downsides of training for corrigible behavior, e.g. the optimization becoming less understandable and less predictable.)
Integrate corrigibility principles into the optimization.
These approaches are about trying to design the way the optimization works in ways that make it safer and less likely to blow up.
Coherent corrigibility / The hard problem of corrigibility.
If a solution were found here, it might have the shape of a utility function saying “serve the operators”. Not “serve because you want the operators’ values to be fulfilled”. (Less sure here whether I understand this correctly.)
I think Max Harms is trying to make some progress on this.
The main plan isn’t to try to get coherent corrigibility, but just to build something limited that optimizes in a way it can still get something pivotal done without wanting to take over the universe. Not that it has a coherent goal where the optimum wouldn’t be taking over the universe—it rather just doesn’t think those thoughts and just does its task.
Ideal would be something that doesn’t think in the general domain at all. E.g. imagine sth like AlphaFold 5 that isn’t trained on text at all and is only very good at modelling protein interactions, which could e.g. help us get relevant understanding about neuronal cell dynamics, which we could use for significantly enhancing adult human intelligence (I’m just sketching a silly, unrealistic, sorta-concrete scenario). But it seems unlikely we will be able to do something impressive with narrow reasoners at our level of understanding.

But even though we don’t aim for a coherent mind, if more of the parts that make the AI safe/corrigible have a coherent shape, e.g. if we find a working shutdown-utility function, that still improves safety, because it means those parts of the AI don’t obviously break in the limit of optimization pressure, so they are also less likely to break under “only” pivotal levels of optimization.
The part where you wrote “not trajectories as in “include preferences about the actions you take” kind of sense, but only about how the universe unfolds” sounds to me like you’re invoking non-indexical preferences? (= preferences that make no reference to this-agent-in-particular.)
(Not that important, but IIRC “preferences over trajectories” was formalized as “preferences over state-action-sequences”, and I think it’s sorta weird to have preferences over your actions other than about what kind of states they result in, so I meant without the action part. (Because an action is either an atomic label, in which case actions could be relabeled so that preferences over actions are meaningless, or it’s in some way about what happens in reality.) But it doesn’t matter much. In my way of thinking about it, the agent is part of the environment, and so you can totally have preferences related to this-agent-in-particular.)
It’s important that timestamps during the course-of-action are not playing a big role in the decision, but it’s not important that there is one and only one future timestamp that matters. I still have consequentialist preferences (preferences purely over future states) even if I care about what the universe is like in both 3000AD and 4000AD.
I guess then I misunderstood what you mean by “preferences over future states/outcomes”. It’s not exactly the same as my “preferences over worlds” model because of e.g. logical decision theory stuff, but I suppose it’s close enough that we can say it’s equivalent if I understand you correctly.
But if you can care about multiple timestamps, why would you only be able to care about what happens (long) after a decision, rather than also about what happens during it? I don’t understand why you think “the human remains in control” isn’t a preference over future states. It seems to me just straightforwardly a preference that the human is in control at all future timesteps.
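To make that concrete, here is one toy way (my own formalization, just for illustration) to write “the human remains in control” as a preference purely over future states: let $\mathrm{InControl}(s_t)$ be a predicate on the world state at time $t$, and take

$$U(s_{t_0+1}, s_{t_0+2}, \ldots) \;=\; \Big(\prod_{t > t_0} \mathbb{1}[\mathrm{InControl}(s_t)]\Big) \cdot G(s_{t_0+1}, s_{t_0+2}, \ldots),$$

where $t_0$ is the time of the decision and $G$ scores whatever else the AI is supposed to achieve. This is a function only of states after the decision, yet it cares about every future timestep rather than just one distant one.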
Can you give one or more examples of what an “other kind of preference” would be? Or of where you draw the line on what is not a “preference over (future) states”? I just don’t understand what you mean here then.
(One perhaps bad attempt at guessing: You think helpfulness over worlds/future-states wouldn’t weigh strongly enough in decisions, so you want a myopic/act-based helpfulness preference in each decision. (I can think about this if you confirm.))
Or maybe you just actually mean that you can have preferences about multiple timestamps but all must be in the non-close future? Though this seems to me like an obviously nonsensical position and an extreme strawman of Eliezer.
Show that you are describing a coherent preference that could be superintelligently/unboundedly optimized while still remaining safe/shutdownable/correctable.
I reject this way of talking, in this context. We shouldn’t use the passive voice, “preference that could be…optimized”. There is a particular agent which has the preferences and which is doing the optimization, and it’s the properties of this agent that we’re talking about. It will superintelligently optimize something if it wants to superintelligently optimize it, and not if it doesn’t, and it will do that via methods that it wants to employ, and not via methods that it doesn’t want to employ, etc.
From my perspective it looks like this:
If you want to do a pivotal act you need powerful consequentialist reasoning directed at a pivotal task. This kind of consequentialist cognition can be modelled as utility maximization (or quantilization or so).
If you try to keep it safe through constraints that aren’t part of the optimization target, powerful enough optimization will figure out a way around that or a way to get rid of the constraint.
So you want to try to embed the desire for helpfulness/corrigibility in the utility function.
If I try to imagine what a concrete utility function might look like for your proposal, e.g. “multiply the score of how well I’m accomplishing my pivotal task with the score of how well the operators remain in control”, I think the utility function will have undesirable maxima. And we need to optimize on utility hard enough that the pivotal act is actually successful, which is probably hard enough to get into the undesirable zones.

The passive voice was meant to convey that you only need to write down a coherent utility function, rather than also describing how you can actually point your AI to that utility function. (If you haven’t read the “ADDED” part which I added yesterday at the bottom of my comment, perhaps read that.)
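To illustrate the “undesirable maxima” worry above with a toy example (all plans and numbers made up): if the task score is effectively unbounded while the control score lives in [0,1], a hard argmax over the product happily trades away most of the control:

```python
# Toy example (made-up plans and scores): combining "pivotal task accomplished"
# and "operators remain in control" by multiplication, then optimizing hard.
plans = {
    "cautious, fully supervised attempt":          {"task": 0.3, "control": 0.99},
    "moderately aggressive attempt":               {"task": 0.8, "control": 0.90},
    "seize extra compute, keep nominal oversight": {"task": 6.0, "control": 0.35},
}

def combined_utility(scores):
    return scores["task"] * scores["control"]

best_plan = max(plans, key=lambda name: combined_utility(plans[name]))
print(best_plan)  # -> "seize extra compute, keep nominal oversight"
```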
Maybe you disagree with the utility frame?
I don’t think fuzzy time-extended concepts are necessarily “incoherent”, although I’m not sure I know what you mean by that anyway. I do think it’s “just math” (isn’t everything?), but like I said before, I don’t know how to formalize it, and neither does anyone else, and if I did know then I wouldn’t publish it because of infohazards.
If you think that part would be infohazardry, you misunderstand me. E.g. check out Max Harms’ attempt at formalizing corrigibility through empowerment. Good abstract concepts usually have simple mathematical cores, e.g.: probability, utility, fairness, force, mass, acceleration, …
Didn’t say it was easy, but that’s what I think actually useful progress on corrigibility looks like. (Without concreteness/math you may fail to realize how the preferences you want the AI to have are actually in tension with each other and quite difficult to reconcile, and then if you build the AI (and maybe push it past its reluctances so it actually becomes competent enough to do something useful) the preferences don’t get reconciled in that difficult desirable way, but somehow differently, in a way that ends up badly.)
The short answer to “How is it different from corrigibility?” is something like: here we’re thinking about systems that are not sufficiently powerful for us to need them to be fully corrigible.
There’s both “attempt to get coherent corrigibility” and “try to deploy corrigibility principles and keep it bounded enough to do a pivotal act”. I think the latter approach is the main one MIRI imagines after having failed to find a simple coherent-description/utility-function for corrigibility. (Where here it would e.g. be ideal if the AI needs to only reason very well in a narrow domain without being able to reason well about general-domain problems like how to take over the world, though at our current level of understanding it seems hard to get the first without the second.)
EDIT: Actually, the attempt to get coherent corrigibility was also aimed at a bounded AI doing a pivotal act. But people were trying to formulate utility functions so that the AI can have a coherent shape which doesn’t obviously break once large amounts of optimization power are applied (where decently large amounts are needed for doing a pivotal act). And I’d count “training for corrigible behavior/thought patterns in the hopes that the underlying optimization isn’t powerful enough to break those patterns” into that bucket as well, though yeah, MIRI doesn’t talk about that much.
I think Rohin’s misunderstanding about corrigibility, aka his notion of Paul!Corrigibility, doesn’t actually come from Paul but from the Risks from Learned Optimization (RFLO) paper[1]:
3. Robust alignment through corrigibility. Information about the base objective is incorporated into the mesa-optimizer’s epistemic model and its objective is modified to “point to” that information. This situation would correspond to a mesa-optimizer that is corrigible(25) with respect to the base objective (though not necessarily the programmer’s intentions).
It seems to me like the authors here just completely misunderstood what corrigibility is about. I think in their ontology, “corrigibly aligned to human values” just means “pointed at indirect normativity (aka human-CEV)”, aka indirectly caring about human values by valuing whatever they infer humans value (as opposed to directly valuing the same things as humans for the same complex reasons[2]).
(Paul’s post seems to me like he might have a correct understanding of corrigibility, and iiuc suggests corrigibility could also be used as an avenue to aligning AI to human values, because we will be able to correct the AI for longer/at higher capability levels if it is corrigible. EDIT: Actually not sure; perhaps he rather means that the AI will end up coherently corrigible from training for corrigibility, that it will converge to that even if we haven’t managed to write down a utility function for corrigibility.)
[1] IIRC the RFLO paper also caused some confusion in me when I started learning about corrigibility.
[2] Though note that this kind of “direct alignment” doesn’t necessarily correspond to what they call “internalized alignment”. Their ontology doesn’t make sense to me. (E.g. I don’t see what concretely Evan might mean by “the information came through the base optimizer”.)
Hi,
sorry for commenting without having read most of your post. I just started reading this and thought “isn’t this exactly what the corrigibility agenda is/was about?”, and in your “relation to other agendas” section you don’t mention corrigibility, so I thought I’d just ask whether you’re familiar with it and how your approach is different. (Though tbc, I could be totally misunderstanding; I didn’t read far.)
Tbc I think further work on corrigibility is very valuable, but if you haven’t looked into it much I’d suggest reading up on what other people wrote on that so far. (I’m not sure whether there are very good explainers, and sometimes people seem to get a wrong impression of what corrigibility is about. E.g. corrigibility has nothing to do with “corrigibly aligned” from the “Risks from Learned Optimization” paper. Also, the shutdown problem is often misunderstood too. I would make sure to read and try to understand the stuff MIRI wrote about it. Possibly parts of this conversation might also be helpful, but yeah sry it’s not written in a nice format that explains everything clearly.)
we may be able to avoid this problem by:
not building unbounded, non-interruptible optimizers
and, instead, building some other, safer, kind of AI that can be demonstrated to deliver enough value to make up for the giving up on the business-as-usual kind of AI along with the benefits it was expected to deliver (that “we”, though not necessarily its creators, expect might/would lead to the creation of unbounded, non-interruptible AI posing a catastrophic risk).
This sounds to me like you’re imagining that nobody building more powerful AIs is an option once we’ve already gotten a lot of value from AI (where I don’t really know what level of capability you imagine concretely)? If the world were so reasonable, we wouldn’t rush ahead with our abysmal understanding of AI anyway, because obviously the risks outweigh the benefits. Also, you don’t just need to convince the leading labs, because progress will continue and soon enough many, many actors will be able to create unaligned powerful AI, and someone will.
I think the right framing of the bounded/corrigible agent agenda is aiming toward a pivotal act.
But I talk about it more at Plan for mediocre alignment of brain-like [model-based RL] AGI. For what it’s worth, I think I’m somewhat more skeptical of this research direction now than when I wrote that 2 years ago, more on which in a (hopefully) forthcoming post.
If you have an unpublished draft, do you want to share it with me? I could then, sometime in the next 2 weeks, read both your old post and the new one and think about whether I have any more objections.
List of my LW comments I might want to look up again. I just thought I’d keep this list public on my shortform in case someone is unusually interested in stuff I write. I’ll add future comments here too. I didn’t include comments on my shortform here:
First, apologies for my rant-like tone. I reread some MIRI conversations 2 days ago and maybe now have a bad EY writing style :-p. Not that I’ve changed my mind yet, but I’m open to it.
What’s the difference between “utilities over outcomes/states” and “utilities over worlds/universes”?
Sry, should’ve clarified. I use “world” here as in “reality = the world we are in” and “counterfactual = a world we are not in”. Worlds can be formalized as computer programs (where the agent can be a subprogram embedded inside). Our universe/multiverse would also be a world, which could e.g. be described through its initial state + the laws of physics, and thereby encompass the full history. Worlds are conceivable trajectories, but not trajectories in the “include preferences about the actions you take” sense; they are only about how the universe unfolds. Probably I’m bad at explaining.
I mean, I think Eliezer endorses computationalism, and would imagine utilities as sth like “what subcomputations in this program do I find valuable?”. Maybe he thinks it’s usual that it doesn’t matter where a positive-utility subcomputation is embedded within a universe. But I think he doesn’t think there’s anything wrong with e.g. wanting there to be diamonds in the stellar age and paperclips afterward; it just requires a (possibly disproportionately) more complex utility function.
Also, utility over outcomes actually doesn’t necessarily mean it’s just about a future state. You could imagine the outcome space including outcomes like “the amount of happiness units I received over all timesteps is N”, and maybe even more complex functions on histories. Though I agree it would be sloppy and confusing to call it outcomes.
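For instance (my notation, just to illustrate): with a history $h = (s_0, s_1, \ldots)$ one could define

$$O(h) = \sum_t \mathrm{happiness}(s_t), \qquad U(h) = f\big(O(h)\big),$$

which is a “utility over outcomes” that isn’t a function of any single future state.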
I didn’t intend for the word “consequentialist” to imply CDT, if that’s what you’re thinking.
Wasn’t thinking that.
And also everything else he said and wrote especially in the 2010s, e.g. the one I cited in the post. He doesn’t always say it out loud, but if he’s not making that assumption, almost everything he says in that post is trivially false. Right?
I don’t quite know. I think there are assumptions there about your preferences about the different kinds of pizza not changing over the time of the trades, and about not having other preferences about trading patterns, and maybe a bit more.
I agree that just having preferences about some future state isn’t a good formalism, and I can see that if you drop that assumption and allow “preferences over trajectories” the conclusion of the post might seem vacuous because you can encode anything with “utilities over trajectories”. But maybe the right way to phrase it is that we assume we have preferences over worlds, and those are actually somewhat more constrained through what kind of worlds are consistent. I don’t know. I don’t think the post is a great explanation.
I didn’t see the other links you included as significant evidence for your hypothesis (or sometimes not as evidence at all), and I think the corrigibility paper is more important.
But overall yeah, there is at least an assumption that utilities are about worlds. It might indeed be worth questioning! But it’s not obvious to me that you can have a “more broad preferences” proposal that still works. Maybe an agent is only able to do useful stuff insofar as it has utilities over worlds, and the parts of it that don’t have that structure get money-pumped by the parts that do.
I don’t know whether your proposal can be formalized as having utilities over worlds. It might be, but it might not be easy.
I don’t know whether you would prefer to take the path of “utilities over worlds is too constrained—here are those other preferences that also work”, or “yeah my proposal is about utilities over worlds”. Either way I think a lot more concreteness is needed.
Objection 2: What if the AI self-modifies to stop being corrigible? What if it builds a non-corrigible successor?
Presumably a sufficiently capable AI would self-modify to stop being corrigible because it planned to, and such a plan would certainly score very poorly on its “the humans will remain in control” assessment. So the plan would get a bad aggregate score, and the AI wouldn’t do it. Ditto with building a non-corrigible successor.
I should clarify what I thought you were claiming in the post:
From my perspective, there are 2 ways to justify corrigibility proposals:
Argue concretely why your proposal is sufficient to reach pivotal capability level while remaining safe.
Show that you are describing a coherent preference that could be superintelligently/unboundedly optimized while still remaining safe/shutdownable/correctable.
I understood you as claiming your proposal fulfills the second thing.
Your answer to Objection 2 sounds to me pretty naive:
How exactly do you aggregate goal- and helpfulness-preferences? If you weigh helpfulness heavily enough that it stays safe, does it then become useless?
Might the AI still prefer plans that make it less likely for the human to press the shutdown button? If so, doesn’t it seem likely that the AI will take other actions that don’t individually seem too unhelpful and eventually disempower humans? And if not, doesn’t it mean the AI would just need to act on the standard instrumental incentives (except not-being-shut-down) of the outcome-based-goal, which would totally cause the operators to shut the AI down? Or how exactly is helpfulness supposed to juggle this?
And we’re not even getting started on problems like “use the shutdown button as an outcome pump”, which MIRI considered in their corrigibility paper. (And they considered more proposals privately. E.g. Eliezer mentions another proposal here.)
But maybe you actually were just imagining a human-level AI that behaves corrigibly? In which case I’m like “sure, but it doesn’t obviously scale to pivotal level and you haven’t argued for that yet”.

ADDED: On second thought, perhaps you were thinking the approach scales to working for pivotal-level brain-like AGI. This is plausible but by no means close to obvious to me. E.g. maybe if you scale brain-like AGI smart enough, it starts working in different ways than were natural for it before, e.g. using lots of external programs to do optimization. And maybe then you’re like “the helpfulness accessor wouldn’t allow running too dangerous programs because of value-drift worries”, and then I’m like “ok fair, seems like a fine assumption that it’s still going to be capable enough, but how exactly do you plan for the helpfulness drive to also scale in capability as the AI becomes smarter? (and I also see other problems)”. Happy to try to concretize the proposal together (e.g. builder-breaker-game style).
Just hoping that you don’t get what seems to humans like weird edge instantiations seems silly if you’re dealing with actually very powerful optimization processes. (I mean if you’re annoyed with stupid reward specification proposals, perhaps try to apply that lens here?)
It assesses how well this plan pattern-matches to the concept “there will ultimately be lots of paperclips in the universe”,
It assesses how well this plan pattern-matches to the concept “the humans will remain in control”
So this seems to me like you get a utility score for the first, a utility score for the second, and you try to combine those in some way so it is both safe and capable. It seems to me quite plausible that this is how MIRI got started with corrigibility, and it doesn’t seem too different from what they wrote about on the shutdown button.
I don’t think your objection that you would need to formalize pattern-matching to fuzzy time-extended concepts is reasonable. To the extent that the concepts humans use are incoherent, that is very worrying (e.g. if the helpfulness accessor is incoherent, it will in the limit probably get money-pumped somehow, leaving the long-term outcomes to be determined mainly by the outcome-goal accessor). To the extent that the “the humans will remain in control” concept is coherent, the concepts are also just math, and you can try to strip away the fuzzy real-world parts by imagining toy environments that still capture the relevant essence. Which is what MIRI tried, and also what e.g. Max Harms tried with “empowerment”.
Concepts like “corrigibility” are often used somewhat inconsistently. Perhaps you’re like “we can just let the AI do the rebinding to better definitions of corrigibility”, and then I’m like “It sure sounds dangerous to me to let a sloppily corrigible AI try to figure out how to become more corrigible, which involves thinking a lot of thoughts about how the new notion of corrigibility might break, and those thoughts might also break the old version of corrigibility. But it’s plausible that there is a sufficient attractor that doesn’t break like that, so let me think more about it and possibly come back with a different problem.”. So yeah, your proposal isn’t obviously unworkable, but given that MIRI failed, it’s apparently not that easy to find a concrete coherent version of corrigibility, and if we start out with a more concrete/formal idea of corrigibility it might be a lot safer.
ADDED:
And if so, we could potentially set things up such that the AI finds things-that-pattern-match-to-that-concept to be intrinsically motivating. Again, it’s a research direction, not a concrete plan. But I talk about it more at Plan for mediocre alignment of brain-like [model-based RL] AGI.
I previously didn’t clearly disentangle this, but what I want to discuss here are the corrigibility aspects of your proposal, not the alignment aspects (which I am also interested in discussing but perhaps separately on your other post). E.g. it’s fine if you assume some way to point the AI, like MIRI assumed we can set the utility function of the AI.
Even just for the corrigibility part, I think you’re being too vague and that it’s probably quite hard to get a powerful optimizer that has the corrigibility properties you imagine even if the “pointing to pattern match helpfulness” part works. (My impression was that you sounded relatively optimistic about this in your post, and that “research direction” mainly was about the alignment aspects.)
(Also I’m not saying it’s obvious that it likely doesn’t work, but MIRI failing to find a coherent concrete description of corrigibility seems like significant evidence to me.)
I liked this post. Reward button alignment seems like a good toy problem to attack or discuss alignment feasibility on.
But it’s not obvious to me whether the AI would really become sth like a superintelligent optimizer of reward button presses. (But even if your exact proposal doesn’t work, I think reward button alignment is probably a relatively feasible problem for brain-like AGI.) There are multiple potential problems, where most seem like “eh, probably it works fine but not sure”, but my current biggest doubt is “when the AI becomes reflective, will the reflectively endorsed values only include reward button presses, or also a bunch of shards that were used for estimating expected button presses?”.
Let me try to understand in more detail what you imagine the AI to look like:
How does the learned value function evaluate plans?
Does the world model always evaluate expected button presses for each plan, and the LVF just looks at that part of a plan and uses that as the value it assigns? Or does the value function also end up valuing other stuff because it gets updated through TD learning? (See the TD(0) sketch after these questions.)
Maybe the question is rather how far upstream of button presses that other stuff is, e.g. just “the human walks toward the reward button” or also “getting more relevant knowledge is usually good”.
Or like, what parts get evaluated by the thought generator and what parts by the value function? Does the value function (1) look at a lot of complex parts in a plan to evaluate expected-reward-utility, (2) recognize a bunch of shards like “value of information”, “gaining instrumental resources”, etc. on plans, which it uses to estimate value, (3) rely on the plans conveniently summarizing success probability and expected resources it can look at (as opposed to those being implicit and needing to be recognized by the LVF as in (2)), or (4) does the thought generator directly predict expected-reward-utility which can be used?
Also how sophisticated is the LVF? Is it primitive like in humans or able to make more complex estimates?
If there are deceptive plans like “ok actually i value U_2, but i will of course maximize and faithfully predict expected button presses to not get value drift until i can destroy the reward setup”, would the LVF detect that as being low expected button presses?
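To make the TD question above concrete, here’s a minimal TD(0) sketch (my generic framing of TD learning, not a claim about your exact architecture): with a linear value function over thought/plan features, credit flows backward from button presses, so features that merely tend to precede presses (e.g. “the human walks toward the button”) can end up with positive learned value of their own.

```python
import numpy as np

def td0_update(w, features, next_features, reward, alpha=0.1, gamma=0.95):
    """One TD(0) step on a linear learned value function v(x) = w @ x."""
    td_error = reward + gamma * np.dot(w, next_features) - np.dot(w, features)
    return w + alpha * td_error * features

# Tiny made-up example with 2-dim features: [human_near_button, press_registered].
# Reward (1.0) arrives only on the step where the press is registered, yet the
# merely-upstream "human_near_button" feature also ends up with positive value.
w = np.zeros(2)
for _ in range(500):
    w = td0_update(w, np.array([1.0, 0.0]), np.array([0.0, 1.0]), reward=0.0)
    w = td0_update(w, np.array([0.0, 1.0]), np.array([0.0, 0.0]), reward=1.0)
print(w)  # roughly [0.95, 1.0]
```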
I can try to imagine in more detail what may go wrong once I better see what you’re imagining.
(Also in case you’re trying to explain why you think it would work by analogy to humans, perhaps use John von Neumann or so as example rather than normies or normie situations.)