Towards_Keeperhood

Karma: 1,002

Simon Skade

I did (mostly non-prosaic) alignment research between Feb 2022 and Aug 2025. (Won $10k in the ELK contest, participated in MLAB and SERI MATS 3.0 & 3.1, then independent research. I mostly worked on an ambitious attempt to better understand minds to figure out how to create more understandable and pointable AIs. I started with agent foundations but then developed a more sciency agenda where I also studied concrete observations from language/linguistics, psychology, neuroscience (though I haven’t studied much there yet), and from tracking my thoughts on problems I solved (aka a good kind of introspection).)

I’m now exploring advocacy for making it more likely that we get sth like the MIRI treaty (ideally with a good exit plan like human intelligence augmentation, or possibly an alignment project with actually competent leadership).

Currently based in Germany.

Towards_Keeperhood 12 Jan 2026 22:38 UTC
3 points
0
in reply to: Kaarel’s comment on: The Plan − 2025 Update
wentworth+lorell’s work is interesting, but so much more has been understood about concepts in even other existing literature than in wentworth+lorell’s work (i’d probably say there are at least 10000 contributions to our understanding of concepts in at least the same tier), with imo most of the work remaining!
Btw if you mean there are 10k contributions already that are on the level of John’s contributions, I strongly disagree with this. I’m not sure whether John’s math is significantly useful, and I don’t think it’s been that much progress relative to “almost on track to maybe solve alignment”, but in terms of (alignment) philosophy John’s work is pretty great compared to academic philosophy.
In terms of general alignment philosophy (not just work on concepts but also other insights), I’d probably put John’s collective works in 3rd place after Eliezer Yudkowsky and Douglas Hofstadter. The latter is on the list mainly because of Surfaces and Essences, which I can recommend (@johnswentworth).
Aka I’d probably put John above people like Wittgenstein, although I admit I don’t know that much about the works of philosophers like Wittgenstein. Could be that there are more insights in the collective works of Wittgenstein, but if I’d need to read through 20x the volume because he doesn’t write clearly enough that’s still a point against him. Even if a lot of John’s insights have been said before somewhere, writing insights clearly provides a lot of value.
Although John’s work on concepts plays a smaller role in what I think makes John a good alignment philosopher than it does in his alignment research. Partially I think John just has some great individual posts like science in a high dimensional world, you’re not measuring what you think you’re measuring, why agent foundations (coining the word true names), and probably a couple more lesser-known older ones that I haven’t processed fully yet. And generally his philosophy that you need to figure out the right ontology is good.
But also tbc, this is just alignment philosophy. In terms of alignment research, he’s a bit further down my list, e.g. also behind Paul Christiano and Steven Byrnes.

Towards_Keeperhood 3 Jan 2026 9:04 UTC
1 point
0
in reply to: Seth Herd’s comment on: [Advanced Intro to AI Alignment] 2. What Values May an AI Learn? — 4 Key Problems
Relatedly, you implicitly equate alignment with value alignment.
No, the first 3 difficulties I explain were mainly written with sth like helpfulness/instruction-following/DWIM in mind. I think corrigibility would be an even better target for RL based AI, although I didn’t want to need to explain it in this post. I wrote:
Maybe “do what the human wants” seems simple to you? But what does this actually mean on a level that’s a bit closer to math—how might a critic evaluating this look like?
The way I think of it, “what the human wants” refers to what the human would like if they knew all the consequences of the AI’s actions. The model will surely be able to make good predictions here, but the concept seems more complex than predicting whether the human will like some text. And predicting whether the human will like some text predicts reward even better!
Maybe “follow instructions as intended” seems simple to you? Try to unpack it—how could the critic be constructed to evaluate how instruction-following a plan is, and how complex is this?
Only the last problem was specifically about value alignment, because it looks like something like CEV might be needed for an AI whose intelligence can increase arbitrarily. Or at least it’s unclear helpfulness/instruction-following would generalize if you crank up intelligence very high.
I totally agree that we currently shouldn’t aim for CEV.

Towards_Keeperhood 3 Jan 2026 8:42 UTC
6 points
0
in reply to: habryka’s comment on: [Advanced Intro to AI Alignment] 2. What Values May an AI Learn? — 4 Key Problems
There is another LW wikitag here, which includes:
Secondly, the possibility that human values may not converge. Yudkowsky considered CEV obsolete almost immediately after its publication in 2004. He states that there’s a “principled distinction between discussing CEV as an initial dynamic of Friendliness, and discussing CEV as a Nice Place to Live” and his essay was essentially conflating the two definitions.
But I totally agree that CEV is a useful concept to have. Also Yudkowsky’s later writing (like the Arbital post presumably around 2016) should trump his earlier take in 2004. Or maybe the meaning of CEV shifted a bit over the years from sth more specific to a very indirect pointer. Idk, I don’t remember the original CEV paper well.

Towards_Keeperhood 1 Jan 2026 23:32 UTC
13 points
2
on: The Plan − 2025 Update
I’d be curious about how your timelines updated. Last year you wrote:
Over the past year, my timelines have become even more bimodal than they already were. The key question is whether o1/o3-style models achieve criticality (i.e. are able to autonomously self-improve in non-narrow ways), including possibly under the next generation of base model. My median guess is that they won’t and that the excitement about them is very overblown. But I’m not very confident in that guess.
If the excitement is overblown, then we’re most likely still about 1 transformers-level paradigm shift away from AGI capable of criticality, and timelines of ~10 years seem reasonable. Conditional on that world, I also think we’re likely to see another AI winter in the next year or so.
If the excitement is not overblown, then we’re probably looking at more like 2-3 years to criticality. In that case, any happy path probably requires outsourcing a lot of alignment research to AI, and then the main bottleneck is probably our own understanding of how to align much-smarter-than-human AGI.
To me it seems plausible that we’re in some intermediate world where progress continues but we still have like 5 years to criticality.

Towards_Keeperhood 1 Jan 2026 23:29 UTC
11 points
4
on: The Plan − 2025 Update
Thanks for your yearly update!
On the plan:
- What is The Plan for AI alignment? Briefly: Sort out our fundamental confusions about agency and abstraction enough to do interpretability that works and generalizes robustly. Then, look through our AI’s internal concepts for a good alignment target, and Retarget the Search.
I think this won’t work because many human-value-laden concepts aren’t very natural for an AI. More specifically, in the 2023 version of the plan you wrote:
The standard alignment by default story goes:
- Suppose the natural abstraction hypothesis^[2] is basically correct, i.e. a wide variety of minds trained/evolved in the same environment converge to use basically-the-same internal concepts.
- … Then it’s pretty likely that neural nets end up with basically-human-like internal concepts corresponding to whatever stuff humans want.
- … So in principle, it shouldn’t take that many bits-of-optimization to get nets to optimize for whatever stuff humans want.
- … Therefore if we just kinda throw reward at nets in the obvious ways (e.g. finetuning/RLHF), and iterate on problems for a while, maybe that just works?
In the linked post, I gave that roughly a 10% chance of working. I expect the natural abstraction part to basically work, the problem is [...]
I think the natural abstraction part here does not work—not because natural abstractions aren’t a thing—but because there’s an exception for abstractions that are dependent on the particular mind architecture an agent has.
Concepts like “love”, “humor”, and probably “consciousness” may be natural for humans but probably less natural for AIs.
But also we cannot just wire up those concepts into the values of an AI and expect the AI’s values to generalize correctly. The way our values generalize—how we will decide what to value as we grow smarter and do philosophical reflection—seems quite contingent on our mind architecture. Unless we have an AI that shares our mind architecture (like in Steven Byrnes’ agenda), we’d need to point the AI to an indirect specification of what we value, aka CEV. And CEV doesn’t seem like a simple natural abstraction that an AI would learn without us teaching it about CEV. And even if it knows CEV because we taught it, I find it hard to imagine how we would point the search process to it (even assuming we have a retargetable general purpose search).
Also see here and here. But mainly I think you need to think a lot more concretely about what goal we actually want to point the AI at.
Although I agree with this:
Generally, we aim to work on things which are robust bottlenecks to a broad space of plans. In particular, our research mostly focuses on natural abstraction, because that seems like the most robust bottleneck on which (not-otherwise-doomed) plans get stuck.
However, it does not look to me like you are making much progress relative to your stated beliefs of how close you are. Aka relative to (from your 2024 update where this statement sounded like it’s based on 10ish year timelines):
Earlier this year, David and I estimated that we’d need roughly a 3-4x productivity multiplier to feel like we were basically on track.
So here are some thoughts on how your progress looks to me, although I’ve not been following your research in detail anymore since summer 2024 (after your early natural latents posts):
Basically, it seems to me like you’re making the mistake of Aristotelians that Francis Bacon points out in the Baconian Method (or Novum Organum generally):
the intellect mustn’t be allowed •to jump—to fly—from particulars a long way up to axioms that are of almost the highest generality… Our only hope for good results in the sciences is for us to proceed thus: using a valid ladder, we move up gradually—not in leaps and bounds—from particulars to lower axioms, then to middle axioms, then up and up...
Aka, you look at a few examples, and directly try to find a general theory of abstraction. I think this makes your theory overly simplistic and probably basically useless.
Like, when I read Natural Latents: The Concepts, I already had a feeling of the post trying to explain too much at once: it lumps together things as natural latents that seem very importantly different, and in some cases natural latents seemed like a dubious fit. I started to form an intuitive distinction in my mind between objects (like a particular rigid body) and concepts (like clusters in thingspace, e.g. “tree” as opposed to a particular tree), although I couldn’t explain it well at the time. Later I studied a bit of the formal semantics of language, and there the distinction is just total 101 basics.
I studied language a bit and tried to carve up in a bit more detail what types of abstractions there are, which I wrote up here. But really I think that’s still too abstract and still too top-down and one probably needs to study particular words in a lot of detail, then similar words, etc.
Not that this kind of study of language is necessarily the best way to proceed with alignment—I didn’t continue it after my 5 month language-and-orcas-exploration. But I do think concrete study of observations and abstracting slowly is important.
ADDED: Basically, from having tried a little to understand natural/human ontologies myself it does not look to me like natural latents is much progress. But again I didn’t follow your work in detail and if you have concrete plans or evidence of how it’s going to be useful for pointing AIs then lmk.

Towards_Keeperhood 23 Dec 2025 10:04 UTC
4 points
0
in reply to: Mateusz Bagiński’s comment on: Alexander Müller’s Shortform
I meant SFF. No idea what was up with my typing circuits.

Towards_Keeperhood 22 Dec 2025 19:50 UTC
2 points
0
in reply to: Caleb Biddulph’s comment on: Alexander Müller’s Shortform
Does SSD have a fixed or flexible budget? It could be that the bottleneck to Jaan Tallinn’s spending is how many good options there will be to donate to, rather than his budget.

Towards_Keeperhood 19 Dec 2025 10:14 UTC
1 point
0
on: Keltham’s Lectures in Project Lawful
I also recently listened to the planecrash chapter “the meeting of their minds” and while it’s not a lecture it does contain a lot of interesting insights. May seem like weird anthropics brainfuck to some people though. And it definitely contains spoilers.
PS: also check out this lecture. (EDIT: This is mostly “how to relate to beliefs” + “what the truth can destroy”, and then a short section that’s not linked in the post here.)
PPS: Also check out these insights from dath ilan.

Towards_Keeperhood 4 Dec 2025 13:49 UTC
LW: 28 AF: 14
0
AF
on: 6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa
Do you think sociopaths are sociopaths because their approval reward is very weak? And if so, why do they often still seek dominance/prestige?

Towards_Keeperhood 20 Nov 2025 10:32 UTC
1 point
0
in reply to: Jeremy Gillen’s comment on: Serious Flaws in CAST
I’m confused about how you’re thinking about the space of agents, such that “maybe we don’t need to make big changes”?
I just mean that I don’t plan for corrigibility to scale that far anyway (see my other comment), and maybe we don’t need a paradigm shift to get to the level we want, so it’s mostly small updates from gradient descent. (Tbc, I still think there are many problems, and I worry the basin isn’t all that real so multiple small updates might lead us out of the basin. It just didn’t seem to me that this particular argument would be a huge dealbreaker if the rest works out.)
What actions help you land in the basin?
Clarifying the problem first: Let’s say we have actor-critic model-based RL. Then our goal is that the critic is a function on the world model that measures sth like how empowered the principal is in the short term, aka assigning high valence to predicted outcomes that correspond to an empowered principal.
One thing we want to do is to make it less likely that a different function that also fits the reward signal well would be learned. E.g.:
1. We don’t want there to be a “will I get reward” node in the world model. In the beginning the agent shouldn’t know it is an RL agent or how it is trained.
  1. Also make sure it doesn’t know about thought monitoring etc. in the beginning.
2. The operators should be careful to not give visible signs that are strongly correlated to giving reward, like smiling or writing “good job” or “great” or “thanks” or whatever. Else the agent may learn to aim for those proxies instead.
We also want very competent overseers that understand corrigibility well and give rewards accurately, rather than e.g. rewarding nice extra things the AI did but which you didn’t ask for.
And then you also want to use some thought monitoring. If the agent doesn’t reason in CoT, we might still be able to train some translators on the neuralese. We can:
1. Train the world model (aka thought generator) to think more in terms of concepts like the principal, short-term preferences, actions/instructions of the principal, power/influence.
2. Give rewards directly based on the plans the AI is considering (rather than just from observing behavior).
Tbc, this is just how you may get into the basin. It may become harder to stay in it, because (1) the AI learns a better model of the world and there are simple functions from the world model that perform better (e.g. get reward), and (2) the corrigibility learned may be brittle and imperfect and might still cause subtle power seeking because it happens to still be instrumentally convergent or so.
The AI reasons more competently in corrigible ways as it becomes smarter, falling deeper into the basin.
The AI doesn’t fall deeper into the basin by itself, it only happens because of humans fixing problems.
If the AI helps humans to stay informed and asks about their preferences in potential edge cases, does that count as the humans fixing flaws?
Also some more thoughts on that point:
1. Paul seems to guess that there may be a crisp difference in corrigible behaviors vs incorrigible ones. One could interpret that as a hope that there’s sth like a local optimum in model space around corrigibility, although I guess that’s not fully what Paul thinks here. Paul also mentions the ELK continuity proposal there, which I think might’ve developed into Mechanistic Anomaly Detection. I guess the hope there is that to get to incorrigible behavior there will be a major shift in how the AI reasons, e.g. if before the decisions were made using the corrigible circuit, and now they’re coming from the reward-seeking circuit. So perhaps Paul thinks that there’s a basin for the corrigibility circuit, but that the creation of other circuits is also still incentivized and that a shift to those needs to be avoided through disincentivizing anomalies?
  1. Idk seems like a good thing to try, but seems quite plausible we would then get a continuous mechanistic shift towards incorrigibility. I guess that comes down to thinking there isn’t a crisp difference between corrigible thinking and incorrigible thinking. I don’t really understand Paul’s intuition pump for why to expect the difference to be crisp, but not sure, it could be crisp. (Btw, Paul admits it could turn out not to be crisp.) But even then the whole MAD hope seems sorta fancy and not really the sort of thing I would like to place my hopes on.
2. The other basin-like property comes from agents that already sorta want to empower the principal coming to want to empower the principal even more, because this helps empower the principal. So if you get sorta-CAST into an agent it might want to become better-CAST.

Towards_Keeperhood 19 Nov 2025 19:53 UTC
7 points
0
on: Serious Flaws in CAST
But this is a different view of mindspace; there is no guarantee that small changes to a mind will result in small changes in how corrigible it is, nor that a small change in how corrigible something is can be achieved through a small change to the mind!
As a proof of concept, suppose that all neural networks were incapable of perfect corrigibility, but capable of being close to perfect corrigibility, in the sense of being hard to seriously knock off the rails. From the perspective of one view of mindspace we’re “in the attractor basin” and have some hope of noticing our flaws and having the next version be even more corrigible. But in the perspective of the other view of mindspace, becoming more corrigible requires switching architectures and building an almost entirely new mind — the thing that exists is nowhere near the place you’re trying to go.
Now, it might be true that we can do something like gradient descent on corrigibility, always able to make progress with little tweaks. But that seems like a significant additional assumption, and is not something that I feel confident is at all true. The process of iteration that I described in CAST involves more deliberate and potentially large-scale changes than just tweaking the parameters a little, and with big changes like that I think there’s a big chance of kicking us out of “the basin of attraction.”
Idk this doesn’t really seem to me like a strong counterargument. When you make a bigger change you just have to be really careful that you land in the basin again. And maybe we don’t need big changes.
That said, I’m quite uncertain about how stable the basin really is. I think a problem is that sycophantic behavior will likely get a bit higher reward than corrigible behavior for smart AIs. So there are 2 possibilities:
1. stable basin: The AI reasons more competently in corrigible ways as it becomes smarter, falling deeper into the basin.
2. unstable basin: The slightly sycophantic patterns in the reasoning processes of the AI cause the AI to get more reward, pushing the AI further towards sycophancy and incorrigibility.
My uncertain guess is that (2) would by default likely win out in the case of normal training for corrigible behavior. But maybe we could make (1) more likely by using sth like IDA? And in actor-critic model-based RL we could also stop updating the critic at the point when we think the AI might apply smart enough sycophancy that it wins out against corrigibility, and let the model and actor still become a bit smarter.
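A toy illustration of the unstable-basin worry, under loud assumptions: a policy that picks between just two behaviors ("corrigible" and "sycophantic"), a single logit `theta`, exact expected policy-gradient updates, and made-up rewards where sycophancy pays only slightly more. The `freeze_after` parameter is a crude stand-in for "stop updating at some point"; it is not a model of freezing a critic in real actor-critic RL. All names and numbers are hypothetical.

```python
import math
from typing import List, Optional


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(r_corrigible: float, r_sycophantic: float, steps: int,
          lr: float = 1.0, theta0: float = -2.0,
          freeze_after: Optional[int] = None) -> List[float]:
    """Return P(sycophantic) at the start and after each update step.

    Expected reward is J(theta) = p * r_s + (1 - p) * r_c with p = sigmoid(theta),
    so dJ/dtheta = p * (1 - p) * (r_s - r_c). We follow that gradient exactly.
    """
    theta = theta0
    history = [sigmoid(theta)]
    for step in range(steps):
        frozen = freeze_after is not None and step >= freeze_after
        if not frozen:
            p = sigmoid(theta)
            theta += lr * p * (1.0 - p) * (r_sycophantic - r_corrigible)
        history.append(sigmoid(theta))
    return history


# Hypothetical rewards: sycophancy is rewarded only 10% more than corrigibility.
unfrozen = train(1.0, 1.1, steps=2000)
frozen = train(1.0, 1.1, steps=2000, freeze_after=50)
print(f"start={unfrozen[0]:.3f}  "
      f"unfrozen_end={unfrozen[-1]:.3f}  frozen_end={frozen[-1]:.3f}")
```

Even a small reward edge pushes the toy policy steadily towards the sycophantic behavior, and stopping updates early leaves it close to where it started. Whether real training dynamics look anything like this is exactly the open question.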
And then there’s of course the problem of how we land in the basin in the first place. Still need to think about what a good approach for that would look like, but it doesn’t seem implausible to me that we could try in a good way and hit it.

Towards_Keeperhood 19 Nov 2025 19:22 UTC
4 points
0
on: Serious Flaws in CAST
Nice post! And being scared of minus signs seems like a nice lesson.
Absent a greater degree of theoretical understanding, I now expect the feedback loop of noticing and addressing flaws to vanish quickly, far in advance of getting an agent that has fully internalized corrigibility such that it’s robust to distributional and ontological shifts.
My motivation for corrigibility isn’t that it scales all that far, but that we can more safely and effectively elicit useful work out of corrigible AIs than out of sycophants/reward-on-the-episode-seekers (let alone schemers).
E.g. current approaches to corrigibility still rely on short-term preferences, but when the AI gets smarter and its ontology drifts so it sees itself as an agent embedded in multiple places in greater reality, short-term preferences become much less natural. This probably-corrigibility-breaking shift already happens around Eliezer level if you’re trying to use the AI to do alignment research. Doing alignment research makes it more likely that such breaks occur earlier, also because the AI would need to reason about stuff like “what if an AI reflects on itself in this dangerous value-breaking way”, which is sorta close to the AI reflecting on itself in that way. Not that it’s necessarily impossible to use corrigible AI to help with alignment research, but we might be able to get a chunk further in capability if we make the AI not think about alignment stuff and instead just focus on e.g. biotech research for human intelligence augmentation, and that generally seems like a better plan to me.
I’m pretty unsure, but I currently think that if we tried not too badly (by which I mean much better than any of the leading labs seem on track to try, but not requiring fancy new techniques), we may have sth like a 10-75%^[1] chance of getting a +5.5SD corrigible AI. And if a leading lab is sane enough to try a well-worked-out proposal here and it works, it might be quite useful to have +5.5SD agents inside of the labs that want to empower the overseers and at least can tell them that all the current approaches suck and we need to aim for international cooperation to get a lot more time (and then maybe human augmentation). (Rather than having sycophantic AIs that just tell the overseers what they want to hear.)
So I’m still excited about corrigibility even though I don’t expect it to scale.
$\text{power}'(x) = \mathbb{E}_{v \sim Q(V),\; v' \sim Q(V),\; a \sim P(A \mid x, v),\; d \sim P(D \mid x, v', a)}\,[v(d)]$
Restructuring it this way makes it more attractive for the AI to optimize things according to typical/simple values if the human’s action doesn’t sharply identify their revealed preferences. This seems bad.
The way I would interpret “values” in your proposal is like “sorta-short-term goals a principal might want to get fulfilled”. I think it’s probably fine if we just learn a prior over what sort of sorta-short-term goals a human may have, and then use that prior instead of Q. (Or not?) If so, this notion of power seems fine to me.
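To make the quoted expectation concrete, here is a minimal sketch that evaluates it by exact enumeration in a tiny discrete world. It takes the formula purely as written: sample v and v' independently from Q, sample the principal’s action a from P(A | x, v), sample the outcome d from P(D | x, v', a), and score d with v. Everything else is an assumption for illustration: the two value functions, the 90% action reliability, the uniform Q, and the two hypothetical outcome models (`p_outcome_obey`, `p_outcome_ignore`). The context x is held fixed and left implicit.

```python
from itertools import product

VALUES = ("likes_A", "likes_B")   # hypothetical candidate values v
ACTIONS = ("ask_A", "ask_B")      # hypothetical principal actions a
OUTCOMES = ("A", "B")             # hypothetical outcomes d

# Q(V): toy uniform prior over the principal's values.
Q = {"likes_A": 0.5, "likes_B": 0.5}


def p_action(a: str, v: str) -> float:
    """P(A | x, v): the principal asks for their preferred outcome 90% of the time."""
    preferred = "ask_A" if v == "likes_A" else "ask_B"
    return 0.9 if a == preferred else 0.1


def utility(v: str, d: str) -> float:
    """v(d): 1 if outcome d is the one that value function v likes, else 0."""
    return 1.0 if v.endswith(d) else 0.0


def p_outcome_obey(d: str, v_prime: str, a: str) -> float:
    """P(D | x, v', a) for an AI that delivers exactly what the action asked for."""
    return 1.0 if a.endswith(d) else 0.0


def p_outcome_ignore(d: str, v_prime: str, a: str) -> float:
    """P(D | x, v', a) for an AI that ignores the action and serves v' directly."""
    return 1.0 if v_prime.endswith(d) else 0.0


def power_prime(p_outcome) -> float:
    """Exact value of E_{v, v', a, d}[v(d)] under the toy distributions above."""
    total = 0.0
    for v, v_prime, a, d in product(VALUES, VALUES, ACTIONS, OUTCOMES):
        weight = Q[v] * Q[v_prime] * p_action(a, v) * p_outcome(d, v_prime, a)
        total += weight * utility(v, d)
    return total


print("obey actions:  ", round(power_prime(p_outcome_obey), 3))
print("ignore actions:", round(power_prime(p_outcome_ignore), 3))
```

In this toy setup the action-following outcome model scores higher than the one that ignores the action, because the action carries information about v while v' is drawn independently. Swapping in a different Q (e.g. a learned prior over sorta-short-term goals) only changes the `Q` table.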
(If you have time, I also would be still interested in your rough take on my original question.)
1. ^
  (wide range because I haven’t thought much about it yet)

Towards_Keeperhood 18 Nov 2025 8:22 UTC
1 point
0
on: Can we safely automate alignment research?
Behavioral science of generalization. The first is just: studying AI behavior in depth, and using this to strengthen our understanding of how AIs will generalize to domains that our scalable oversight techniques struggle to evaluate directly.
- Work in the vicinity of “weak to strong” generalization is a paradigm example here. Thus, for example: if you can evaluate physics problems of difficulty level 1 and 2, but not difficulty level 3, then you can train an AI on level 1 problems, and see if it generalizes well to level 2 problems, as a way of getting evidence about whether it would generalize well to level 3 problems as well.
  (This doesn’t work on schemers, or on other AIs systematically and successfully manipulating your evidence about how they’ll generalize, but see discussion of anti-scheming measures below.)
I don’t think this just fails with schemers. A key problem is that it’s hard to distinguish whether you’re measuring “this alignment approach is good” or “this alignment approach looks good to humans”. If it’s the latter, it looks great on level 1 and 2 but then the approach for 3 doesn’t actually work. I unfortunately expect that if we train AIs to evaluate what is good alignment research, they will more likely learn the latter. (This problem seems related to ELK.)
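The worry above can be sketched with a toy dataset, assuming two hypothetical evaluators: one that tracks whether an approach is actually good and one that tracks whether it looks good to humans. The cases, levels, and labels are invented for illustration. On levels 1 and 2 appearance and truth coincide (humans can check), so both evaluators look identical there; they only come apart on level 3.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Case:
    level: int            # difficulty level of the problem
    actually_good: bool   # ground truth (unavailable to humans at level 3)
    looks_good: bool      # what a human rater would say


# Hypothetical data: at level 3 a flawed approach can still look persuasive.
CASES: List[Case] = [
    Case(1, True, True), Case(1, False, False),
    Case(2, True, True), Case(2, False, False),
    Case(3, True, True), Case(3, False, True),
]


def tracks_truth(case: Case) -> bool:
    return case.actually_good


def tracks_appearance(case: Case) -> bool:
    return case.looks_good


def accuracy(evaluator: Callable[[Case], bool], level: int) -> float:
    """Fraction of cases at `level` where the evaluator matches ground truth."""
    cases = [c for c in CASES if c.level == level]
    return sum(evaluator(c) == c.actually_good for c in cases) / len(cases)


for name, ev in [("truth", tracks_truth), ("appearance", tracks_appearance)]:
    print(name, [accuracy(ev, lvl) for lvl in (1, 2, 3)])
```

Good generalization from level 1 to level 2 is therefore equally expected under both evaluators in this sketch, which is why it gives little evidence about level 3.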

Towards_Keeperhood 17 Nov 2025 11:53 UTC
3 points
2
on: How human-like do safe AI motivations need to be?
they treat incidents of weird/bad out-of-distribution AI behavior as evidence alignment is hard, but they don’t treat incidents of good out-of-distribution AI behavior as evidence alignment is easy.
I don’t think Nate or Eliezer were expecting to see bad cases this early, and I don’t think seeing bad cases updated them much further towards pessimism, since they were already pretty pessimistic before. I don’t think they update in a non-Bayesian way as you seem to think, it’s just that AIs being nice in new circumstances isn’t much evidence for alignment being easy given their models.
I think thinking in terms of behavior generalization is a bad frame for thinking about what really smart AIs will do. You rather need to think in terms of optimization / goal-directed reasoning. E.g. if you imagine a reward maximizer, it’s 0 surprising that it works well while it cannot escape control measures, but when it is really smart so it can, it’s not surprising that it will.

Towards_Keeperhood 17 Nov 2025 11:52 UTC
3 points
0
on: How human-like do safe AI motivations need to be?
Thanks for writing up your views in detail!
On corrigibility:
Corrigibility was originally intended to mean that a system that has that property does not run into nearest unblocked strategy problems, unlike the kind of adversarial dynamic that exists between deontological and consequentialist preferences. In your version, the consequentialist planning to fulfill a hard task given by the operators is at odds with the deontological constraints.
I also think it is harder to get robust deontological preferences into an AI than one would expect given human intuitions. The human reward system is wired in a way that we robustly get positive rewards for pro-social self-reflective thoughts. Perhaps we can have another AI monitor the thoughts of our main AI and likewise reward pro-social (self-reflective?) thoughts (although I think LLM-like AIs would likely be self-reflective in a rather different way than humans). However, I think for humans our main preferences come from such approval-directed self-reflective desires, whereas I expect the way people will train AI by default will cause the main smart optimization to aim for object-level outcomes, which are then more at odds with norm-following. (See this post, especially sections 2.3 and 3.) (And also even for humans it’s not quite clear whether deontological/norm-following preferences are learned deeply enough.)
So basically, I don’t expect the main optimization of the AI to end up robustly steering in a way to fulfill deontological preferences, and it’s rather like trying to enforce deontological preferences by having a different AI monitor the AI’s thoughts and constraining it to not think virtue-specification-violating thoughts. So when the AI gets sufficiently smart you get (1) nearest-unblocked-strategy problems, like non-AI-interpretable disobedient thoughts; and/or (2) collusion if you didn’t find some other way to make your AIs properly corrigible.
Deontological preferences aren’t very natural targets for a steering process to steer toward. It somehow sometimes works out for humans because their preferences derive more from their self-image, rather than from environmental goals. But if you try to train deontological preferences into an AI with current methods, it won’t end up deeply internalizing those, but rather learns a belief that it should not think disobedient thoughts, or an outer-shell non-consequentialist constraint.
(I guess given that you acknowledge nearest-unblocked strategy problems, you might sorta agree with this, though still plausible to me that you overestimate how deep and generalizing trained-in deontological constraints would be.)
Myopic instruction-following is already going in a roughly good direction in terms of what goal to aim for, but I think if you give the AI a task, the steering process towards that task would likely not have a lot of nice corrigibility properties by default. E.g. it seems likely that in steering toward such a task, it would see the possibility of the operator telling it to stop as an obstacle to the goal of task-completion. (I mean it’s a bit ambiguous how exactly to interpret instruction-following here, but I think that’s what you get by default if you naively try to train for it as current labs would.)
It would be much nicer if the powerful steering machinery isn’t steering in a way that it would naturally disempower us (absent thought and control constraints) in the first place. I think aiming for CAST would be much better: Basically, we want to point the powerful steering machinery towards a goal like “empower the principal” which then implies instruction-following and keeping the principal in control^[1]. It also has the huge advantage that steering towards roughly-CAST may be enough for the AI to want to empower the principal more, so it may try to change itself into sth like more-correct-CAST (aka Paul Christiano’s “basin of corrigibility”). (But obviously difficulties of getting the intended target instead of sth like reward-seeking AI still apply.)
1. ^
  Although not totally sure whether it works robustly, but in any case seems much much much better to aim for than sth like Anthropic’s HHH.

Towards_Keeperhood 16 Nov 2025 13:14 UTC
1 point
0
on: The Unreasonable Effectiveness of Fiction
I heard you mention in the doom debates podcast that you’re working on an audiobook but it “may take a while”. Could you give a quantitative guess for how long?

Towards_Keeperhood 14 Nov 2025 8:46 UTC
1 point
0
on: Giving AIs safe motivations
Do you count avoiding reward-on-the-episode-seekers as part of step 2 or step 3?

Towards_Keeperhood 8 Nov 2025 18:41 UTC
3 points
0
in reply to: Max Harms’s comment on: 3b. Formal (Faux) Corrigibility
Thanks!
The single-timestep case actually looks fine to me now, so I return to the multi-timestep case.
I would want to be able to tell the AI to do a task, and then while the AI is doing the task, tell it to shut down, so it shuts down. And the hard part here is ensuring that, while doing the task, the AI doesn’t in some way prevent me from saying it should shut down (because it would get higher utility if it manages to fulfill the values-as-inferred-through-principal-action of the first episode). This seems like it may require a bit of a different formalization than your multi-timestep one (although feel free to try in your formalization).
Do you think your formalism could be extended so it works in the way we want for such a case, and why (or why not)? (And ideally also roughly how?)
(Btw, even if it doesn’t work for the case above, I think this is still really excellent progress and it does update me to think that corrigibility is likely simpler and more feasible than I thought before. Also thanks for writing up the formalism.)

Towards_Keeperhood 7 Nov 2025 9:49 UTC
1 point
0
on: Plans A, B, C, and D for misalignment risk
Here is the takeover risk I expect given a central version of each of these scenarios (and given the assumptions from the prior paragraph):^[4]
- Plan A: 7%
- Plan B: 13%
- Plan C: 20%
- Plan D: 45%
- Plan E: 75%
I think it makes more sense to state overall risk instead of takeover risk, because that’s what we care about. Could you give very rough guesses on what fraction of achievable utility we would get in expectation conditional on each Plan? (“achievable utility” is the utility we would get if the future goes optimally, like CEV aligned superintelligence.) Or just roughly break down how good you expect the non-takeover worlds to be?
I’m a lot more pessimistic than you, and I’m currently trying to get a better understanding of the world model of people like you. Some questions I have: (Feel free to link me to other resources.)
1. Do you agree that we need to model AIs as goal-directed reasoners and make sure they are steering towards the right outcomes? Or does some hope come from worlds where AIs just end up nice through behavioral generalization or so?
2. There seems to be a lot of work on avoiding schemers, but on my model, if we don’t get a schemer (which is quite plausible) we very likely get a sycophant, which I think still kills us at very high levels of intelligence. Unless we make a relatively good attempt at training corrigibility (which IMO sorta at least requires Plan C), getting a saint seems very unlikely to me. Do you hope to get a saint, or are you fine with a sycophant that you can use to get even better coordination between leading actors to then transition to a better plan^[1] or so? Or what are your thoughts?
3. (If you know of any success stories that are relatively concrete about how the AI was trained and which seem somewhat plausible to you, I would be interested in reading them.)
1. ^
  Although I think that wouldn’t really explain the 25% non-takeover chance on Plan E?

Towards_Keeperhood 7 Nov 2025 8:31 UTC
1 point
0
on: How will we update about scheming?
Thanks for giving a legible account of your probability distribution that dominate top human experts.
- 1:1 very good: top human expert at opaque goal-directed reasoning with the relevant context with a bunch of time to think _(My modal view)_
Could you clarify what you mean by “a bunch of time to think” here? 30 minutes? 1 week? roughly indefinite?
(Btw: “Top human experts” is quite vague and problematic because the human capability distribution is often quite heavy-tailed. And I am most interested in the question of whether AIs will be scheming when they start to be able to do AI safety research as well as the best humans, which I think may be a chunk after they surpass humans at capabilities research, since safety research seems less verifiable and people will try less hard to automate safety. On my model that does make success a decent amount harder than if the bar is “ordinary top human experts”.)