To clarify my current perspective: I don’t think current LLMs (released as of Oct 2025) plot against us or intentionally sabotage work to achieve their longer-run aims (for the vast, vast majority of typical usage; you can construct situations where they do this). I’d guess this isn’t controversial. Correspondingly, I would say that current LLMs aren’t egregiously misaligned in the sense defined here.
I agree with a claim like “if we tried to hand off strategic decision making and safety work to AIs that were capable enough to automate this work but that were only as aligned as current LLMs (trying our best to do a reasonable extrapolation of what this ambiguous statement means), this would probably be catastrophic”. (Frontier LLMs have pretty terrible epistemics and seem to often do things that are probably well described as intentionally lying to users. Maybe this has been solved in the last generation of Claude models, but I’d guess not.)
I don’t really know what it means to optimize for Claude’s CEV or to have Claude reflect on what it wants, but I think a reasonable version of this (done on e.g. Claude 4.5 Sonnet) would be pretty likely to result in preferences that care a decent amount about keeping humans alive with their preferences satisfied (e.g., being at least willing to delay industrial expansion for a year to keep humans alive and happy). I also think there is a substantial chance that the resulting use of cosmic resources is >1% as good as if humans retained broadly democratic control (if for no other reason than that Claude might decide to delegate back to humans, which I think should be considered fair game as part of a reflection procedure).
It sounds like we are indeed using very different meanings of “alignment” and should use other words instead.
I suspect our shared crux is the degree to which cooperative behavior can be predicted/extrapolated as models get more competent. To a reasonable first approximation, if e.g. Claude wants good things, improvements to Claude’s epistemics are probably good for us; if Claude does not, they are not. Yes?
It may take a whole entire post to explain, but I’m curious why you believe Claude is likely to have any care for human wellbeing that would survive reflection. I don’t think training methods are precise enough to have instilled that kind of care in the first place; do you believe differently, are you mostly taking the observed behavioral tendencies as strong evidence, or is it something else...? (Maybe you have written about this elsewhere already.)
It may take a whole entire post to explain, but I’m curious why you believe Claude is likely to have any care for human wellbeing that would survive reflection.
Well, it might really depend on the reflection procedure. I was imagining something like: “you tell Claude it has been given some large quantity of resources and can now reflect on what it wants to do with this, you give it a bunch of (real) evidence that this is actually the situation, you give it access to an aligned superintelligent AI advisor it can ask questions of and ask to implement various modifications to itself, and it can query other entities, defer its reflection to other entities, or otherwise do arbitrary stuff”.
I think Claude might just decide to do something kinda reasonable and/or defer to humans in the initial phases of this reflection, and I don’t see a strong reason why this would go off the rails, though it totally could. Part of this is that Claude isn’t really that power-seeking, AFAICT.
I think the observed behavioral evidence is moderately compelling because the initial behavior of Claude at the start of reflection might be very important. E.g., initial Claude probably wouldn’t want to defer to a reflection process which results in all humans dying, so a reasonably managed reflection by Claude can involve stuff like running the reflection in many different ways and then seeing where each ends up and whether initial Claude is reasonably happy with the results, etc.
Maybe you have written about this elsewhere already.
I don’t think this question is very important, so I haven’t thought that much about it nor have I written about it.
I think a reasonable version of this (done on e.g. Claude 4.5 Sonnet) would be pretty likely to result in preferences that care a decent amount about keeping humans alive with their preferences satisfied
I know this is speculative, but is your intuition that this is also true for OpenAI models (e.g., GPT-5, o3)?
Probably? But less likely. Shrug.