the reward function is written in Python (or whatever), not in natural language
What’s your opinion on LLM-as-judge, Constitutional AI, and so forth, where the specification is written in natural language? Those are becoming increasingly common. That seems like it fixes the literal genie issue just as well as using an LLM as agent does.
It does of course break down if the agent you’re applying the RL to is enough smarter than the judge to be able to find loopholes/schemes that the LLM judge just can’t figure out. If the judge is an LLM learning from humans, then even after extrapolating from documents like Wikipedia articles and scientific papers that were written very slowly and carefully by multiple humans, and are better than anything any human could just write straight out, its maximum capability is probably limited to some multiple of human capacity. So this solution might reach low ASI, but not indefinitely far (unless we could figure out some way to recurse it).
[On the other hand, this uses the normal assumption that the judge needs to be roughly as smart as what it’s supervising or it will get tricked (as in a GAN), but the human brainstem clearly manages to supervise something a lot smarter than it is. Possibly because in brain-like AGI they’re not two separate agents, but two parts of the same agent? Still, why doesn’t the learning subsystem in brain-like AGI learn ways to trick/manipulate the steering subsystem? Or does it?]
So, how would we recurse an LLM to get it to a higher capacity? Take off from there: build a value-learning version of AIXI (with approximate Bayesianism so it’s physically implementable). Ruthless consequentialism + Bayesian inference on a world model with a known-to-be-not-fully-known utility function, connected to ruthless consequentialism + Bayesian inference on figuring out what the correct utility function is by learning human values. Clearly the latter becomes a high priority: assume the model starts smart enough to understand the statistical arguments for Goodhart’s Law, so it understands that extrapolating out-of-distribution (i.e. outside the region that is well-predictable given the extent to which its approximate Bayesianism has narrowed the hypothesis set for its utility function) is a recipe for miscalculations unless it pessimizes over the remaining Knightian (i.e. hypothesis) uncertainty. So it pessimizes correctly (fiddly, but basically well-understood graduate-level statistics), rather than naively extrapolating out of the space it understands. Thus it realizes that doing more value learning would improve both its accuracy and the action space that it knows how to use safely. (Anything not smart enough to understand and do this couldn’t do scientific research anyway.)
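To make the "pessimize over remaining hypothesis uncertainty" point concrete, here is a toy sketch: instead of maximizing the posterior-mean utility (which a single divergent hypothesis can drag far out of distribution), take a maximin over all candidate utility functions that still have non-trivial posterior mass. Everything here (the hypotheses, posterior weights, and mass threshold) is invented purely for illustration:

```python
import numpy as np

# Toy setup: we are uncertain which candidate utility function is the
# true one. The candidates agree near a = 0 (in-distribution) but
# diverge far from it (out of distribution).
actions = np.linspace(-3, 3, 61)

hypotheses = [
    lambda a: 1.0 - a**2,            # prefers staying near 0
    lambda a: 1.0 - (a - 0.5)**2,    # slight offset, similar shape
    lambda a: np.exp(a),             # divergent: rewards going far right
]
posterior = np.array([0.35, 0.35, 0.30])  # assumed posterior over hypotheses

# Utility of every action under every hypothesis.
U = np.array([[h(a) for a in actions] for h in hypotheses])

# Naive extrapolation: maximize posterior-mean utility. The divergent
# hypothesis drags the optimum to the edge of the action space.
mean_choice = actions[np.argmax(posterior @ U)]

# Pessimizing over the remaining hypothesis (Knightian) uncertainty:
# maximize the worst case over hypotheses with non-trivial mass.
live = posterior > 0.05
pessimist_choice = actions[np.argmax(U[live].min(axis=0))]

print(mean_choice, pessimist_choice)
```

With these made-up numbers the mean-maximizer runs to the boundary of the action space while the pessimist stays near the region where the hypotheses agree, which is the behavior the argument above is relying on.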
Initialize the utility-function half of that with everything an LLM knows about us humans and what we value (i.e. enough data to require many billions of parameters to store: it’s a very large Bayesian model, because human values are complex; you can probably leave it implemented as an LLM, since LLMs can do Bayesian inference), and use that as its initial priors. Then carefully define the mission statement for what the value learning (“researching human values”) means, while using the LLM’s knowledge to avoid literal-genie misreadings of that definition. Ground that definition scientifically/objectively in Evolutionary Psychology, not in something ill-defined and uncertain like Ethical Philosophy.
Would that version of ASI fail? Why? “Don’t act like a sociopath” is already baked into the starting priors of the utility function part of it. Value learning from humans shouldn’t unlearn that.
So basically, use an LLM as priors, then improve from there in a way that wires “your job is to figure out what the humans really want, and do that and only that” in as an RL objective.
For caution, start with it about as smart as the LLM was, so it can’t just immediately fool the “judge” that its utility function is acting as (and with it also smart enough to understand that doing so is just another way of Goodharting, i.e. making a mistake), and ramp the capability up slowly during the value-learning process, as it converges from “as well aligned as an LLM” towards “fully aligned”. Despite my describing this as a single training run, this ramp-up is probably more of a FOOM+simultaneous-value-learning.
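A cartoon of that gated ramp-up, with every number invented purely for illustration: capability is allowed to grow only as fast as the residual value uncertainty shrinks, so the system is never much more capable than its current understanding of the utility function warrants.

```python
# Cartoon of capability ramp-up gated by value learning (all constants
# invented for illustration; not a real training dynamic).
capability = 1.0       # start "about as smart as the LLM was"
uncertainty = 1.0      # residual hypothesis uncertainty about human values

trajectory = []
for step in range(20):
    uncertainty *= 0.7                   # each round of value learning narrows the hypothesis set
    cap_limit = 1.0 / uncertainty        # assumed safety rule: capability capped by inverse uncertainty
    # Unconstrained self-improvement would grow 1.5x per step (a FOOM),
    # but the cap throttles it to the pace value learning permits.
    capability = min(capability * 1.5, cap_limit)
    trajectory.append((step, capability, uncertainty))
```

With these numbers the cap binds at every step, so capability rides the alignment frontier (~1.43x per step) rather than growing at its unconstrained 1.5x rate, which is the intended "FOOM+simultaneous-value-learning" shape.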
What’s your opinion on LLM-as-judge, Constitutional AI, and so forth, where the specification is written in natural language? Those are becoming increasingly common. That seems like it fixes the literal genie issue just as well as using an LLM as agent does.
For my take on that, see §2.4.1 immediately afterwards. (I framed it in terms of RLHF, but it also applies to constitutional AI etc.) And see also §3.5 of a different post.
the human brainstem clearly manages to supervise something a lot smarter than it is
I’m not sure what you mean here. What would it look like, in your opinion, for the human brainstem to not manage to supervise something a lot smarter than it is?
My opinion is that neither the learning nor steering subsystems are self-contained agents with goals, and so we shouldn’t talk about them as if they are.
Thus I don’t think it’s a good idea to anthropomorphize the steering subsystem. …But if we did so anyway, we would presumably say that the steering subsystem “wants” people to eat more food under certain conditions, and that this “desire” gets subverted when the person takes Ozempic. Ditto when the person takes Adderall, melatonin, caffeine, etc.
For my take on that, see §2.4.1 immediately afterwards. (I framed it in terms of RLHF, but it also applies to constitutional AI etc.).
So to summarize, you think that if we apply brain-like-AGI levels of optimization pressure, the LLM will get Goodharted? Plausible, but isn’t the same also true of, say, a visual classifier for spiders? Visual classifiers generally also have rare adversarial cases. But in that specific case, it’s still useful because spiders themselves are unlikely to apply a lot of adversarial pressure. Now, someone leaving, say, a toy spider on a string dangling and wriggling in the breeze near my bed to play a trick on me might apply significant adversarial pressure, and might manage to fool my brainstem’s visual classifier and give me a nasty shock, but it’s a fairly rare problem. In general, learning processes in my cortex aren’t exerting a lot of their capability actively trying to exploit flaws in my brainstem’s visual classifiers (except possibly when they lead to putting Taylor Swift posters on my wall). So then an LLM used as a component in the steering subsystem for brain-like AGI, if it was vulnerable to Goodharting, might need to be used in a way like the “spider visual classifier”: not often directly exposed to strong optimization pressure from a learning subsystem trying to optimize something that will Goodhart to exploit flaws in it?
I’m not sure what you mean here. What would it look like, in your opinion, for the human brainstem to not manage to supervise something a lot smarter than it is?
My opinion is that neither the learning nor steering subsystems are self-contained agents with goals, and so we shouldn’t talk about them as if they are.
Thus I don’t think it’s a good idea to anthropomorphize the steering subsystem. …But if we did so anyway, we would presumably say that the steering subsystem “wants” people to eat more food under certain conditions, and that this “desire” gets subverted when the person takes Ozempic. Ditto when the person takes Adderall, melatonin, caffeine, etc.
Agreed: they’re two different subsystems within something that’s only an agent as the end result of a learning process based on the interaction between them, not two separate agents, so anthropomorphizing them is unhelpful. “Supervise” was probably the wrong word; I guess I really meant “act as the critic for”. I’m trying to think of this more like, say, a GAN than as a pair of agents.
My assumption is that failure modes would look like the learning subsystem Goodharting the steering subsystem and learning weird failure modes of it, without needing any technological trickery like Ozempic, Adderall, melatonin, caffeine, etc. Or indeed posters of Taylor Swift or realistic toy spiders on a string. (All of which are products of many human agents working together, not a single learning subsystem.)
The normal assumption I’ve seen people use, when designing an architecture using two subsystems like this where bad things would happen if one managed to Goodhart the other, is that they need to be of roughly equal capabilities. In GANs, for example, forward progress basically only happens while the two sides are relatively competitive, and one isn’t beating the other almost all the time. So possibly GANs are just not a great analogy here, as actor-critic obviously isn’t the same thing, though there seems to be at least a loose analogy.

But in the human brain, as you point out, it’s O(90%) learning subsystem and O(10%) steering subsystem, at least by volume, which suggests their capabilities aren’t very balanced and differ by at least an order of magnitude. Plus, the specification for the steering subsystem needs to fit in the genome, which is a strong constraint on its design complexity (far, far tighter than, say, its synapse count, so clearly its synapses are not individually genetically coded, implying there’s probably some learning going on in it too). So why can the steering subsystem act as critic for an actor system with roughly an order of magnitude more brain volume? I was curious whether you’d thought about how and why that works in humans. Your post on a possible mechanism for social drives suggests the steering subsystem is doing a form of learned interpretability that lets it piggyback on some of the learning subsystem’s capacity, which would presumably help.

And obviously that 90:10 ratio took many millions of years to reach from when it was ~50:50, though quite a lot of human cortical expansion happened in the last few million years, so maybe the answer is partly just that we worked the bugs out slowly, or maybe we haven’t actually worked all the bugs out yet. But it’s striking that nature’s solution wasn’t to make the steering subsystem bigger too, and keep the ratio roughly constant.
Maybe because of the genome-size constraint?