For my take on that, see §2.4.1 immediately afterwards. (I framed it in terms of RLHF, but it also applies to constitutional AI etc.).
So to summarize, you think if we apply brain-like-AGI levels of optimization pressure, the LLM will get Goodharted? Plausible, but isn’t the same also true of, say, a visual classifier for spiders? Visual classifiers generally also have rare adversarial cases. But in that specific case, the classifier is still useful, because spiders themselves are unlikely to apply a lot of adversarial pressure. Now, someone leaving, say, a toy spider on a string dangling and wriggling in the breeze near my bed to play a trick on me might apply significant adversarial pressure, and might manage to fool my brainstem’s visual classifier and give me a nasty shock, but that’s a fairly rare problem. In general, learning processes in my cortex aren’t exerting a lot of their capability actively trying to exploit flaws in my brainstem’s visual classifiers (except possibly when they lead to putting Taylor Swift posters on my wall). So an LLM used as a component in the steering subsystem of a brain-like AGI, if it was vulnerable to Goodharting, might need to be used in a way like the “spider visual classifier”: not often directly exposed to strong optimization pressure from a learning subsystem optimizing something that will Goodhart to exploit flaws in it?
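To make “rare adversarial cases under optimization pressure” concrete, here’s a toy sketch (all weights and numbers made up for illustration): a tiny logistic “spider classifier,” and an adversary with gradient access who applies an FGSM-style perturbation that flips a confidently-not-spider input into a confidently-spider one. The point is just that a classifier that’s fine under natural inputs can fail badly once something optimizes against it.

```python
import numpy as np

# Toy "spider classifier": logistic regression on a 2-feature input.
# Hypothetical weights; the only point is that a small gradient-guided
# perturbation can flip a confident classification.
w = np.array([2.0, -1.5])
b = 0.1

def predict(x):
    # Probability that the input is a "spider"
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

x = np.array([-1.0, 1.0])            # clearly "not a spider"
p_before = predict(x)                # confidently low

# Adversary nudges x in the direction that most increases the spider
# score: one signed-gradient (FGSM-style) step of size eps.
grad = w * p_before * (1.0 - p_before)   # dp/dx for the logistic model
eps = 2.0
x_adv = x + eps * np.sign(grad)
p_after = predict(x_adv)             # now confidently high

print(p_before, p_after)
```

Without the gradient-guided direction (e.g., a random perturbation of the same size), the flip is much less likely, which is the analogue of spiders not optimizing against my brainstem.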
I’m not sure what you mean here. What would it look like, in your opinion, for the human brainstem to not manage to supervise something a lot smarter than it is?
My opinion is that neither the learning nor steering subsystems are self-contained agents with goals, and so we shouldn’t talk about them as if they are.
Thus I don’t think it’s a good idea to anthropomorphize the steering subsystem. …But if we did so anyway, we would presumably say that the steering subsystem “wants” people to eat more food under certain conditions, and that this “desire” gets subverted when the person takes Ozempic. Ditto when the person takes Adderall, melatonin, caffeine, etc.
Agreed, they’re two different subsystems within something that’s only an agent as the end result of a learning process based on the interaction between them, not two separate agents, so anthropomorphizing them is unhelpful. “Supervise” was probably the wrong word — I guess I really meant “act as the critic for”. I’m trying to think of this more like, say, a GAN than a pair of agents.
My assumption is that failure modes would look like the learning subsystem Goodharting the steering subsystem and learning weird failure modes of it, without needing any technological trickery like Ozempic, Adderall, melatonin, caffeine, etc. Or indeed posters of Taylor Swift or realistic toy spiders on a string. (All of which are products of many human agents working together, not a single learning subsystem.)
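As a toy illustration of that failure mode (everything here is invented for illustration): a frozen, flawed “critic” reward function standing in for the steering subsystem, and a simple hill-climbing “learner” that optimizes the proxy until it finds the flaw, ending up with a high proxy score and a terrible true score, with no external trickery required.

```python
import numpy as np

rng = np.random.default_rng(0)

# True objective the designer cares about: stay near x = 0.
def true_reward(x):
    return -x**2

# Frozen "steering subsystem" critic: a flawed proxy for the above.
# The flaw (hypothetical): a broad region around x = 10 that it
# mistakenly scores very highly.
def proxy_reward(x):
    return -x**2 + 150.0 * np.exp(-((x - 10.0) / 6.0) ** 2)

# "Learning subsystem": random-search hill climbing on the PROXY only.
x = 0.0
for _ in range(5000):
    candidate = x + rng.normal(scale=1.0)
    if proxy_reward(candidate) > proxy_reward(x):
        x = candidate

# The learner ends up in the flawed region: high proxy score,
# badly violated true objective.
print(x, proxy_reward(x), true_reward(x))
```

Nothing here is adversarial in intent; the learner just follows whatever gradient the critic’s flaws happen to create.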
The normal assumption I’ve seen people use when designing an architecture with two subsystems like this, where bad things would happen if one managed to Goodhart the other, is that they need to be of roughly equal capability. In GANs, for example, forward progress basically only happens while the two sides are relatively competitive, and one isn’t beating the other almost all the time. So possibly GANs are just not a great analogy here, as actor-critic obviously isn’t the same thing, though there seems to be at least a loose analogy. But in the human brain, as you point out, it’s O(90%) learning subsystem and O(10%) steering subsystem, at least by volume, which suggests their capabilities aren’t very balanced and differ by at least an order of magnitude. Plus, the specification for the steering subsystem needs to fit in the genome, which is a strong constraint on its design complexity (far, far tighter than, say, its synapse count, so clearly its synapses are not individually genetically coded, implying there’s probably some learning going on in them too).

So why can the steering subsystem act as critic for an actor with roughly an order of magnitude more brain volume? I was curious whether you’d thought about how and why that works in humans. Your post on a possible mechanism for social drives suggests the steering subsystem is doing a form of learned interpretability that lets it piggyback on some of the learning subsystem’s capacity, which would presumably help. And obviously that 90:10 ratio took many millions of years to reach from when it was ~50:50, though quite a lot of human cortical expansion happened in the last few million years, so maybe the answer is partly just that we worked the bugs out slowly, or maybe we haven’t actually worked all the bugs out yet. But it’s striking that nature’s solution wasn’t to make the steering subsystem bigger too and keep the ratio roughly constant.
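A quick back-of-envelope check on the genome-size point (round, order-of-magnitude numbers; the steering-subsystem fraction is a deliberately generous guess): even if the whole genome coded nothing else, it falls short of one bit per synapse for just the steering subsystem.

```python
# Back-of-envelope: can the genome specify synapses individually?
# All numbers are rough order-of-magnitude estimates.

genome_base_pairs = 3e9           # ~3 billion bp in the human genome
bits_per_base_pair = 2            # 4 possible bases = 2 bits each
genome_bits = genome_base_pairs * bits_per_base_pair   # ~6e9 bits

total_synapses = 1e14             # ~10^14 synapses in the whole brain
steering_fraction = 0.01          # hypothetical: steering = 1% of synapses
steering_synapses = total_synapses * steering_fraction  # ~1e12

# Even one bit per synapse, for the steering subsystem alone, exceeds
# the genome's total information content by orders of magnitude.
print(genome_bits, steering_synapses, steering_synapses / genome_bits)
```

So the steering subsystem’s design has to be specified compactly and then unfolded developmentally, or involve local learning, which is consistent with its synapses not being individually coded.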
Maybe because of the genome-size constraint?