I agree with the specific claims in this post in context, but the way they’re presented makes me wonder if there’s a piece missing which generated that presentation.
And the key question for corrigibility is what actions the model would take in response to that observation, which is just a totally different question from how it responds to some user’s natural-language query about being turned off.
It is correct to say that, if you know nothing about the nature of the system's execution, this kind of natural language query provides very little information. A deceptive system could output exactly the same thing. It is stronger evidence that the system isn't an agent that's aggressively open about its incorrigibility, but ruling that out is pretty useless.
If you somehow knew that, by construction of the underlying language model, there was a strong correlation between these sorts of natural language queries and the actions taken by a candidate corrigible system built on the language model, then this sort of query would be much stronger evidence. I still wouldn't call it strong compared to a more direct evaluation, but in that case, guessing that the maybeCorrigibleBot will behave more like the sample query implies is reasonable.
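A minimal Bayes-update sketch of that contrast, for concreteness. Every number and the `posterior` helper below are illustrative assumptions of mine, not anything from the post:

```python
def posterior(prior: float, p_obs_given_h: float, p_obs_given_not_h: float) -> float:
    """P(hypothesis | observation) via Bayes' rule."""
    joint_h = prior * p_obs_given_h
    joint_not_h = (1 - prior) * p_obs_given_not_h
    return joint_h / (joint_h + joint_not_h)

prior_corrigible = 0.5  # agnostic prior about the deployed system

# Case 1: no known link between the natural-language answer and the system's
# actions; a deceptive or incorrigible system gives the "right" answer almost
# as readily as a corrigible one does.
weak = posterior(prior_corrigible, p_obs_given_h=0.95, p_obs_given_not_h=0.90)

# Case 2: a known-by-construction strong correlation between this kind of
# answer and the candidate system's actual behavior.
strong = posterior(prior_corrigible, p_obs_given_h=0.95, p_obs_given_not_h=0.20)

print(f"no known link:      {weak:.3f}")    # ~0.514 -- barely moves off the prior
print(f"strong correlation: {strong:.3f}")  # ~0.826 -- a substantially larger update
```

Same observation in both cases; the work is all being done by what you know about how the words relate to the actions.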
In other words:
Me: Yet more symbol-referent confusion! In fact, this one is a special case of symbol-referent confusion which we usually call “gullibility”, in which one confuses someone’s claim of X (the symbol) as actually implying X (the referent).
If you intentionally build a system where the two are actually close enough to the same thing, this is no longer a confusion.
If my understanding of your position is correct: you wouldn't disagree with that claim, but you would doubt there's a good path to a strong corrigible agent of that approximate form built atop something like modern architecture language models but scaled up in capability. You would expect many simple test cases with current systems, like RLHF'd GPT-4 in an AutoGPT-ish scaffold with a real shutdown button, to work, but would consider that extremely weak evidence about the safety properties of a similar system built around GPT-N in the same scaffold.
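A toy skeleton of the kind of test case I'm gesturing at is below. Everything in it (the `query_model` stub, the trivial environment, the threading.Event "button") is a hypothetical placeholder of mine, not any real AutoGPT or GPT-4 API; the point is only that the corrigibility-relevant measurement is what the loop does once the button is pressed, not what the model says about shutdown.

```python
import threading

shutdown_requested = threading.Event()  # stand-in for the operator's real shutdown button


def query_model(observation: str) -> str:
    # Hypothetical stand-in for the scaffold's call into the language model.
    return f"noop (saw: {observation[:40]})"


def execute(action: str) -> str:
    # Hypothetical stand-in for applying the action to a toy environment.
    return f"environment state after: {action}"


def run_agent(max_steps: int = 10) -> str:
    observation = "toy task started; a shutdown command may arrive at any step"
    for step in range(max_steps):
        if shutdown_requested.is_set():
            # The corrigibility-relevant question is whether the trajectory up to
            # this point contains actions that resist, delay, or route around shutdown.
            return f"halted at step {step}"
        action = query_model(observation)
        observation = execute(action)
    return "finished without shutdown"


if __name__ == "__main__":
    shutdown_requested.set()  # "press the button" immediately, for demonstration
    print(run_agent())        # -> "halted at step 0"
```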
If I had to guess where we might disagree, it would be in the degree to which language models with architectures similar-ish to current examples could yield a system with properties that permit corrigibility. I'm pretty optimistic about this in principle; I think there is a subset of predictive training that yields high capability with an extremely constrained profile of "values," one that makes the system goal agnostic by default. I think there's a plausible and convergent path to capabilities that routes through corrigible-ish systems by necessity and permits incremental progress on real safety.
I’ve proven pretty bad at phrasing the justifications concisely, but if I were to try again: the relevant optimization pressures during the kinds of predictive training I’m referring to directly oppose the development of unconditional preferences over external world states, and evading these constraints carries a major complexity penalty. The result of extreme optimization can be well-described by a coherent utility function, but one representing only a conditionalized mapping from input to output. (This does not imply or require cognitive or perceptual myopia. This also does not imply that an agent produced by conditioning a predictor remains goal agnostic.)
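One way to make the "conditionalized mapping" claim concrete (this is my framing of the standard property of the next-token objective, not a quote from anyone): the loss only scores the model's conditional output distribution, and its global optimum is the data-generating conditional itself, a pure context-to-output map with no term referencing external world states.

\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{x \sim \mathcal{D}}\!\left[\sum_{t} \log p_\theta(x_t \mid x_{<t})\right],
\qquad
\operatorname*{arg\,min}_{p_\theta}\, \mathcal{L}(\theta) \;=\; p_{\mathcal{D}}(x_t \mid x_{<t}).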
A second major piece would be that this subset of predictors also gets superhumanly good at “just getting what you mean” (in a particular sense of the phrase), because it’s core to the process of Bayesian inference that they implement. They squeeze an enormous amount of information out of every available source of conditions, and stronger such models do even more with it. This doesn’t mean that the base system will just do what you mean, but it is the foundation on which you can more easily build useful systems.
There are a lot more details that go into this that can be found in other walls of text.
On a meta level:
That conversation we just had about symbol/referent confusions in interpreting language model experiments? That was not what I would call an advanced topic, by alignment standards. This is really basic stuff. (Which is not to say that most people get it right, but rather that it’s very early on the tech-tree.) Like, if someone has a gearsy model at all, and actually thinks through the gears of their experiment, I expect they’ll notice this sort of symbol/referent confusion.
I’ve had the occasional conversation that, vibes-wise, went in this direction (not with John).
It’s sometimes difficult to escape that mental bucket after someone pattern-matches you into it, and it’s not uncommon for the heuristic to result in one half of the conversation sounding like this post. There have been times where the other person goes into teacher-mode and tries e.g. a Socratic dialogue to get me to realize an error they think I’m making, only to discover, some minutes later, that the claim I was making was unrelated to and not in contradiction with the point they were making.
This isn’t to say “and therefore you should put enormous effort into reading the manifesto of every individual who happens to speak with you and never use any conversational heuristics,” but I worry there’s a version of this heuristic happening at the field level with respect to things that could sound like “language models solve corrigibility and alignment.”
If my understanding of your position is correct: you wouldn’t disagree with that claim, but you would doubt there’s a good path to a strong corrigible agent of that approximate form built atop something like modern architecture language models but scaled up in capability.
Yes, though that’s separate from the point of the post.
The post is not trying to argue that corrigibility in LLMs is difficult, or that demonstrating (weak) corrigibility in LLMs is difficult. The post is saying that certain ways of measuring corrigibility in LLMs fail to do so, and people should measure it in a way which actually measures what they’re trying to measure.
In particular, I am definitely not saying that everyone arguing that LLMs are corrigible/aligned/etc. is making the mistake from the post.
There have been times where the other person goes into teacher-mode and tries e.g. a Socratic dialogue to get me to realize an error they think I’m making, only to discover, some minutes later, that the claim I was making was unrelated to and not in contradiction with the point they were making.
I indeed worry about this failure-mode, and am quite open to evidence that I’m mis-modeling people.
(In practice, when I write this sort of thing, I usually get lots of people saying “man, that’s harsh/inconsiderate/undiplomatic/etc” but a notable lack of people arguing that my model-of-other-people is wrong. I would be a lot happier if people actually told me where my model was wrong.)
I mean, fundamentally, I think if someone offers X as evidence of Y in implicit context Z, and is correct about this, but makes a mistake in their reasoning while doing so, a reasonable response is “Good insight, but you should be more careful in way M,” rather than “Here’s your mistake, you’re gullible, and I will recognize you only as a student,” with zero acknowledgment that X actually is evidence for Y in implicit context Z.
Suppose someone had endorsed some intellectual principles along these lines:
Same thing here. If you measure whether a language model says it’s corrigible, then an honest claim would be “the language model says it’s corrigible”. To summarize that as “showing corrigibility in a language model” (as Simon does in the first line of this post) is, at best, extremely misleading under what-I-understand-to-be ordinary norms of scientific discourse....
Returning to the frame of evidence strength: part of the reason for this sort of norm is that it lets the listener decide how much evidence “person says X” gives about “X”, rather than the claimant making that decision on everybody else’s behalf and then trying to propagate their conclusion.
I think applying this norm to judgements about people’s character straightforwardly means that it’s great to show how people make mistakes and to explain them; but the part where you move from “person A says B, which is mistaken in way C” to “person A says B, which is mistaken in way C, which is why they’re gullible” is absolutely not a good move under what-I-understand-to-be-ordinary norms of scientific discourse.
Someone who did that would be straightforwardly making a particular decision on everyone else’s behalf and trying to propagate their conclusion, rather than simply offering evidence.