Interesting. Yeah, I think I can feel the deeper crux between us. Let me see if I can name it. (Edit: Alas, I only succeeded in producing a long-winded dialogue. My guess is that this still doesn’t capture the double-crux.)
Suppose I try to get students to learn algebra by incentivizing them to pass algebra tests. I ask them to solve “23x − 8 = −x”, and if they say “1/3” then I give them a cookie or whatever. If this process succeeds at producing a student who can reliably solve similar equations, I might claim “I now have a student who knows algebra.”
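(As an aside, the arithmetic here is easy to verify mechanically; a trivial, purely illustrative check in Python:)

```python
from fractions import Fraction

# 23x - 8 = -x  =>  24x = 8  =>  x = 8/24 = 1/3
x = Fraction(8, 24)
assert 23 * x - 8 == -x
print(x)  # prints 1/3
```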
But someone else (you?) might say, “Just because you see the student answering some problems correctly does not mean they actually understand. Understanding happens in the internals, and you’ve put no selection pressure directly on what is happening in the student’s mind. Perhaps they merely look like they understand algebra, but are actually faking it, such as by using their smart-glasses to cheat by asking Claude.”
I might say “Fine. Let’s watch them very closely and see if we can spot cheating devices.”
My interlocutor might respond “Even if you witness the externals of the student and verify there are no cheating tools, that doesn’t mean the student actually understands. Perhaps they have simply learned a few heuristics for simple equations, but would fail to generalize to harder questions. Or perhaps they have gotten very good at watching your face and doing a Clever Hans trick. Or perhaps they have understood the rules of symbolic equations, and have entirely missed the true understanding of algebra. You still haven’t put any direct pressure on the student’s mind.”
I might answer “Okay, but we can test harder questions, remove me from the room, and even give them essay tests where they describe the principles of algebra in the abstract. Isn’t each time they pass one of these tests evidence that they actually do understand algebra? Can’t we still just say ‘I now have a student who knows algebra’ at some point, even though there’s some possibility remaining (a pain in my posterior, is what it is!) that we’re wrong?”
Another person might object to this analogy, and say “Testing capabilities is categorically different from testing values. If a student consistently answers algebra problems, we can say that something, whether it’s the student or Claude, is able to answer algebra problems. But there’s no amount of watching external behavior that lets us know why the student is doing the math. Perhaps it’s because they love doing algebra. Or perhaps it’s because they think they’ll get a cookie. Or perhaps it’s because they have developed an algebra-solving reflex that has no deeper goal or feeling. We simply can’t know without looking in their heads.”
I might respond “Surely that’s too strong. I have some knowledge about what various agents want. I see a beaver building a dam and speculate that it’s trying to use wood to get rid of the sound of running water. I put the beaver in a habitat with a speaker that’s playing water sounds and watch the beaver try to build dams over the speakers. Perhaps I am wrong about what it wants, but surely I can observe its behavior to gain evidence.”
A new naysayer responds “The beaver lacks situational awareness. Yes, you can learn what an agent wants by watching their behavior, but if the agent knows you’re watching them then that act of observation might distort things. John’s point is not that it’s impossible to learn about the agent by watching it, but rather that it’s extremely easy for the agent to fake having a desire for corrigibility.”
Yet another naysayer jumps in with “In fact, it’s worse than that! An agent that has been selected for seeming corrigible (or seeming to want any particular thing) will pump against the agent being actually corrigible in all the cases where true corrigibility (or whatever) doesn’t seem as corrigible as something else! If you don’t actually know what you’re looking for, you’ll die from overfitting to your ignorant intuition.”
I might say “I agree that overfitting is an issue to watch out for. Part of the plan is to be wary of that, employing the various techniques to avoid overfitting (e.g. distillation, dropout) that we’ve accumulated over the years, and generally being careful. On the topic of situational awareness, can’t we also run mundane tests there? Yes, if it’s a vast superintelligence we’re probably doomed, but we could imagine a superintelligence like Yunna, where we can isolate single instances or small groups of instances, or slow them down compared to their normal speed, and carefully tune the environment to try to elicit a sense of being unwatched. When they believe themselves to be unwatched, do any incorrigibility behaviors emerge?”
Naysayer: “You won’t be able to do this perfectly. There will still be signs that the agent is potentially being watched that won’t exist in the real world when the agent is deployed in a risky way. Your setup will make the agent paranoid about watchers marinading them, but it can’t actually get a true read on what the agent will do in deployment.”
Me: “But it does give evidence. I agree that training for corrigibility trains for incorrigible things that seem corrigible, but it also trains for corrigibility. The road that I’m envisioning has all these obvious flaws and issues, but none of the flaws and issues are dealbreakers, as far as I can tell; they’re obstacles that make things fraught, but don’t remove the sense in me that maybe a hyper-paranoid, hyper-competent group could muddle through, in the same way that we muddle through in various other domains in engineering and science.”
Naysayer: “You’ll get eaten before you finish muddling.”
Me: “Why? Getting eaten is a behavior. I expect true corrigibility to be extremely hard to get, but part of the point is that if you have trained a thing to behave corrigibly in contexts like the one where you’re muddling, it will behave corrigibly in the real world where you’re muddling.”
So there’s this ethos/thought-pattern where one encounters some claim about some thing X which is hard to directly observe/measure, and this triggers an attempt to find some easier-to-observe thing Y which will provide some evidence about X. This ethos is useful on a philosophical level for identifying fake beliefs, which is why it featured heavily in the Sequences. But I claim that, to a rough approximation, this ethos basically does not work in practice for measuring things X, and people keep shooting themselves in the foot by trying to apply it to practical problems.
What actually happens, when people try to apply that ethos in practice, is that they Do Not Measure What They Think They Are Measuring. The person’s model of the situation is just totally missing the main things which are actually going on; their whole understanding of how X relates to Y is wrong; it’s a coin flip whether they’d even update in the correct direction about X based on observing Y. And the actual right way for a human (as opposed to a Solomonoff inductor) to update in that situation is to just ignore Y for purposes of reasoning about X.
The main thing which jumps out at me in your dialogue is your self-insert repeatedly trying to apply this ethos which does not actually work in practice.
(Also, we can, in fact, observe some of the AI’s internals and run crude checks for things like deception. Prosaic interpretability isn’t great, but it’s also not nothing.)