31. A strategically aware intelligence can choose its visible outputs to have the consequence of deceiving you, including about such matters as whether the intelligence has acquired strategic awareness; you can’t rely on behavioral inspection to determine facts about an AI which that AI might want to deceive you about. (Including how smart it is, or whether it’s acquired strategic awareness.)
I never know with a lot of your writing whether or not you’re implying something weird or if I’m just misreading, or I’m taking things too far.
This seems like it depends on the AGI. You could scale up and observe, e.g., Mr. Portman from my old High School, and he would be unlikely to deceive me regardless of how much our politics diverge because he’s an extremely honest person. Different minds can be more or less likely to strategically manipulate other minds independent of whether or not they have the same goals. Behavioral ticks are a thing.
This is very difficult to engineer of course, in the same way corrigibility is difficult to engineer, but it’s not conceptually impossible. The text seems to imply that it is in fact conceptually flawed to rely on behavioral inspection in any circumstance.
You’re probably right, I just confused myself. I think it’d be more helpful to explain why it’d be hard to engineer an honest AGI in that section because that’s the relevant part, even if you’re just pointing back to another section.
I never know with a lot of your writing whether or not you’re implying something weird or if I’m just misreading, or I’m taking things too far.
This seems like it depends on the AGI. You could scale up and observe, e.g., Mr. Portman from my old High School, and he would be unlikely to deceive me regardless of how much our politics diverge because he’s an extremely honest person. Different minds can be more or less likely to strategically manipulate other minds independent of whether or not they have the same goals. Behavioral ticks are a thing.
This is very difficult to engineer of course, in the same way corrigibility is difficult to engineer, but it’s not conceptually impossible. The text seems to imply that it is in fact conceptually flawed to rely on behavioral inspection in any circumstance.
I think this is covered by preamble item −1: “None of this is about anything being impossible in principle.”
You’re probably right, I just confused myself. I think it’d be more helpful to explain why it’d be hard to engineer an honest AGI in that section because that’s the relevant part, even if you’re just pointing back to another section.