So the concern is that “the AI generates a random number, sees that it passes the Fermat test, and outputs it” is the same as “the AI generates a random action, sees that it passes [some completely opaque test that approves any action that either includes no tampering OR includes etheric interference], and outputs it”, right?
Yeah, in that case, the only viable way to handle this is to get something into the system that can distinguish between no tampering and etheric interference. Just like the only way to train an AI to distinguish primes from Carmichael numbers is to find a way to… distinguish them.
Okay, that’s literally tautological. I’m not sure this problem has any internal structure that makes it possible to engage with further, then. I guess I can link the Gooder Regulator Theorem, which seems to formalize the idea that “to get a model that learns to distinguish between two underlying system-states, we need a test that can distinguish between two underlying system-states”.
So the concern is that “the AI generates a random number, sees that it passes the Fermat test, and outputs it” is the same as “the AI generates a random action, sees that it passes [some completely opaque test that approves any action that either includes no tampering OR includes etheric interference], and outputs it”, right?
Mostly—the opaque test is something like an obfuscated physics simulation, and so it tells you if things look good. So you try a bunch of random actions until you get one where things look good. But if you can’t understand the simulation, or the mechanics of the sensor tampering, then there’s not much to do after that so it seems like we’re in trouble.
Okay, that’s literally tautological. I’m not sure this problem has any internal structure that makes it possible to engage with further, then.
It seems like there are plenty of hopes:
I have a single lame example of the phenomenon, which feels extremely unlike sensor tampering. We could get more examples of hard-to-distinguish mechanisms, and understand whether sensor tampering could work this way or if there is some property that means it’s OK.
In fact it is possible to distinguish Carmichael numbers and primes, and moreover we have a clean dataset that we can use to specify which one we want. So in this case there is enough internal structure to do the distinguishing, and the problem is just that the distinguisher is too complex and it’s not clear how to find it automatically. We could work on that.
We could try to argue that without a model of sensor tampering an AI has limited ability to make it happen, so we can just harden the sensors enough that it can’t happen by chance and then declare victory.
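To make the Carmichael point concrete, here is a minimal sketch of one known distinguisher. It assumes a single Miller-Rabin round to base 2 is an acceptable stand-in for "the distinguisher" (the thread doesn't name one), and the function names `fermat_test` and `strong_test` are illustrative, not from any library.

```python
def fermat_test(n, a):
    """Fermat test to base a: True means 'n looks prime'."""
    return pow(a, n - 1, n) == 1


def strong_test(n, a):
    """One Miller-Rabin round for odd n > 2: True means 'n may be prime'."""
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    x = pow(a, d, n)
    if x == 1 or x == n - 1:
        return True
    # Square repeatedly, looking for -1 (mod n) along the way.
    for _ in range(s - 1):
        x = pow(x, 2, n)
        if x == n - 1:
            return True
    return False


# 561 = 3 * 11 * 17 is a Carmichael number; 563 is prime.
for n in (561, 563):
    print(n, fermat_test(n, 2), strong_test(n, 2))
# 561 True False
# 563 True True
```

The extra structure the second test looks at is the chain of square roots of 1 mod n, which the plain Fermat check never inspects. That is a hand-designed distinguisher, though; the open question above is how to find something like it automatically.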
More generally, I’m not happy to give up because “in this situation there’s nothing we can do.” I want to understand whether the bad situation is plausible, and if it is plausible, then how you can measure to see if it’s happening, and how to formalize the kind of assumptions that we’d need to make the problem soluble.
the opaque test is something like an obfuscated physics simulation
I think it’d need to be something weirder than just a physics simulation, to reach the necessary level of obfuscation. Like an interwoven array of highly-specialized heuristics and physical models which blend together in a truly incomprehensible way, and which itself can’t tell whether there’s etheric interference involved or not. The way Fermat’s test can’t tell a Carmichael number from a prime — it just doesn’t interact with the input number in a way that’d reveal the difference between their internal structures.
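As a rough illustration of that blind spot, here is a small sketch that runs the Fermat check against every base coprime to n. The helper name `passes_all_coprime_bases` is made up for this example, and the brute-force loop is only sensible for tiny inputs.

```python
from math import gcd


def passes_all_coprime_bases(n):
    """True if a^(n-1) ≡ 1 (mod n) for every base a coprime to n."""
    return all(pow(a, n - 1, n) == 1 for a in range(1, n) if gcd(a, n) == 1)


print(passes_all_coprime_bases(563))  # prime                      -> True
print(passes_all_coprime_bases(561))  # Carmichael (3 * 11 * 17)   -> True
print(passes_all_coprime_bases(341))  # 11 * 31, fools base 2 only -> False
print(pow(2, 340, 341) == 1)          # 341 does pass to base 2    -> True
```

So the check still catches an ordinary composite like 341 once you vary the base, but no choice of coprime base separates 561 from a prime. The only bases that expose 561 are those sharing a factor with it, which amounts to stumbling on a factor rather than the test itself noticing anything.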
By analogy, we’d need some “simulation” which doesn’t interact with the sensory input in a way that can reveal a structural difference between the presence of a specific type of tampering and the absence of any tampering at all (while still detecting many other types of tampering). Otherwise, we’d have to be able to detect undesirable behavior, with sufficiently advanced interpretability tools. Inasmuch as physical simulations spin out causal models of events, they wouldn’t fit the bill.
It’s a really weird image, and it seems like it ought to be impossible for any complex real-life scenarios. Maybe it’s provably impossible, i.e. we can mathematically prove that any model of the world with the necessary capabilities would have distinguishable states for “no interference” and “yes interference”.
Models of world-models is a research direction I’m currently very interested in, so hopefully we can just rule that scenario out, eventually.
It seems like there are plenty of hopes
Oh, I agree. I’m just saying that there don’t seem to be any other approaches aside from “figure out whether this sort of worst case is even possible, and under what circumstances” and “figure out how to distinguish bad states from good states at the object-level, for whatever concrete task you’re training the AI”.
I definitely agree that this sounds like a really bizarre sort of model and it seems like we should be able to rule it out one way or another. If we can’t then it suggests a different source of misalignment from the kind of thing I normally worry about.