the opaque test is something like an obfuscated physics simulation
I think it’d need to be something weirder than just a physics simulation, to reach the necessary level of obfuscation. Like an interwoven array of highly-specialized heuristics and physical models which blend together in a truly incomprehensible way, and which itself can’t tell whether there’s etheric interference involved or not. The way Fermat’s test can’t tell a Carmichael number from a prime — it just doesn’t interact with the input number in a way that’d reveal the difference between their internal structures.
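To make the Fermat analogy concrete, here's a minimal Python sketch (the choice of 561 and the base range are just illustrative): the Fermat test accepts the Carmichael number 561 for every base coprime to it, even though trial division immediately shows it's composite — the test simply never interacts with the number in a way that exposes its factored structure.

```python
from math import gcd

def fermat_passes(n, bases):
    # n "passes" if a^(n-1) ≡ 1 (mod n) for every tested base coprime to n.
    return all(pow(a, n - 1, n) == 1 for a in bases if gcd(a, n) == 1)

def is_prime(n):
    # Ground truth via trial division (fine for small n).
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

n = 561  # 3 * 11 * 17, the smallest Carmichael number
print(fermat_passes(n, range(2, 100)))  # True: looks "prime" to the test
print(is_prime(n))                      # False: actually composite
```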
By analogy, we’d need some “simulation” which doesn’t interact with the sensory input in a way that can reveal a structural difference between the presence of a specific type of tampering and the absence of any tampering at all (while still detecting many other types of tampering). Otherwise, we’d be able to detect the undesirable behavior with sufficiently advanced interpretability tools. Inasmuch as physical simulations spin out causal models of events, they wouldn’t fit the bill.
It’s a really weird image, and it seems like it ought to be impossible for any complex real-life scenario. Maybe it’s provably impossible, i.e., maybe we can mathematically prove that any model of the world with the necessary capabilities would have distinguishable states for “no interference” and “yes interference”.
Models of world-models is a research direction I’m currently very interested in, so hopefully we can just rule that scenario out, eventually.
It seems like there are plenty of hopes
Oh, I agree. I’m just saying that there don’t seem to be any approaches other than “figure out whether this sort of worst case is even possible, and under what circumstances” and “figure out how to distinguish bad states from good states at the object level, for whatever concrete task you’re training the AI”.
I definitely agree that this sounds like a really bizarre sort of model, and it seems like we should be able to rule it out one way or another. If we can’t, then it suggests a different source of misalignment from the kind of thing I normally worry about.