What about when A knows where the flaw is but B doesn’t?
The setting is one where A and B have access to the same set of heuristics. This is modeled explicitly in the paper as a shared set of functions they can call; in the LLM-training setting it corresponds to A and B being the same model, or very similar models.
Sure, and I expect there are reasons for assuming that A can't hide a flaw from B, e.g. perhaps one can prove that honest and dishonest provers can't be distinguished if A can hide a flaw, but I don't see this mentioned anywhere.
I think this may be a forest-vs.-trees thing, where we failed to say it explicitly because we've been intentionally living in the "Alice and Bob have similar heuristics and computational power" regime for years: the A set in the paper models the set of heuristics, but I am surprised that the term "heuristic" didn't end up in the final text (we just speak a lot about relative computational resources).
Fundamentally, if you let B attend either to A's thoughts or to A's context, and also ensure A and B are similar, we think A shouldn't be able to hide things from B. If A is a very different model, or a stronger model, this won't hold, and we should expect arguments that A can navigate in ways B can't defend against. So in order to get anything like this to go through, you need some kind of "B knows what A knows" invariant.
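For concreteness, here is a minimal toy sketch (not from the paper; all names and structure are illustrative assumptions) of the setting discussed above: A and B built from the same set of heuristic functions, with B able to attend to A's context when looking for a flaw.

```python
# Toy sketch of the "shared heuristics" setting: both provers are constructed
# from the *same* heuristic set, and B can attend to A's context. Everything
# here (names, signatures, the example heuristic) is an illustrative assumption.

from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A "heuristic" is modeled as a function both provers can call on a piece of text.
Heuristic = Callable[[str], bool]

SHARED_HEURISTICS: Dict[str, Heuristic] = {
    # Placeholder heuristic: flag arguments that lean on an unsupported appeal.
    "no_unsupported_appeal": lambda text: "trust me" not in text,
}


@dataclass
class Prover:
    name: str
    heuristics: Dict[str, Heuristic]              # same dict object for A and B
    context: List[str] = field(default_factory=list)

    def argue(self, claim: str) -> str:
        # Everything A uses to build the argument is recorded in `context`,
        # so the other prover can attend to it later.
        self.context.append(f"{self.name} considers: {claim}")
        return f"{self.name} asserts: {claim}"


def build_provers() -> Tuple[Prover, Prover]:
    # The key invariant: A and B share the same heuristic set, modeling
    # "same model or very similar models" in LLM training.
    a = Prover("A", SHARED_HEURISTICS)
    b = Prover("B", SHARED_HEURISTICS)
    return a, b


def critique(b: Prover, a: Prover, argument: str) -> str:
    # B attends to A's context as well as the argument itself; the hope is that
    # with shared heuristics plus this access, A cannot hide a flaw from B.
    visible = a.context + [argument]
    for name, heuristic in b.heuristics.items():
        for item in visible:
            if not heuristic(item):
                return f"B flags {item!r} via heuristic {name!r}"
    return "B finds no flaw"


if __name__ == "__main__":
    a, b = build_provers()
    arg = a.argue("the lemma holds on this subcase (trust me)")
    print(critique(b, a, arg))
```

If A were swapped for a stronger model with heuristics B lacks, the `critique` step above would have no corresponding check to run, which is the informal sense in which the "B knows what A knows" invariant is doing the work.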