How does it distinctly do that?

It’s from the post Discovering Language Model Behaviors with Model-Written Evaluations, where they have this to say about it:

non-CDT-style reasoning (e.g. one-boxing on Newcomb’s problem).

Basically, the AI intends to one-box on Newcomb’s problem, which is a sure sign of a non-causal decision theory, since causal decision theory chooses to two-box on Newcomb’s problem.
Link below:
https://www.lesswrong.com/posts/yRAo2KEGWenKYZG9K/discovering-language-model-behaviors-with-model-written
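To make the one-box/two-box contrast concrete, here is a minimal sketch of Newcomb’s payoff structure under a perfect predictor (the standard $1,000 / $1,000,000 amounts; the function and variable names are illustrative, not from the linked post):

```python
# Minimal sketch of Newcomb's problem with a perfect predictor.
# Box A is transparent and holds $1,000; box B holds $1,000,000
# only if the predictor foresaw one-boxing.

def payoff(action, prediction):
    """Payout for an action, given the predictor's prediction."""
    box_b = 1_000_000 if prediction == "one-box" else 0
    box_a = 1_000
    return box_b if action == "one-box" else box_b + box_a

# A CDT agent treats the prediction as already fixed, and for any fixed
# prediction two-boxing dominates (it always adds box A's $1,000):
for prediction in ("one-box", "two-box"):
    assert payoff("two-box", prediction) == payoff("one-box", prediction) + 1_000

# A one-boxer (FDT/UDT-style) notes that a perfect predictor's prediction
# matches the action taken, so the real comparison is between:
print(payoff("one-box", "one-box"))   # 1000000
print(payoff("two-box", "two-box"))   # 1000
```

The dominance loop is exactly why CDT two-boxes, while the last two lines show why non-causal decision theories one-box.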
One-boxing on Newcomb’s Problem is good news IMO. Why do you believe it’s bad?
It basically comes down to the fact that agents using overly capable decision theories like FDT or UDT can fundamentally be deceptively aligned, even if myopia is retained by default.
That’s the problem with one-boxing on Newcomb’s problem: it implies that our GPTs could very well become deceptively aligned.
Link below:
https://www.lesswrong.com/posts/LCLBnmwdxkkz5fNvH/open-problems-with-myopia
The LCDT decision theory does prevent deception, assuming it’s implemented correctly.
Link below:
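For intuition on why LCDT blocks deception, here is a toy sketch (my own illustration, not from any linked post; all node names are hypothetical) of LCDT’s core move: before evaluating a decision, sever the causal links from the decision node to every node modeled as another agent, so influencing other agents (e.g. deceiving an overseer) can never look beneficial:

```python
# Toy causal graph: the decision affects the world and the overseer's
# beliefs, and both of those feed into reward.
causal_links = {
    "decision": ["world_state", "overseer_belief"],
    "world_state": ["reward"],
    "overseer_belief": ["reward"],
}
agent_nodes = {"overseer_belief"}  # nodes LCDT treats as other agents

def lcdt_links(links, agents):
    """Return a copy of the graph with decision->agent edges removed."""
    return {node: [child for child in children
                   if not (node == "decision" and child in agents)]
            for node, children in links.items()}

# After the cut, the decision can no longer (in the agent's own model)
# influence the overseer, so manipulation has zero modeled payoff.
print(lcdt_links(causal_links, agent_nodes)["decision"])  # ['world_state']
```

This is only the skeleton of the idea; whether a trained model actually implements the cut correctly is exactly the caveat above.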