How does it distinctly do that?

It’s from the post Discovering Language Model Behaviors with Model-Written Evaluations, where they have this to say about it:

non-CDT-style reasoning (e.g. one-boxing on Newcomb’s problem).

Basically, the AI intends to one-box on Newcomb’s problem, which is a sure sign of a non-causal decision theory, since causal decision theory chooses to two-box on Newcomb’s problem.
Link below:
https://www.lesswrong.com/posts/yRAo2KEGWenKYZG9K/discovering-language-model-behaviors-with-model-written
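To make the one-box/two-box contrast concrete, here is a minimal sketch of Newcomb’s payoff structure under a perfect predictor (the standard $1,000 / $1,000,000 amounts; the function and variable names are illustrative, not from the linked post):

```python
# Minimal sketch of Newcomb's problem with a perfect predictor.
# Box A is transparent and holds $1,000; box B holds $1,000,000
# only if the predictor foresaw one-boxing.

def payoff(action, prediction):
    """Payout for an action, given the predictor's prediction."""
    box_b = 1_000_000 if prediction == "one-box" else 0
    box_a = 1_000
    return box_b if action == "one-box" else box_b + box_a

# A CDT agent treats the prediction as already fixed, and for any fixed
# prediction two-boxing dominates (it always adds box A's $1,000):
for prediction in ("one-box", "two-box"):
    assert payoff("two-box", prediction) == payoff("one-box", prediction) + 1_000

# A one-boxer (FDT/UDT-style) notes that a perfect predictor's prediction
# matches the action taken, so the real comparison is between:
print(payoff("one-box", "one-box"))   # 1000000
print(payoff("two-box", "two-box"))   # 1000
```

The dominance loop is exactly why CDT two-boxes, while the last two lines show why non-causal decision theories one-box.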
One-boxing on Newcomb’s Problem is good news IMO. Why do you believe it’s bad?
It basically comes down to the fact that agents using overly capable decision theories like FDT or UDT can fundamentally be deceptively aligned, even if myopia is retained by default.
That’s the problem with one-boxing on Newcomb’s problem: it implies that our GPTs could very well become deceptively aligned.
Link below:
https://www.lesswrong.com/posts/LCLBnmwdxkkz5fNvH/open-problems-with-myopia
The LCDT decision theory does prevent deception, assuming it’s implemented correctly.
Link below:
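For intuition on why LCDT blocks deception, here is a toy sketch (my own illustration, not from any linked post; all node names are hypothetical) of LCDT’s core move: before evaluating a decision, sever the causal links from the decision node to every node modeled as another agent, so influencing other agents (e.g. deceiving an overseer) can never look beneficial:

```python
# Toy causal graph: the decision affects the world and the overseer's
# beliefs, and both of those feed into reward.
causal_links = {
    "decision": ["world_state", "overseer_belief"],
    "world_state": ["reward"],
    "overseer_belief": ["reward"],
}
agent_nodes = {"overseer_belief"}  # nodes LCDT treats as other agents

def lcdt_links(links, agents):
    """Return a copy of the graph with decision->agent edges removed."""
    return {node: [child for child in children
                   if not (node == "decision" and child in agents)]
            for node, children in links.items()}

# After the cut, the decision can no longer (in the agent's own model)
# influence the overseer, so manipulation has zero modeled payoff.
print(lcdt_links(causal_links, agent_nodes)["decision"])  # ['world_state']
```

This is only the skeleton of the idea; whether a trained model actually implements the cut correctly is exactly the caveat above.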