I don’t think “inner alignment” is applicable here.
If the clone behaves indistinguishably from the human it is based on, then there is simply nothing more to say. It doesn’t matter what is going on inside.
Right, I agree on that. The problem is, “behaves indistinguishably” for how long? You can’t guarantee that it won’t stop behaving that way in the future, which is exactly what deceptive alignment predicts.
Yes, that’s true. But if your AI is merely supposed to imitate a human, deceptive alignment becomes much easier to prevent: you can search for the minimal model that mimics the human, and that minimality excludes exotic behaviors such as deception.
This is essentially why machine learning works at all: you don’t pick an arbitrary model that happens to fit your training data well, you pick the smallest one that does.
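As a toy illustration of that minimality argument (not anything from the thread itself): the sketch below fits polynomials of increasing degree to noisy data and keeps the smallest degree whose held-out error is close to the best achievable. The polynomial family, the 10% tolerance, and all names here are illustrative assumptions, not a real training procedure.

```python
# Toy sketch of "pick the smallest model that fits" (Occam-style model
# selection). Everything here is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for "behavior to imitate": a simple quadratic plus noise.
x_train = rng.uniform(-1, 1, 40)
y_train = 1.0 + 2.0 * x_train - 3.0 * x_train**2 + rng.normal(0, 0.1, 40)
x_val = rng.uniform(-1, 1, 40)
y_val = 1.0 + 2.0 * x_val - 3.0 * x_val**2 + rng.normal(0, 0.1, 40)

def val_error(degree):
    """Fit a polynomial of the given degree; return held-out MSE."""
    coeffs = np.polyfit(x_train, y_train, degree)
    return np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)

errors = {d: val_error(d) for d in range(1, 15)}
best = min(errors.values())

# Rather than the best-scoring model, keep the *smallest* one whose
# error is within 10% of the best -- the minimality bias the comment
# appeals to. Higher-degree models fit about equally well here, but
# can behave wildly off-distribution (the "exotic behaviors").
minimal_degree = min(d for d, e in errors.items() if e <= 1.1 * best)
print(f"minimal adequate degree: {minimal_degree}")  # typically 2
```

On this toy task the minimal adequate model recovers the true degree, while the larger models that fit the training data equally well are free to diverge off-distribution, which is the analogue of the exotic behaviors the minimality bias is meant to exclude.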