You can’t guarantee that it will keep acting that way in the future, and a later change in behavior is exactly what deceptive alignment predicts.
Yes, that’s true. But if your AI is merely supposed to imitate a human, it becomes much easier to prevent deceptive alignment: you can look for the minimal model that mimics a human, and that minimality excludes exotic behaviors.
This is essentially why machine learning works at all: you don’t pick an arbitrary model that happens to fit your training data, you pick the smallest one that does.
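The “pick the smallest model” intuition can be pictured with a toy sketch. Below, “model” is a polynomial and “simplest” means lowest degree: among all polynomials that fit the training data to within a tolerance, we keep the minimal one. The data, the tolerance, and the helper `minimal_fitting_degree` are all hypothetical illustrations, not anything stated in the conversation.

```python
import numpy as np

# Toy illustration of the minimality argument: among all models that fit
# the data, prefer the simplest one. Here the model class is polynomials
# and "simplest" means lowest degree. All numbers are made up for the sketch.

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 40)
y = 2.0 * x + 0.3 + 0.05 * rng.standard_normal(x.size)  # nearly linear data

def minimal_fitting_degree(x, y, max_degree=10, tol=0.01):
    """Return the lowest-degree polynomial whose training MSE is within tol."""
    for degree in range(max_degree + 1):
        coeffs = np.polyfit(x, y, degree)
        mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
        if mse <= tol:
            return degree, coeffs
    # Nothing met the tolerance; fall back to the most expressive model tried.
    return max_degree, coeffs

degree, coeffs = minimal_fitting_degree(x, y)
print(f"minimal degree that fits: {degree}")  # expect 1: the data are ~linear
```

A degree-10 polynomial also fits these points, but the search never reaches it: minimality rules out the wigglier fits the same way the argument above hopes it rules out exotic behaviors in a human-imitator.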