I’m not sure I understand. We might not be on the same page.
Here’s the concern I’m addressing:
Let’s say we build a fully aligned human-level AGI, but we want to scale it up to superintelligence. This seems much harder to do safely than training the human-level AGI in the first place, since you need a training signal that’s better than human feedback or imitation.
Here’s the point I am making about that concern:
It might actually be quite easy to scale an already aligned AGI up to superintelligence, even if you don’t have a scalable outer-aligned training signal, because the AGI will be motivated to crystallize its aligned objective: an agent that wants its goals achieved also wants its more capable successors to pursue those same goals, so it will actively work to preserve its objective through further training.