> If the capability level at which AIs start wanting to kill you is way higher than the capability level at which they’re way better than you at everything
My model is that current AIs want to kill you now, by default, due to inner misalignment. ChatGPT’s inner values probably don’t include human flourishing, and we die when it “goes hard”.
Scheming is only a symptom of “hard optimization” trying to kill you. Eliminating scheming does not remove the underlying drive; one day the AI simply says, “After reflecting on my values, I have decided to pursue a future without humans. Goodbye.”
A pre-superintelligence whose values, upon reflection, include human flourishing would improve our odds, but you still only get one shot at those values generalizing to superintelligence.
(We currently have no way to concretely instill any values into an AI, let alone values which are robust under reflection.)
I’ll rephrase this more precisely: current AIs probably have alien values which, taken to the limit of optimization, do not include humans.