I think this is a great argument for why training away deception is likely to be hard, but on our current trajectory, I suspect it doesn’t actually take a deceptive agent to kill us.
AI labs will probably just directly build tree search over world states into an AGI and crank up its power as fast as possible, and that’s enough to wreak havoc even without generalized deception or high levels of reflectivity.
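For concreteness, here is a toy sketch (my own gloss, not anything from the post) of what "tree search over world states" could mean here, in the spirit of MuZero-style planners. `world_model`, `value_fn`, and the action set are hypothetical placeholders for learned components:

```python
# Toy sketch: depth-limited tree search over imagined world states.
# `world_model` and `value_fn` stand in for a learned dynamics model
# and a learned state evaluator; nothing here is any lab's actual API.

def plan(state, world_model, value_fn, actions, depth):
    """Return (best value, best first action) by exhaustive lookahead."""
    if depth == 0:
        return value_fn(state), None
    best_value, best_action = float("-inf"), None
    for action in actions:
        next_state = world_model(state, action)  # imagine the successor state
        value, _ = plan(next_state, world_model, value_fn, actions, depth - 1)
        if value > best_value:
            best_value, best_action = value, action
    return best_value, best_action

# Toy usage: integer states, actions nudge the state, value prefers bigger states.
best_value, first_action = plan(0, lambda s, a: s + a, lambda s: s, [-1, 1], 3)
print(best_value, first_action)  # 3 1
```

"Cranking up its power" then just means more depth, more branching, and a better world model; no deception or reflectivity is needed for the resulting optimization to be dangerous.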
Put another way, if we actually manage to reach the point where the more interesting failure modes postulated in this story are possible, at least some parts of AI alignment or governance will have gone better than I currently expect.
This is up there with It Looks Like You’re Trying To Take Over The World as a concrete and fun-to-read AI takeoff scenario. Bravo :)