I am not sure I agree about the last point. I think, as mentioned, that alignment is going to be crucial for the usefulness of AIs, and so the economic incentives would actually push toward spending more on alignment.
Can you say more about how alignment is crucial for the usefulness of AIs? I’m thinking especially of AIs that are scheming / alignment faking / etc.; it seems to me that these AIs would be very useful (or at least would appear so) until it’s too late.
Indeed, from an instrumental perspective, the AIs that conclude that being maximally helpful is the best way to get themselves empowered (on all tasks besides supervising copies of themselves or other AIs they are cooperating with) will be much more useful than AIs that care about some random thing and haven’t made the same update that being helpful, and therefore empowered, is the best way to get that thing. “Motivation” seems like a big problem with getting value out of AI systems generally, so you should expect the deceptively aligned ones to be much more useful (until, of course, it’s too late, or they are otherwise threatened and the convergence disappears).
Seems very sensitive to the type of misalignment, right? As an extreme example, suppose literally all AIs have long-run and totally inhuman preferences with linear returns. Such AIs might instrumentally decide to be as useful as possible (at least in domains other than safety research) for a while prior to a treacherous turn.