Surely you mean it does not necessarily produce an agent that cares about x? (at any given relevant level of capability)
Full confidence either that we can or that we can't train an agent to have a desired goal seems difficult to justify. I think the point here is that training for corrigibility seems safer than training for other goals, because it makes the agent useful as an ally in keeping it aligned as it grows more capable or designs successors.
Yes.