Surely you mean it does not necessarily produce an agent that cares about x? (at any given relevant level of capability)
Full confidence either that we can or that we can't train an agent to have a desired goal seems difficult to justify. I think the point here is that training for corrigibility seems safer than training for other goals, because it makes the agent useful as an ally in keeping it aligned as it grows more capable or designs successors.
Yes.