Well, mainly I’m saying that “Why not just directly train for the final behavior you want” is answered by the classic reasons why you don’t always get what you trained for. (The mesaoptimizer need not have the same goals as the optimizer; the AI agent need not have the same goals as the reward function, nor the same goals as the human tweaking the reward function.) Your comment makes more sense to me if interpreted as about capabilities rather than about those other things.