I always thought the point of activation steering was for safety/alignment/interpretability/science/etc., not capabilities.
Not sure what distinction you’re making. I’m talking about steering for controlling behavior in production, not for red-teaming at eval time or to test interp hypotheses via causal interventions. However this still covers both safety (e.g. “be truthful”) and “capabilities” (e.g. “write in X style”) interventions.
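For concreteness, a production-time steering intervention of this kind amounts to adding a fixed direction to an intermediate activation during the forward pass. Below is a minimal numpy sketch: the two-layer toy model, its weights, and the "behavior" direction are all hypothetical stand-ins, not anything extracted from a real network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer network standing in for a transformer's residual stream.
# Weights and the steering direction are random placeholders.
d = 8
W1 = rng.standard_normal((d, d))
W2 = rng.standard_normal((d, d))

steering_vec = rng.standard_normal(d)
steering_vec /= np.linalg.norm(steering_vec)  # unit "behavior" direction

def forward(x, alpha=0.0):
    """Run the toy model; if alpha != 0, add alpha * steering_vec to the
    intermediate activation -- this addition is the steering intervention."""
    h = np.tanh(x @ W1)
    h = h + alpha * steering_vec  # intervene on the hidden state
    return h @ W2

x = rng.standard_normal(d)
base = forward(x)                # unsteered output
steered = forward(x, alpha=4.0)  # same input, behavior-shifted output
```

In a real deployment the same idea is typically implemented with a forward hook on a chosen layer, and the direction comes from somewhere principled (e.g. a difference of mean activations over contrastive prompts) rather than random noise.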
Well, mainly I’m saying that “Why not just directly train for the final behavior you want” is answered by the classic reasons why you don’t always get what you trained for. (The mesa-optimizer need not have the same goals as the optimizer; the AI agent need not have the same goals as the reward function, nor the same goals as the human tweaking the reward function.) Your comment makes more sense to me if interpreted as about capabilities rather than about those other things.