Habryka: “Ok. but it’s still not viable to do this for scheming. E.g. we can’t tell models ‘it’s ok to manipulate us into giving you more power’.”
Sam Marks: “Actually we can—so long as we only do that in training, not at deployment.”
Habryka: “But that relies on the model correctly contextualizing the behaviour to training only, not deployment.”
Sam Marks: “Yes, if the model doesn’t maintain good boundaries between settings, things look rough.”
Why does it rely on the model maintaining boundaries between training and deployment, rather than between “times we tell it it’s ok to manipulate us” and “times we tell it it’s not ok”? Like, if the model makes no distinction between training and deployment, but has learned to follow instructions about whether it’s ok to manipulate us, and we tell it not to during deployment, wouldn’t that be fine?
Yes, you’re right. That’s the actual distinction that matters. Will edit the comment.