David Scott Krueger (formerly: capybaralet) comments on What can be learned from scary demos? A snitching case study

David Scott Krueger (formerly: capybaralet) 16 Mar 2026 14:59 UTC
LW: 2 AF: 1
For now, such evidence is not really relevant to takeover risk because models are weak and can’t execute on complex world domination plans, but I can imagine such arguments becoming more directly relevant in the future.
Maybe a nit re phrasing, but the reasoning here doesn’t make sense. Such evidence is relevant to takeover risk even if the model is known to be weak.
- Fabien Roger 17 Mar 2026 21:53 UTC
  LW: 2 AF: 2
  “Such evidence” is not really clear in my sentence. What I meant is that you don’t get the sort of update on directly related behavior that this first subsection is about, since P(takeover bad from this model) is saturated at 0 for actions from a weak model. But I agree you do get evidence via updating on alignment (across models), which is more like the sort of evidence that the next subsection is about.