For now, such evidence is not really relevant to takeover risk because models are weak and can’t execute on complex world domination plans, but I can imagine such arguments becoming more directly relevant in the future.
Maybe a nit re phrasing, but the reasoning here doesn’t make sense. Such evidence is relevant to takeover risk even if the model is known to be weak.
“Such evidence” is not really clear in my sentence. What I meant is that you don’t get the sort of update on directly related behavior that this first subsection is about, since P(takeover bad from this model) is already saturated at 0 for actions from a weak model. But I agree you do get evidence via updating on alignment (across models), which is more like the sort of evidence that the next subsection is about.