One thing I am curious about is whether the full set of personas transfers over with distillation, or only the main “Assistant persona.” If it is mostly the latter, does that suggest that models pre-trained entirely via distillation are more aligned by default?
genericname-2
genericname-2’s Shortform
I’d like to hear Ryan talk more about his views on Anthropic and on Dario’s writings.
Given that Recursive Self-Improvement (RSI) is the main short-timelines risk model, why don’t we focus more technical governance effort on targeted ways to prevent this specific risk?
My understanding is that most current governance proposals fall into either the maximalist camp (halt all frontier progress) or the minimalist camp (transparency, liability, etc.). However, the targeted approach of specifically restricting automated AI R&D loops seems underexplored. I think there is far more political will for this kind of intervention, and it would still address the main risk model.
The difficult part is actually designing this mechanism. I think a reasonable threshold would lie somewhere between the adequacy point and the parity point described by Ajeya Cotra (https://www.planned-obsolescence.org/p/six-milestones-for-ai-automation).