I think (with absolutely no inside knowledge, just vibes) Ryan and Sam are concerned that we don't have any guarantees of A, or anything close to guarantees of A, or even an understanding of A, or of whether the model is a coherent thing with goals at all.
Imagine jailbreaking/finetuning/OODing the model so that it has generally human values except for a deep revulsion toward malaria prevention. If I tell it that I need some information about mosquito nets, we don't want it even considering taking positive action to stop me, regardless of how evil it thinks I am, because what if it's mistaken (as it is in this case).
Positive subversive action also gives small misalignments (e.g. Anthropic's values vs. "human values") lots of opportunity to explode: if there's any difference in utility functions, and the model feels compelled to act, and the model is more capable than we are, that leads to failure. Unless we have guarantees of A, allowing agentic subversion seems pretty bad.