I’m of mixed opinion about this idea. Future AIs are dangerous in large part because they’re coherent. This means there’s an upside and a downside to this kind of research.
Upside) We’ll get to study more accurate model organisms of future dangerous AI earlier.
Downside) We’ll get dangerous AI earlier.
Seems to me using this type of research for model organisms is probably a good idea. Using it to ameliorate stuff like jailbreaks or emergent misalignment in systems that will be put in production is probably not.
They did basically this here.