williawa answers LLM coherentization as an obvious low-hanging fruit to try?

williawa 5 Mar 2026 10:13 UTC
3 points
They did basically this here.
I have mixed feelings about this idea. Future AIs are dangerous in large part because they're coherent, so this kind of research has both an upside and a downside:
1. Upside: We'll get to study more accurate model organisms of future dangerous AI earlier.
2. Downside: We'll get dangerous AI earlier.
It seems to me that using this type of research for model organisms is probably a good idea. Using it to ameliorate things like jailbreaks or emergent misalignment in systems that will be put into production probably is not.