I am seeing more and more evidence, on Twitter and elsewhere, of ‘jailbroken’ open-weight models like Qwen or Kimi. This has always been a possibility, and jailbreaking via steering vectors is fairly well documented. What I am starting to find concerning is how easy it is to do with agentic coding. Claude Code has no issue helping you jailbreak a model under the weak guise of research: it will happily write prompts for CAA or other steering-vector techniques, apply them to Qwen on GPU servers, and refine its process. I have tried it myself, and it is quite easy to do. I know having a model like Qwen produce meth recipes isn’t exactly the primary concern in AI safety, but it may offer a pathway for a seemingly aligned system to produce misaligned agents outside the primary agent’s alignment constraints. It is also a very easy way for humans to get around alignment constraints for a variety of applications. And it is rapidly coming to require little to no technical know-how as agentic coding becomes more and more powerful.
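For readers unfamiliar with the mechanics: the CAA recipe (Rimsky et al.) boils down to simple vector arithmetic on hidden activations. A toy sketch of just that arithmetic, with synthetic activations standing in for real model states (the array shapes and values here are illustrative, not from any actual model):

```python
import numpy as np

# Toy illustration of Contrastive Activation Addition (CAA) mechanics:
# the steering vector is the mean difference between hidden activations
# on "positive" vs. matched "negative" prompts at some layer, and is
# added (scaled) to the residual stream at inference time.
# The 4-dim activations below are synthetic stand-ins, not real states.
rng = np.random.default_rng(0)

pos_acts = rng.normal(loc=1.0, size=(8, 4))   # prompts exhibiting the behavior
neg_acts = rng.normal(loc=-1.0, size=(8, 4))  # matched prompts without it

# CAA steering vector: mean activation difference across contrastive pairs.
steer = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def apply_steering(hidden, vec, alpha=1.0):
    """Add the scaled steering vector to a hidden state."""
    return hidden + alpha * vec

h = rng.normal(size=4)                    # some later activation
h_steered = apply_steering(h, steer, alpha=2.0)
print(h_steered.shape)  # (4,)
```

The point is how little machinery this takes: the only hard parts are writing good contrastive prompt pairs and wiring the hook into a real model, which is exactly the labor an agentic coder now does for you.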
Omar Ayyub
I agree that the 50% benchmark is less useful and, as METR states, noisier. I do think it’s worth pointing out that while the 80% benchmark is basically on trend:

1) That trend has been faster since the o3 release, though that could just be a general uptick and a constant bias against the current trendline.

2) We are now looking at a model that can do hour-long tasks at 80% success. At the low end of Cotra’s estimate of 3 months from https://www.aifuturesmodel.com/, that gives an automated coder by December 2027. Playing with the bounds (2 days to 3.3 years) of the coding time horizon gives a range of November 2026 to December 2027, so about a year of spread for the automated-coder date. In the most modest estimation, hour-long tasks at 80% success (which has much tighter CIs than 50%) would speed researchers up by roughly 12% over a standard workday if they used it for just a single task they need to do, let alone parallelized it.

3) It’s also worth pointing out the difference between Claude Code and Codex. ChatGPT 5.3 is a lot slower than Claude Opus 4.6, and only OpenAI had access to the fastest versions of ChatGPT 5.3. I wonder how much having a model that is not only more capable but also gives a tighter REPL loop affects things going forward. Long test-time compute can have real impacts!
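A back-of-envelope check on the ~12% figure in point 2, under assumptions that are mine rather than METR’s (an 8-hour workday; one hour-long task delegated per day):

```python
# Rough speed-up estimate for delegating one hour-long task per workday.
# Assumptions (mine, not METR's): an 8-hour workday; a successful
# delegation frees the researcher for the full hour, a failed one
# frees nothing (the hour gets redone by hand).
workday_hours = 8.0
task_hours = 1.0
success_rate = 0.8

# Expected fraction of the day freed, discounting by the success rate.
expected_saving = success_rate * task_hours / workday_hours
print(f"{expected_saving:.1%}")  # 10.0%

# Counting the delegated hour as fully saved (ignoring failures)
# recovers the ~12% figure: 1/8 = 12.5%.
optimistic_saving = task_hours / workday_hours
print(f"{optimistic_saving:.1%}")  # 12.5%
```

So the 12% reading treats the delegated hour as fully recovered; folding in the 80% success rate trims the expected saving to 10%, which is still a meaningful daily gain from a single task.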
People are posting themselves jailbreaking models, or even releasing tools to jailbreak them. I mean that this is done explicitly and celebrated, not that I am implicitly noticing it.