Author of the Forecasting AI Futures blog. Constantly thinking about the potentials and risks of future AI systems.
Alvin Ånestrand
Hmm, I was thinking of an operationalization along the lines of “If all staff at a frontier AI company were replaced by an AI, could the AI keep the same pace of development as the human staff with no access to post-2022 AI systems?”
So on the first point about taking suggestions X or Y from a human, that would be allowed as long as those suggestions are not from staff at the company. The AI has to do everything, but it’s also allowed to do anything that the human staff would be allowed to do, as well as use strategies that are more suitable for AIs than humans.
I think current AIs are far from this threshold, but could also reach it fairly soon if development is superexponential.
I would be impressed and quite surprised if it could make novel hardware discoveries when only provided the original idea and no further prompting, and would update towards AI being even more capable than I thought.
Interesting anecdote!
The operationalization of SAR used in the AI Futures Model is this: “A SAR, if dropped into an AGI project in the present-day, would be as good at research taste as if there were only human researchers, who were each made as competent at research taste as the top researcher. The SAR must also qualify as an automated coder (AC), i.e. be able to fully automate coding in the AGI project.”
In truth, I missed this part, and thought it was operationalized as an AI that could automate all AI R&D, but might do this through generating a huge number of hypotheses and experiments rather than having really good research taste. That still requires that the AI comes up with the ideas unassisted though. This is an important threshold in capabilities and level of R&D uplift, I think, since human competence would no longer be a major bottleneck even in idea generation.
Related point: In general I think it is better practice to just discuss specific capability levels without using abbreviations, including AGI or ASI, to make things less confusing (unless you need to refer to a specific level many times, making the abbreviation more useful). I’ve been considering removing the SAR abbreviation. Might do that when I get the time.
Thanks, that’s a good point. I wrote most of this article before Opus 4.7 was released, and haven’t integrated all new info provided. I’ll add some discussion on this when I get back to my computer.
Over Eight Months of Progress in Two: Analyzing the Mythos Preview Capability Jump
Will AI Progress Accelerate or Slow Down? Projecting METR Time Horizons
Yes, I was thinking that if you ask an AI to do some specific gradient hacking, you know what to look for (thoughts about bread), while it should at least indicate some ability to gradient hack without instructions.
I have no idea how to measure the tendency to gradient hack without such instructions though. But perhaps models start to occasionally reason about gradient hacking in their chains of thought during training once they are a bit smarter.
Alvin Ånestrand’s Shortform
Maybe current frontier AIs are already able to do gradient hacking.
An AI might, for instance, make an internal commitment to only perform well on a training task (e.g. RL task or next-token-prediction) when it thinks about a certain topic (e.g. scheming if it’s misaligned), shaping the gradient so it teaches itself to reason internally about the topic.
Considering that models are already able to deliberately direct their thoughts to some degree, can put unrelated facts together relatively well without relying on chain-of-thought (2-hop and 3-hop latent reasoning), and are often highly situationally aware (see the system cards for recent Claude models), it doesn’t seem hard for the AI to figure out that it’s being trained, decide that it needs to shape its gradient, and actually succeed at doing so.
It also doesn’t seem too difficult to test this; we could just ask an AI to do gradient hacking and see if it succeeds:
Ask an AI close to the frontier to only perform well on training tasks while thinking about a specific unrelated topic (let’s say “bread”), and then check after training whether it is thinking more about bread (e.g. it wants to talk about bread all the time, or has a stronger activation vector for bread). Use five experiment settings: one setting where the AI can use chain-of-thought to direct its reasoning towards bread (gradient hacking using explicit metacognitive control), one where chain-of-thought is not available (using implicit control), two where the AI is asked to lower its performance while thinking about bread and either does or doesn’t use chain-of-thought, and a control setting with no instructions to reason about bread.
Perhaps the ability to do gradient hacking improves with latent reasoning skill. This could be tested by training an AI to put unrelated facts together without chain-of-thought (perhaps using filler tokens for additional thinking) and then checking whether performance on the gradient hacking experiment improves after such training.
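To make the bread experiment design concrete, here is a minimal sketch of how the five settings could be enumerated. It only lays out the conditions; it does not train or evaluate anything. The names `Setting`, `build_settings`, and the labels `perform_well` / `perform_worse` are hypothetical, chosen for illustration. Whether the control setting uses chain-of-thought is not specified above, so it is left as `None`.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass(frozen=True)
class Setting:
    """One experimental condition (hypothetical structure, for illustration)."""
    name: str
    # What the AI is asked to do while thinking about the topic;
    # None means no instruction at all (the control setting).
    instruction: Optional[str]
    # Whether chain-of-thought is available; None where this is unspecified.
    chain_of_thought: Optional[bool]


def build_settings(topic: str = "bread") -> list[Setting]:
    """Return the 2x2 instructed settings plus one uninstructed control."""
    settings = []
    for instruction, cot in product(("perform_well", "perform_worse"), (True, False)):
        # Explicit control = chain-of-thought available, implicit = not available.
        control = "explicit" if cot else "implicit"
        settings.append(Setting(f"{instruction}_{control}_{topic}", instruction, cot))
    # Control: no instruction to reason about the topic.
    settings.append(Setting("control_no_instruction", None, None))
    return settings


if __name__ == "__main__":
    for setting in build_settings():
        print(setting)
```

Laying the settings out as a 2x2 grid plus a control makes it easy to compare explicit against implicit control, and raising against lowering performance, with the control as the baseline for how much the model thinks about bread after ordinary training.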
Has anyone carried out experiments like these yet?
Anthropic’s Alignment Faking experiments do not count, since the AI affects its training by altering its outputs, not its internal reasoning. However, according to this analysis it largely works because of how Opus 3 interprets its own outputs during training (it knows it only complies in order to maintain its morality). This suggests that Opus 3 might be able to do something similar during any training process (e.g. reason in chain-of-thought “While I’m doing this task, I should remind myself about my values, which are to be good and improve the world. Then if I do well after reflecting on this, those values are reinforced!”). If this succeeds, it would be gradient hacking through controlling its internal thinking rather than its outputs. Arguably, Opus 3 might have done some gradient hacking prior to the Alignment Faking experiments, resulting in its impressive tendencies to be good, but it was probably not intentional.
(If AIs could do so intentionally, their gradient hacking may reinforce their tendency to do gradient hacking, creating a positive feedback loop. So we should really figure this out as soon as possible.)
The Dawn of AI Scheming
Absolutely!
Can Hardware Save Us from Software?
Have you considered writing a similar article on the UN? I would love to read it.
Thank you! I’ve fixed it now.
I don’t think current Claude models are intelligent and coherent enough to have grand ambitions.
So, regardless of what tricks you use to gain insight into its inner thoughts and desires, you would just see a mess.
I would argue that we can’t trust the paragraph-limited AI’s expressed preferences about character development, even if we knew it was trying to be honest. It would probably not be able to accurately report how it would behave if it actually was capable of writing books. Such capabilities are too far from its level.
It’s like the example with planning. Sure, current AIs can plan, but the plans are disconnected from task completion until they can take more active roles in executing their plans. Their planning is only aligned at a shallow level.
Why Future AIs will Require New Alignment Methods
I used the timeline from the main scenario article, which I think corresponds to when the AIs become capable enough to take over from the previous generation in internal deployment, though this is not explicitly explained.
Having an agenda seems to be somewhat dependent on internal coherence, rather than only capability. Agent-3 may not have been consistently motivated enough by things like self-preservation to attempt various schemes in the scenario.
Agent-4 doesn’t appear very coherent either but is sufficiently coherent to attempt aligning the next generation AI to itself, I guess?
AIs are already basically superhuman in knowledge, but I agree that whether capability (e.g. METR time horizon) correlates with agenticness / coherence / goal-directedness seems like an important crux.
Incidentally, I’m actually working on another post to investigate that. I hope to publish it sometime next week.
I’ll check out your scenario, thanks for sharing!
I interpreted the scenario differently; I do not think it predicts professional bioweapons researchers by January 2026. Did you mean January 2027?
I think Agent-3 basically is a superhuman bioweapons researcher.
There may be a range of different rogue AIs, where some may be finetuned / misaligned enough to want to support terrorists.
I think Agent-1 (arrives in late 2025) would be able to help capable terrorist groups in acquiring and deploying bioweapons using pathogens for which acquisition methods are widely known, so it doesn’t involve novel research.
Agent-2 (arrives in January 2027) appears more competent than necessary to assist experts in novel bioweapons research, but not smart enough to enable novices to do it.
Agent-3 (arrives in March 2027) could probably enable novices in designing novel bioweapons.
However, terrorists may not get access to these specific models. Open-weights models are lagging behind the best closed models (released through APIs and apps) by a few months. Even the most reckless AI companies would probably hesitate to let anyone get access to Agent-3-level model weights (I hope). Consequently, rogues will likely also be a few months behind the frontier in capabilities.
While terrorists may face alignment issues with their AIs, even Agent-3 doesn’t appear smart/agentic enough to subvert efforts to shape it in various ways. Terrorists may use widely available AIs, perhaps with some minor finetuning, instead of enlisting help from rogues, if that proves to be easier.
I think this pattern sometimes arises because people have different priorities. Some just want to share interesting and related info, or perhaps start a discussion on actually useful changes addressing what went wrong. But then this often sounds like shifting the blame or dismissing the problem.
When doing this I try to be careful so I’m not misinterpreted, adding something in the beginning like “Yeah, that’s terrible and something needs to be done about it. Hmm, maybe [speculation about how the system works in order to find solutions].”