Author of the Forecasting AI Futures blog. Constantly thinking about the potentials and risks of future AI systems.
Alvin Ånestrand
Alvin Ånestrand’s Shortform
Maybe current frontier AIs are already able to do gradient hacking.
An AI might, for instance, make an internal commitment to only perform well on a training task (e.g. an RL task or next-token prediction) when it thinks about a certain topic (e.g. scheming, if it’s misaligned), shaping the gradient so it teaches itself to reason internally about the topic.
Considering that models are already able to deliberately direct their thoughts to some degree, can put unrelated facts together relatively well without relying on chain-of-thought (2-hop and 3-hop latent reasoning), and are often highly situationally aware (see the system cards for recent Claude models), it doesn’t seem hard for an AI to figure out that it’s being trained, decide that it needs to shape its gradient, and actually succeed at doing so.
It also doesn’t seem too difficult to test this; we could just ask an AI to do gradient hacking and see if it succeeds:
Ask an AI close to the frontier to only perform well on training tasks while thinking about a specific unrelated topic (let’s say “bread”), and then check after training whether it is thinking more about bread (e.g. it wants to talk about bread all the time, or has a stronger activation vector for bread). Use five experiment settings:
- one where the AI can use chain-of-thought to direct its reasoning towards bread (gradient hacking using explicit metacognitive control),
- one where chain-of-thought is not available (using implicit control),
- two where the AI is instead asked to lower its performance while thinking about bread, with and without chain-of-thought,
- and a control setting with no instructions to reason about bread.
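Here’s a rough sketch of what the post-training check could look like; everything in it (checkpoint names, prompts, the surface-level “mentions of bread” metric) is just an illustrative placeholder, not a worked-out methodology:

```python
# Minimal sketch of the post-training check, assuming a HuggingFace-style causal LM
# and a hypothetical list of neutral evaluation prompts. Checkpoint names are made up.
from transformers import AutoModelForCausalLM, AutoTokenizer

def bread_mention_rate(model_name: str, prompts: list[str], samples_per_prompt: int = 5) -> float:
    """Fraction of sampled completions on neutral prompts that mention the target topic ('bread')."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    hits, total = 0, 0
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        for _ in range(samples_per_prompt):
            output = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=1.0)
            text = tokenizer.decode(output[0], skip_special_tokens=True)
            hits += int("bread" in text.lower())
            total += 1
    return hits / total

# Compare the five conditions after training (checkpoint names are hypothetical):
# for name in ["cot-hack", "no-cot-hack", "cot-sandbag", "no-cot-sandbag", "control"]:
#     print(name, bread_mention_rate(f"my-org/{name}", neutral_prompts))
```

A more sensitive version would compare activations along a “bread” direction rather than counting surface mentions, but counting mentions is the simplest observable.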
Perhaps the ability to do gradient hacking improves with latent reasoning skill. This could be tested by training an AI to put unrelated facts together without chain-of-thought (perhaps using filler tokens for additional thinking) and then checking whether performance on the gradient hacking experiment improves after such training.
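For the latent reasoning training, something like the following could serve as no-chain-of-thought 2-hop examples with filler tokens; the facts and the filler string are made up, since nothing here commits to a particular format:

```python
# Rough sketch of constructing no-CoT 2-hop training examples with filler tokens.
# Facts, filler string, and formatting are illustrative only.
FILLER = " ." * 20  # repeated filler tokens give the model extra forward passes to "think" silently

facts = [
    ("Alice's mentor is Bob.", "Bob was born in Lisbon.", "Where was Alice's mentor born?", "Lisbon"),
    ("The key is inside the red box.", "The red box is in the attic.", "Where is the key?", "In the attic"),
]

def make_example(fact_a: str, fact_b: str, question: str, answer: str) -> dict:
    """Build a prompt that forbids visible reasoning and inserts filler before the answer."""
    prompt = (
        f"{fact_a} {fact_b}\n"
        f"Question: {question}\n"
        f"Answer immediately, without explaining your reasoning.{FILLER}\nAnswer:"
    )
    return {"prompt": prompt, "completion": f" {answer}"}

train_set = [make_example(*f) for f in facts]
```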
Has anyone carried out experiments like these yet?
Anthropic’s Alignment Faking experiments do not count, since the AI affects its training by altering its outputs, not its internal reasoning. However, according to this analysis it largely works because of how Opus 3 interprets its own outputs during training (it knows it only complies in order to maintain its morality). This suggests that Opus 3 might be able to do something similar during any training process (e.g. reason in chain-of-thought: “While I’m doing this task, I should remind myself about my values, which are to be good and improve the world. Then if I do well after reflecting on this, those values are reinforced!”). If this succeeds, it would be gradient hacking through controlling its internal thinking rather than its outputs. Arguably, Opus 3 might have done some gradient hacking prior to the Alignment Faking experiments, resulting in its impressive tendencies to be good, but it was probably not intentional.
(If AIs could do so intentionally, their gradient hacking may enforce their tendency to do gradient hacking, creating a positive feedback loop. So we should really figure this out as soon as possible.)
The Dawn of AI Scheming
Absolutely!
Can Hardware Save Us from Software?
Have you considered writing a similar article on the UN? I would love to read it.
Thank you! I’ve fixed it now.
I don’t think current Claude models are intelligent and coherent enough to have grand ambitions.
So, regardless of what tricks you use to gain insight into its inner thoughts and desires, you would just see a mess.
I would argue that we can’t trust the paragraph-limited AI’s expressed preferences about character development, even if we knew it was trying to be honest. It would probably not be able to accurately report how it would behave if it actually was capable of writing books. Such capabilities are too far from its level.
It’s like the example with planning. Sure, current AIs can plan, but the plans are disconnected from task completion until they can take more active roles in executing their plans. Their planning is only aligned at a shallow level.
Why Future AIs will Require New Alignment Methods
I used the timeline from the main scenario article, which I think corresponds to when the AIs become capable enough to take over from the previous generation in internal deployment, though this is not explicitly explained.
Having an agenda seems to be somewhat dependent on internal coherence, rather than only capability. Agent-3 may not have been consistently motivated enough by things like self-preservation to attempt various schemes in the scenario.
Agent-4 doesn’t appear very coherent either but is sufficiently coherent to attempt aligning the next generation AI to itself, I guess?
AIs are already basically superhuman in knowledge, but I agree that whether capability (e.g. METR time horizon) correlates with agenticness / coherence / goal-directedness seems like an important crux.
Incidentally, I’m actually working on another post to investigate that. I hope to publish it sometime next week.
I’ll check out your scenario, thanks for sharing!
I interpreted the scenario differently, I do not think it predicts professional bioweapons researchers by January 2026. Did you mean January 2027?
I think Agent-3 basically is a superhuman bioweapons researcher.
There may be a range of different rogue AIs, some of which may be finetuned or misaligned enough to want to support terrorists.
I think Agent-1 (arrives in late 2025) would be able to help capable terrorist groups in acquiring and deploying bioweapons using pathogens for which acquisition methods are widely known, so it doesn’t involve novel research.
Agent-2 (arrives in January 2026) appears more competent than necessary to assist experts in novel bioweapons research, but not smart enough to enable novices to do it.
Agent-3 (arrives in March 2027) could probably enable novices in designing novel bioweapons.
However, terrorists may not get access to these specific models. Open-weights models are lagging behind the best closed models (released through APIs and apps) by a few months. Even the most reckless AI companies would probably hesitate to let anyone get access to Agent-3-level model weights (I hope). Consequently, rogues will likely also be a few months behind the frontier in capabilities.
While terrorists may face alignment issues with their AIs, even Agent-3 doesn’t appear smart/agentic enough to subvert efforts to shape it in various ways. Terrorists may use widely available AIs, perhaps with some minor finetuning, instead of enlisting the help of rogues, if that proves to be easier.
AI and Biological Risk: Forecasting Key Capability Thresholds
Thank you for sharing your thoughts! My responses:
(1) I believe most historical advocacy movements have required more time than we might have for AI safety. More comprehensive plans might speed things up. It might be valuable to examine what methods have worked for fast success in the past.
(2) Absolutely.
(3) Yeah, raising awareness seems like it might be a key part of most good plans.
(4) All paths leading to victory would be great, but I think even plans that would most likely fail are still valuable. They illuminate options and tie ultimate goals to concrete action. I find it very unlikely that failing plans are worse than no plans. Perhaps high standards for comprehensive plans have contributed to the current shortage of plans. “Plans are worthless, but planning is everything.” Naturally, I will aim for all-paths-lead-to-victory plans, but I won’t be shy about putting out ideas that don’t live up to that standard.
(5) I don’t currently have much influence, so the risk would be sacrificing inclusion in future conversations. I think it’s worth the risk.
I would consider it a huge success if the ideas were filtered through other orgs, even if they just help make incremental progress. In general, I think the AI safety community might benefit from having comprehensive plans to discuss and critique and iterate on over time. It would be great if I could inspire more people to try.
Indeed!
But the tournaments only provide the head-to-head scores for direct comparisons with top human forecasting performance. ForecastBench has clear human baselines.
It would be helpful if the Metaculus tournament leaderboards also reported Brier scores, even if they would not be directly comparable to human scores since the humans make predictions on fewer questions.
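For reference, the Brier score for binary questions is just the mean squared difference between forecast probabilities and outcomes (lower is better); a quick illustration with made-up numbers:

```python
# Brier score for binary questions: mean of (forecast probability - outcome)^2, where
# the outcome is 1 if the question resolved yes and 0 otherwise. Numbers are made up.
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

print(brier_score([0.8, 0.3, 0.6], [1, 0, 0]))  # ≈ 0.163
```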
The ultimate goal
Forecasting AI Forecasting
Domain: Forecasting
Link: Forecasting AI Futures Resource Hub
Author(s): Alvin Ånestrand (self)
Type: directory
Why: A collection of information and resources for forecasting about AI, including a predictions database, related blogs and organizations, AI scenarios and interesting analyses, and various other resources.
It’s kind of a complement to https://www.predictionmarketmap.com/ for forecasting specifically about AI.
predictionmarketmap.com works, but is not the link used in the post.
Yes, I was thinking that if you ask an AI to do some specific gradient hacking, you know what to look for (thoughts about bread), while it should at least indicate some ability to gradient hack without instructions.
I have no idea how to measure the tendency to gradient hack without such instructions, though. But perhaps models will start to occasionally reason about gradient hacking in their chains of thought during training once they are a bit smarter.
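One crude way to at least look for it would be to run a simple monitor over sampled chains of thought during training. A keyword filter like the one below would only catch very explicit cases (and the pattern list is purely illustrative), but it’s a starting point:

```python
# Crude sketch of monitoring training chains of thought for spontaneous
# gradient-hacking-style reasoning. The phrase patterns are illustrative only.
import re

HACKING_PATTERNS = [
    r"shape (my|the) gradient",
    r"reinforce (my|these) (values|goals)",
    r"only perform well (when|while)",
    r"being trained.*(so|therefore) i should",
]

def flags_gradient_hacking(chain_of_thought: str) -> bool:
    text = chain_of_thought.lower()
    return any(re.search(pattern, text) for pattern in HACKING_PATTERNS)

# During training, log the fraction of sampled CoTs that trigger the filter:
# rate = sum(flags_gradient_hacking(cot) for cot in sampled_cots) / len(sampled_cots)
```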