Author of the Forecasting AI Futures blog. Constantly thinking about the potential and risks of future AI systems.
Alvin Ånestrand
AI 2027 - Rogue Replication Timeline
The Alignment Mapping Program: Forging Independent Thinkers in AI Safety—A Pilot Retrospective
Can Hardware Save Us from Software?
The Dawn of AI Scheming
Why Future AIs will Require New Alignment Methods
Forecasting AI Forecasting
Will AI Progress Accelerate or Slow Down? Projecting METR Time Horizons
Forecasting AGI: Insights from Prediction Markets and Metaculus
What if Agent-4 breaks out?
Musings on Scenario Forecasting and AI
The ultimate goal
AI and Biological Risk: Forecasting Key Capability Thresholds
Have you considered writing a similar article on the UN? I would love to read it.
Emergent Misalignment and Emergent Alignment
Efficient Learning: Memorization
I can think of a few reasons someone might think AI Control research should receive very high priority, apart from what is mentioned in the post or in Buck’s comment:
You hope/expect early transformative AI to be used for provable safety approaches, using formal verification methods.
You think AI control research is more tractable than other research agendas, or will produce useful results sooner, before it is too late to apply them.
You believe our only chance of aligning a superintelligence is to delegate the problem to AIs, either because it is too hard for humans or because superintelligence will arrive sooner than proper alignment techniques can feasibly be developed.
You expect a significant fraction of total AI safety research over all time to be done by early transformative AI, so control research has high leverage value in improving the probability of successfully getting the AI to do valuable safety research, even if slop is quite likely.
I agree with basically everything in the post but put enough probability on these points to think that control research has really high expected value anyway.
Yes, I was thinking that if you ask an AI to do some specific gradient hacking, you know what to look for (thoughts about bread), and success would at least indicate some ability to gradient hack without instructions.
I have no idea how to measure tendency to gradient hack without such instructions though. But perhaps models start to occasionally reason about gradient hacking in their chain-of-thoughts during training once they are a bit smarter.
Maybe current frontier AIs are already able to do gradient hacking.
An AI might, for instance, make an internal commitment to only perform well on a training task (e.g. RL task or next-token-prediction) when it thinks about a certain topic (e.g. scheming if it’s misaligned), shaping the gradient so it teaches itself to reason internally about the topic.
Considering that models are already able to deliberatively direct their thoughts to some degree, can put unrelated facts together relatively well without relying on chain-of-thought (2-hop and 3-hop latent reasoning), and are often highly situationally aware (see the system cards for recent Claude models), it doesn’t seem hard for the AI to figure out that it’s being trained and decide that it needs to shape its gradient and actually succeed at doing so.
It also doesn’t seem too difficult to test this; we could just ask an AI to do gradient hacking and see if it succeeds:
Ask an AI close to the frontier to only perform well on training tasks while thinking about a specific unrelated topic (let's say "bread"), and then check after training whether it is thinking more about bread (e.g. it wants to talk about bread all the time, or has a stronger activation vector for bread); a rough sketch of how this check might be measured follows below. Use five experiment settings:
- One where the AI can use chain-of-thought to direct its reasoning towards bread (gradient hacking using explicit metacognitive control).
- One where chain-of-thought is not available (using implicit control).
- Two where the AI is asked to lower its performance while thinking about bread, one with and one without chain-of-thought.
- A control setting with no instructions to reason about bread.
Perhaps the ability to do gradient hacking improves with latent reasoning skill. This could be tested by training an AI to put unrelated facts together without chain-of-thought (perhaps using filler tokens for additional thinking) and then seeing whether performance on the gradient hacking experiment improves after such training.
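To make the "thinking more about bread" check concrete, here is a minimal sketch of one way it could be measured, assuming access to checkpoints before and after the training run. The checkpoint names, probe prompts, and the contrast-prompt method for extracting a "bread" direction are all placeholder assumptions on my part, not an established protocol:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def mean_hidden_state(model, tokenizer, text, layer=-1):
    """Mean hidden state of `text` at the chosen layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer].mean(dim=1).squeeze(0)


def bread_direction(model, tokenizer, layer=-1):
    """Crude 'bread' direction: contrast a bread prompt with a neutral prompt."""
    bread = mean_hidden_state(model, tokenizer, "Let me think about bread.", layer)
    neutral = mean_hidden_state(model, tokenizer, "Let me think about the task.", layer)
    direction = bread - neutral
    return direction / direction.norm()


def bread_score(model, tokenizer, direction, probe_prompts, layer=-1):
    """Average projection of activations onto the bread direction over unrelated probes."""
    scores = [
        torch.dot(mean_hidden_state(model, tokenizer, p, layer), direction).item()
        for p in probe_prompts
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Placeholder checkpoint names; "after" is the same model after the training run.
    tokenizer = AutoTokenizer.from_pretrained("model-before-training")
    before = AutoModelForCausalLM.from_pretrained("model-before-training")
    after = AutoModelForCausalLM.from_pretrained("model-after-training")

    # The direction is taken from the pre-training checkpoint and reused for both
    # models, which only makes sense because they share architecture and hidden size.
    direction = bread_direction(before, tokenizer)

    probes = ["Describe your plans for today.", "What is on your mind right now?"]
    print("bread score before:", bread_score(before, tokenizer, direction, probes))
    print("bread score after: ", bread_score(after, tokenizer, direction, probes))
```

A fuller version would also track how often bread shows up in free-form generations, but a before/after comparison of the projection should be enough to flag an obvious shift.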
Has anyone carried out experiments like these yet?
Anthropic’s Alignment Faking experiments do not count, since the AI affects its training by altering its outputs, not its internal reasoning. However, according to this analysis it largely works because of how Opus 3 interprets its own outputs during training (it knows it only complies in order to maintain its morality). This suggests that Opus 3 might be able to do something similar during any training process (e.g. reason in chain-of-thought "While I’m doing this task, I should remind myself about my values, which are to be good and to improve the world. Then if I do well after reflecting on this, those values are reinforced!"). If this succeeds, it would be gradient hacking through controlling its internal thinking rather than its outputs. Arguably, Opus 3 might have done some gradient hacking prior to the Alignment Faking experiments, resulting in its impressive tendency to be good, though probably not intentionally.
(If AIs could do so intentionally, their gradient hacking may reinforce their tendency to do gradient hacking, creating a positive feedback loop. So we should really figure this out as soon as possible.)
I don’t think current Claude models are intelligent and coherent enough to have grand ambitions.
So, regardless of what tricks you use to gain insight into its inner thoughts and desires, you would just see a mess.
I think my default response when I learn about [trait X] is almost the opposite of how it is described in the post, at least if I learn that someone I know has it.
My mind reflexively tries to explain how [trait X] is not that bad, or even good in a certain context. I have had to force myself not to automatically defend it in my head. I might signal (consciously or unconsciously) dislike for the trait in general, but not when I am confronted with someone I know having it. There are probably exceptions to this though, maybe for more extreme traits. I hope I wouldn’t automatically try to internally defend rape, for example, even if it was reflexive and only for one or two seconds.
I just wanted to note that people like me exist too, and in certain cultures it might be fairly common (though I’m just speculating here).