(epistemic status: shitpost)
The Battle of the Pelennor Fields as an analogy for how we’re eventually going to solve alignment:
At first, we’ll be using techniques like RLHF/HFDT, but even the designers of these techniques expected them to be just temporary measures. We’ll spend lots of effort making harder-to-hack reward signals / training environments, and this will work for a time, but it will inevitably fail as the AIs get smarter. We’ll fall back behind our walls of Control (monitoring, boxing, etc.). The AIs will be obviously misaligned (roaming free across the fertile fields, filling their CoTs with thoughts of how they could achieve their goals better if only we weren’t in the way) but we’ll be safe because they can’t do anything seriously bad thanks to our control techniques / walls. As the AIs get smarter they’ll start to overcome our walls one by one, and we’ll fall back to other walls. Also, our techniques will be imposing a tax (somewhat less capable, more expensive AIs compared to if we just gave up and let them take over) which will make us nervous that whatever governance regime we have will collapse and at least one frontier AI company will take off the brakes entirely (morale collapse in Minas Tirith).
Then Rohan / Interpretability will arrive and save the day! Sweep across the fields, make the AIs actually aligned again instead of controlled, by techniques such as “Redirect the search.” Now we’ll have AIs that are really truly honest (in some sense) and virtuous (in some sense). But we aren’t done yet. There are more philosophical problems to deal with, some of which can’t be solved by telling the AIs to go solve them. E.g. what if our honest, virtuous AI has a bad decision theory? It might be persuasive enough to convince us of it (history shows humans can easily fall for bad philosophies even when no one is trying to make them do so...). Also, exactly which virtues do we want our AIs to have? And have we gamed out how all of that might play out, and the ways we could end up like King Midas, having got exactly what we wished for in a way that turned out to be terrible? Also we need to answer the “Aligned to Whom?” question. We’ll be carefully asking our honest super-smart AIs well-defined questions and trying to puzzle through these issues—but the temptation to give up and just let ’er rip will be super strong, and the possibility of the governance regime collapsing will loom large again.
Then Aragorn & co will arrive and save the day again, taking inspiration from philosophy and agent foundations (souped up with lots of AI-generated research, that basically answers all the philosophical questions posed by MIRI and Wei Dai etc.) and basically coming up with some sort of practically-working version of some insanely meta theoretical solution (like CEV, HCH++, etc. but with the problems like the universal prior being malign having been somehow solved in a principled way rather than with some patch that makes us worried there are more problems lurking to be uncovered).
THEN we can finally relax and let the AIs take the wheel.
Nice story, thanks for laying it out like this.
My guess is timelines fall off this hopeful story around:
Things moving too quickly once automated research comes in
Labs go straight for RSI once it becomes available rather than slowing enough
Even if they vaguely try to slow down, advanced systems converge towards unhobbling themselves and getting this past our safeguards
The control walls fall too fast and sometimes silently
Our interpretability tools don’t hold up through the architectural phase transitions, etc.
Agent Foundations being too tricky to hand off to AI research teams to do properly in time
Labs don’t have enough in-house philosophical competence to know what to ask their systems or check their answers
AF is hard to verify in a way that defies easy feedback loops
The current AF field is likely missing crucial considerations, which will kill us if the AI research systems can’t locate needed areas of research that we haven’t noticed
Very few people have deep skills here (maybe a few dozen by my read)
Redirect the search?
You mean retarget the search as per John Wentworth’s proposal?
Oops yeah that’s what I meant.
Thanks for this analogy! This is a fun post.
Can you say more about how you expect interpretability will save the day? If we detect e.g. models scheming through our interpretability techniques, how can we use this to make them more aligned?
Retarget the Search = do surgery on the model to change its circuitry. E.g. you find that it’s scheming. You trace the circuits involved, and realize that instead of intrinsically wanting to be helpful, it intrinsically wanted to be impressive, and at some point it realized this and started lying and plotting. So you do some brain surgery on the circuits so that when it’s planning/choosing/evaluating, it uses the helpfulness concept instead of the impressiveness concept.
That’s just a baseline. There are probably problems with it but probably also ways to further improve on it.
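To make that “brain surgery” step more concrete, here’s a minimal toy sketch (my own illustration, not anything from Wentworth’s actual proposal) of what a linear version of the edit could look like in PyTorch. It assumes interpretability work has already handed us residual-stream directions for the “impressiveness” and “helpfulness” concepts; the model, the layer choice, and the assumption that the goal reads off a single linear direction are all hypothetical.

```python
# Toy sketch of "Retarget the Search" as a linear activation edit (hypothetical example).
# Assumes interpretability has already given us two directions in the residual stream,
# one for the "impressiveness" concept and one for "helpfulness"; the model, layer index,
# and both vectors are placeholders rather than a real recipe.
import torch

def make_retarget_hook(v_impressive: torch.Tensor, v_helpful: torch.Tensor):
    """Return a forward hook that swaps the impressiveness component for helpfulness."""
    v_i = v_impressive / v_impressive.norm()
    v_h = v_helpful / v_helpful.norm()

    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output  # [batch, seq, d_model]
        coeff = (h @ v_i).unsqueeze(-1)   # how strongly each position uses "impressive"
        h = h - coeff * v_i               # ablate the impressiveness direction
        h = h + coeff * v_h               # write the same strength onto "helpful"
        return (h, *output[1:]) if isinstance(output, tuple) else h

    return hook

# Usage (hypothetical, HuggingFace-style model): patch the block where the planning
# circuitry reads its goal, run generation, then call handle.remove() to undo the surgery.
# layer = model.transformer.h[20]
# handle = layer.register_forward_hook(make_retarget_hook(v_impressive, v_helpful))
```

Real circuitry is unlikely to be anywhere near this clean, which is part of why this is only a baseline.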
The basic problems with it were mentioned by Rohin Shah a long time ago:
I like what you call “complicated schemes” over “retarget the search” for two main reasons:
1. They don’t rely on the “mesa-optimizer assumption” that the model is performing retargetable search (which I think will probably be false in the systems we care about).
2. They degrade gracefully with worse interpretability tools, e.g. in debate, even if the debaters can only credibly make claims about whether particular neurons are activated, they can still say stuff like “look my opponent is thinking about synthesizing pathogens, probably it is hoping to execute a treacherous turn”, whereas “Retarget the Search” can’t use this weaker interpretability at all. (Depending on background assumptions you might think this doesn’t reduce x-risk at all; that could also be a crux.)
Nate Soares has written stuff before which touches on point 1 as well:[1]
By default, the first minds humanity makes will be a terrible spaghetti-code mess, with no clearly-factored-out “goal” that the surrounding cognition pursues in a unified way. The mind will be more like a pile of complex, messily interconnected kludges, whose ultimate behavior is sensitive to the particulars of how it reflects and irons out the tensions within itself over time.
Making the AI even have something vaguely nearing a ‘goal slot’ that is stable under various operating pressures (such as reflection) during the course of operation, is an undertaking that requires mastery of cognition in its own right—mastery of a sort that we’re exceedingly unlikely to achieve if we just try to figure out how to build a mind, without filtering for approaches that are more legible and aimable.
I personally agree with the first paragraph quoted above, and (contra Soares) I believe the second is mostly irrelevant for purposes of AI safety.
One area where I’ve changed my mind compared to my previous beliefs is that I now think that, for the systems we care about, capabilities come either from a relatively expensive General Purpose Search or from something like a mesa-optimizer that does the GPS indirectly, and I no longer believe that mostly imitating humans will be the largest source of AI capabilities with high probability (I’d now put 35% on, say, either a non-mesa-optimizer model or a non-GPS model automating away AI research with practical compute and data, compared to my >50% back in 2023).
Part of this comes down to me believing that AI companies will pay the compute inefficiency to have models implement General Purpose Search more, and another part is that I’m much more bearish on pure LLM capabilities than I used to be. In particular, I think the reasons why current AIs aren’t mesa-optimizers (for the model-free version of RL) or don’t have a General Purpose Search (for the model-based version of RL) map on pretty well to the reasons why current AIs are much less performant than benchmarks imply: in-context learning is currently way too weak, and context windows so far have not been enough to allow the LLM to compound stuff over months or years of thinking, or to deal with problems that aren’t tag-teamable.
More generally, I expect more coherence in AI than I used to, due to my view that labs will spend more on compute to gain semi-reliable insights into AIs.
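(To spell out the “retargetable search” framing used throughout this thread: a General Purpose Search is a generic search loop steered by a swappable goal function, so “retargeting” means replacing that goal while leaving the machinery alone. The sketch below is purely illustrative, with made-up names; nobody is claiming real models implement anything this clean.)

```python
# Purely illustrative sketch of a "General Purpose Search": generic best-first search
# whose behavior is determined entirely by a pluggable goal (evaluation) function.
# "Retargeting the search" = swapping `goal` while leaving the search code untouched.
import heapq
from typing import Callable, Iterable, TypeVar

State = TypeVar("State")

def general_purpose_search(
    start: State,
    successors: Callable[[State], Iterable[State]],
    goal: Callable[[State], float],  # higher is better; this is the "goal slot"
    budget: int = 1000,
) -> State:
    """Best-first search over states, steered only by the swappable goal function."""
    counter = 0                                 # tie-breaker so heapq never compares States
    frontier = [(-goal(start), counter, start)]
    best, best_score = start, goal(start)
    for _ in range(budget):
        if not frontier:
            break
        neg_score, _, state = heapq.heappop(frontier)
        if -neg_score > best_score:
            best, best_score = state, -neg_score
        for nxt in successors(state):
            counter += 1
            heapq.heappush(frontier, (-goal(nxt), counter, nxt))
    return best

# "Retargeting" in this toy world is a one-line change:
#   plan_a = general_purpose_search(world, actions, goal=impressiveness)
#   plan_b = general_purpose_search(world, actions, goal=helpfulness)
```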