Contact me at alex@controlai.com
Alex Amadori
control granularity might be higher, and permanence might be higher
That’s pretty much it, just to an extreme extent. Given that ASI could be extremely powerful and that it’s really hard to predict what it could do, I recommend thinking of it as:
- The control granularity is basically infinite (the state can read your mind, persuade you to take arbitrary actions, predict what you’ll do long in advance, etc.)
- The permanence is basically infinite (e.g., until the heat death of the universe)
I keep trying to map this onto Canada’s situation and getting stuck. We’re mid-negotiation with the US on trade and political capital is finite. How does leadership spend it on ASI risks most Canadians aren’t thinking about?
IMO the key here is “most Canadians aren’t thinking about”; this can be changed through awareness campaigns. Most people aren’t aware that AI companies are shooting for ASI, and wouldn’t like it if they knew.
First, timing … If US or Chinese leadership believes ASI is 2-5 years out, they’ll absorb enormous economic costs for a shot at decisive advantage
I think this is reasonable, which is why we include the more extreme measures, including recognition of the right to self-defense. I personally would be surprised if we could throw this together so quickly that none of the conditional deterrence measures ever need to be activated... In darker timelines, I think the more extreme economic measures could slow down the superpower AI programs and give time for middle powers to get more serious with their military deterrence, which IMO has a good chance of being effective.
The US has extensive tools for pressuring middle powers to defect. The proposal assumes coalition members absorb retaliation costs collectively, but the US can apply pressure bilaterally in ways that make early defection attractive. China has its own methods, but I can only speak confidently about America.
This is a good point. We didn’t have time to address this in the first version of the proposal, but there are potentially some mitigations that can be implemented here, like very heavy penalties for defecting from the agreement.
At the end of the day though… you just have to get deep buy-in on the x-risks of ASI among middle powers (including softer ones, like the possibility of a permanent US or China singleton).
More superficial motivations could be easy to break, but I think it would be difficult to tempt a country where the relevant decision-makers think the best-case scenario for ASI is their state being completely dismantled by a US singleton (effectively if not literally).
How middle powers may prevent the development of artificial superintelligence
Modeling the geopolitics of AI development
Very well-written horror story! Props :)
Believe what?
That’s kind of the point… people reject premises all the time. And I don’t mean in the “this premise seems unrealistic” sense; I mean more in the “I refuse to participate in this thought experiment!” sense. This happens even when the point wasn’t to teach a lesson about the shape of reality but about the shape of the reader’s mind and how they respond to the thought experiment.
People just hate inspecting their own minds with a passion. It’s also common for people not to trust you when you suggest a thought experiment if they can’t see where it’s going. It’s very easy to get an anger reaction this way (a literal raised-voice, tense-muscles, elevated-heart-rate kind of reaction).
Three main views on the future of AI
For concreteness, let’s say that the world model requires a trillion (“N”) bits to specify, the intended head costs 10,000 bits, and the instrumental head costs 1,000 bits. If we just applied a simplicity prior directly, we expect to spend N + 1,000 bits to learn the instrumental model rather than N + 10,000 bits to learn the intended model. That’s what we want to avoid.

Not sure if I’m misunderstanding this, but it seems to me that if it takes 10,000 bits to specify the intended head and 1,000 bits to specify the instrumental head, that’s because the world model (which we’re assuming is accurate) considers humans who answer a question with a truthful and correct description of reality much rarer than humans who don’t. Or at least that’s the case when it comes to the training dataset. 10,000 − 1,000 = 9,000, so in this context “much rarer” means 2^{9,000} times rarer.
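To spell out the correspondence behind that last step (my reading, not something stated in the quoted passage): under a simplicity prior, a head that costs n bits has prior probability 2^{-n}, so

```latex
\frac{P(\text{intended head})}{P(\text{instrumental head})}
  = \frac{2^{-10{,}000}}{2^{-1{,}000}}
  = 2^{-(10{,}000 - 1{,}000)}
  = 2^{-9{,}000}.
```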
However,
Now we have two priors over ways to use natural language: we can either sample the intended head at random from the simplicity prior (which we’ve said has probability 2^{-10,000} of giving correct usage), or we can sample the environment dynamics from the simplicity prior and then see how humans answer questions. If those two are equally good priors, then only 2^{-10,000} of the possible humans would have correct usage, so conditioning on agreement saves us 10,000 bits.
So if I understand correctly, the right number of bits saved here would be 9,000 rather than 10,000.
So now we spend (N/2 + 11,000) + (N/2 − 10,000) bits altogether, for a total of N + 1,000.
Unless I made a mistake, this would mean we spend (N/2 + 11,000) + (N/2 − 9,000) bits, for a total of N + 2,000, which is still more expensive than the N + 1,000 it costs to find the instrumental head.
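Putting the accounting side by side (the numbers are the illustrative ones from the quoted passage; the 9,000-bit line is my reading, not the report’s):

```latex
\begin{align*}
\text{Direct simplicity prior:}\quad & N + 1{,}000 \ \text{(instrumental)} \quad \text{vs.} \quad N + 10{,}000 \ \text{(intended)}\\
\text{Conditioning, saving 10,000 bits:}\quad & \left(\tfrac{N}{2} + 11{,}000\right) + \left(\tfrac{N}{2} - 10{,}000\right) = N + 1{,}000\\
\text{Conditioning, saving 9,000 bits:}\quad & \left(\tfrac{N}{2} + 11{,}000\right) + \left(\tfrac{N}{2} - 9{,}000\right) = N + 2{,}000 \;>\; N + 1{,}000
\end{align*}
```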
This sounds to me like a very weak excuse. If you change your mind on something this important (for example, you are now confident alignment is easy despite expecting RSI in a couple of years and doing your best to accelerate it), you had better say so very clearly and very publicly.
This is what a company / leadership / CEO etc. would do if they had a somewhat strong deontology. Just from observing this lack of candor, one should consider Anthropic impossible to coordinate with, which is a) not what you want from a frontier AI company and b) clearly justifies calling it “untrustworthy”, unless we want to be nitpicky with language to a degree that IMO is clearly unnecessary.
In this situation, since they lied in the past, we’re way past the point where we outsiders (including people who work at Anthropic but can’t read the mind of leadership!) can evaluate whether we disagree with Anthropic about critical matters like how difficult alignment is, and try to change their mind (or at least not work for them) if we think they are wrong.
We’re in a situation where Anthropic should be considered adversarial. Even if tomorrow they released a statement that they now think alignment is easy, I can’t take that statement at face value. Maybe they think alignment is easy; maybe they decided it’s impossible to coordinate with anyone and they will lower risk by an epsilon if they are the first to get DSA; maybe they think it’s better if everyone dies than to let China win the race; maybe they just think it’s fun to build ASI and lied to themselves to justify doing it; who knows.
Against an adversarial opponent, we use POSIWID (“the purpose of a system is what it does”). We assume there is a hidden goal and try to guess the simplest hidden goal from actions while ignoring statements. What is the simplest hidden goal we can guess? Anthropic doesn’t want to be regulated and wants to stay in the lead. I can predict that they will keep taking actions that make regulation harder and keep them in the lead.