If governments are situationally aware, then they will be aware of the risk of secret AI projects and act on it, given that such projects are tantamount to declaring war (see filter 1). No one is going to allow the possibility of a small actor running such a project. The equilibrium can only be either actually solving the common knowledge problem (however costly this might be), or existential war. Though I suppose there is a third possibility: if the verification is not good enough, but the actors involved think it is, then there’s room for a secret project to happen.
JennaS
Inventing God while no one’s watching? Are labs likely to keep a decisive advantage sufficiently obscure long enough to take over the world? We saw with Mythos that Anthropic announced it very loudly, in close consultation with the US government. We have an executive order that “voluntarily” requires labs to let the US government review models before release even at this level of capability.
But even if one entity gets the decisive advantage, we’re still just back in the “nightmare singleton” scenario.
Separately, “succeeds at technical alignment” here has to mean that technical alignment is so easy, you can do it under competitive pressures. Do even the “alignment is easy” folk believe that? Or do you mean that the singleton takes over first, then solves alignment? It doesn’t seem like you take over the world and still remain in control with a misaligned ASI...
Would pre-ASI AI be enough to stop every point of espionage, or at least enough of them? Anthropic is still at least a year from SL-4 - not even 5 - and who knows how the other labs are doing. Can it stop espionage that isn’t purely cyber? Do we think other countries don’t have spies in at least one of the major labs? Are AI secrets as well protected as nuclear ones were?
Ultimately, even granting that it could be physically possible that a) you could get a surprise decisive advantage, b) solve technical alignment before then, c) ward off all espionage attempts before then, and d) the resulting singleton acts to humanity’s best interests—is the conjunction of all those a high enough probability to be termed “not quite as hopeless”?
One way to think about it is, AI can be made to have enough short term alignment that we are incentivized to hand over long-term powers to it, before we’ve solved the problem of long term alignment. If those powers are long term enough, that handover would be irreversible.
Something I’ve thought before is that it seems like most people are rolling their own conclusion about the political feasibility of pausing. They think about it for five minutes or less, and then they’re done; they decide whether building a US pause coalition sounds reasonable, or whether China could be cooperated with, mostly on priors. There’s no Rootclaim for the politics of an AI pause. No one org owns a pipeline for doing this research, not even for the narrow version of message testing. There’s just PauseAI, ControlAI, and StopAI doing their own scattered advocacy efforts live, with close to no support. Vastly influential decisions that caused hundreds of millions to flow to technical safety, and at most a few million to advocacy, were made mostly on vibes. No one even did A/B testing or focus groups for pause messaging until last year!
Mind that I don’t mean the technical implementation or effects of a pause, like with what MIRI does or what the 2023 AI Pause Debate did. I mean whether it seems politically achievable, whether it’s even in the Overton window, or if the window can be moved there. We saw with how PEPFAR was founded that sometimes a political miracle can just happen if there’s the will for it.
I think the question of “is a pause politically feasible” should have an adversarial collaboration done on it. That’s one of the best methods of truth-finding I know of, and it’s sad we don’t do more of it.
Prospectus link is broken, fyi.
1) I’ve read a number of EA Forum posts talking about how to do common apps for hiring. One complaint that stuck out to me in one such review was that orgs have a really hard time trusting other orgs’ interview processes.
2) This kind of already exists, I think?
Would you be willing to give some examples of the things you have heard since the 2023 dialogue? I believed in ControlAI enough to apply for a position there, so this feels important to me.
I think it makes sense, but I also think that focusing on evals isn’t going to help without first actually having a governance regime that cares about them, rather than it being eval outfits completely dependent on frontier lab funding, or government orgs that aren’t willing to say “extinction risk” out loud. To get to that point, we need the force of the democratic process on our side, and a real pause. Else, there will be no time or willpower to do careful evals like you describe; eval outfits only get a few days to a few weeks to do their work before a model is released, and there are too many incentives for decision makers to downplay or ignore bad news such as high eval awareness. I think if you were to get what you wish for, we’d have to already be in a competent slowdown regime.
I also think that a noticeable number of bad futures are locked in before evals even happen, during internal deployment or even development. Careful evals do nothing there; you would need to basically solve alignment for those, and no one has a credible plan for that that scales to arbitrary levels of intelligence.
To summarize,
“Evals don’t represent the real world, and are increasingly recognized as evals by AIs. And, capabilities we don’t know about, and can’t predict or account for from first principles, can arise in the cracks left behind by that fact. So unless we can fix this, there’s a real chance our attempts to coordinate and steer policy go astray because we are blind to what’s going on.”
Does that sound about right?
I’m reminded of this essay about focusing by Henrik Karlsson. You can only focus on 1-2 things at a time in life, maybe 3 if you’re exceptional.
There was some related discussion earlier, see the comments here: https://www.lesswrong.com/posts/oyTQpqBtdawjAnfcm/why-i-am-not-too-worried-about-aipocalypse-scott-alexander
The arguments I was thinking of were, we actually wouldn’t have a good way of identifying Dyson spheres, and we’ve only surveilled ~1% of stars anyways.
There’s another option that was ignored by EA. Consider: instead of funding and staffing yet another frontier lab, EA could’ve directed talent and money towards straightforwardly formulating a plan for a pause on AI research and lobbying Congress to do it. Or even split the difference! There was a period in 2023 where it could’ve happened, and most of the people involved wish in their hearts that a working pause could be real. But basically no one involved with the big EA funders was willing to be persistently candid about it with policymakers, or treat developing a pause plan like a serious research effort instead of just dismissing the idea. What we did get was inside-game thinking—Congressional engagement geared towards “building credibility” with hedged, incrementalist proposals. No one actually tried for the direct ask of “stop,” even investigationally. And now we find with ControlAI that it’s startlingly effective.
As for the championed alternatives to pauses, RSPs and commitments—we’ve found out that as soon as they’re inconvenient, they’re gone. RSPv3 was announced the same day Mythos was deployed internally. And ironically, OpenAI is now pouring a hundred million dollars into lobbying for the opposite of a pause!
So—now we are stuck in this death race hoping that all these safety features being built (many of which also boost capabilities, arguably more than safety) will generalize to superintelligence; that we can get the AI we’re trying to align to do our homework of solving alignment for us; that the labs will actually be able to protect the weights, instead of China stealing them and stripping all the safety features off; and so on. There is still no consensus plan to prevent x-risk from AI; the least risk-averse people are taking unilateral action as they see fit.
As for Dario being amenable to x-risk beliefs, he actively distances himself from doomerism every chance he gets, and seems immune to anyone arguing that anything other than racing is the best plan. He hasn’t shared his model of why he thinks his theory of change actually works, or let it be criticized or displayed willingness to update from anyone more pessimistic than him. He’s engaged with David Sacks more than he has with MIRI in the past five years.
The only reason Mythos moved the White House is that Anthropic built the capabilities and proved to the world they existed by using them to find thousands of vulnerabilities. It’s possible the theater around it helped a little bit, but you don’t create a country of top-tier hackers in a datacenter without someone noticing. Little safety was involved in this.
You don’t need to have a take on defense-in-depth vs alignment by design to believe that racing as fast as (super)humanly possible towards recursive self improvement is a horrible idea. But sure, you need to solve the other issues if you solve alignment. It’s just that if you don’t solve alignment first, you’ve already lost.
I don’t think we have anywhere near a guarantee the safety tooling generalizes to superintelligence. Mythos can break out of sandboxes, and future AIs can doubtlessly break out of more. Agentic task harnesses are capabilities, not safety. I don’t trust a racing Anthropic to care enough about signs of sandbagging or deceptive alignment to take meaningful action if it’s too inconvenient for them and all the other evals look fine. And it might not even be up to them, if the government starts making the decisions for them. CoT has been contaminated, and Anthropic has not announced any intent to retrain their models to decontaminate the CoT. Probes are not mature enough to replace it, and I worry the race will cause them to be trained against too, rendering them invalid.
The reason it is a bad idea to do empiricism and trial-and-error on things that can cause x-risk is that you have no guarantee that you will be able to avoid making the error that causes the x-risk, and once you make the error, you can’t take it back. It’s the same reason that experimenting with mirror life or gain-of-function is a terrible idea. Just because AGI research has compelling short term gains, or presents a long term vision of utopia “if only we could solve this one problem,” doesn’t make it any better of an idea.
I understand not liking the idea of having to try to solve alignment without iterating. It sucks! It’s hard! And you wind up sounding like a philosopher lecturing from the ivory tower! But it’s way better than playing Russian Roulette and hoping you don’t go “bang.”
I suppose it depends on what you mean by hope. Is a person who thinks there’s a 50% chance of their project failing, but considers that better than all the alternatives, not hopeful? Or 10%?
What I worry about is, what if the people who make such a seemingly hopeless play are actually right in their worldview, and the people who have higher hopes in their play are wrong? Then a rule that dissuades hopeless people from acting lowers the overall chance of success, and that would be bad.
What if the Filter inevitably happens before you reach astronomical signatures, and doesn’t itself create a signature? Or what if the signature is brief enough, and civilizations uncommon enough, that 150 years is not enough time to catch an example?
If there is a Great Filter, I’d still rather humans be safe and in control when we get there. Just because the future of anything originating from Earth might be bounded doesn’t mean we don’t still want to capture that future, compared to letting AI convert the Earth to paperclips before the Filter hits.
By the way, there’s this cool map someone made with ~100 solutions to the Fermi paradox, in case that helps you consider more possibilities: https://www.lesswrong.com/posts/ifW65CHb3d9NWFBvQ/fermi-paradox-solutions-map
Re: Peter Kuhn’s argument, there were several good replies to that post, such as one showing that the deterministic environment he used was actually still nondeterministic, and thus the nondeterministic answer he got wasn’t necessarily roleplaying; or the one that pointed out memory/items in awareness is not consciousness.
Re: your argument, there’s a few things.
-Just because there’s optimization pressure in general doesn’t mean the optimization lands on the simplest thing a planner would choose. If that were true, LLMs wouldn’t be doing simple addition via a multidimensional helical manifold. Consciousness may be similar; so it can’t be ruled out.
-Even if the optimization does find the simplest solution, it’s still possible consciousness is the optimal implementation for certain tasks, like social cognition or planning, which require self-modelling.
-It’s entirely possible for an LLM to have the kind of cold modeling of characters a human author has, and still also be conscious.
-Internal experience doesn’t have to be a delicate thing; it might be robustly/diffusely implemented through a lot of things, with niches that resist optimization.
Going point by point:
I: It was 21 years between when North Korea signed the nonproliferation treaty and their first nuclear test. And they were very motivated. Seems to me like the treaty actually did something?
The international community was limited to “expressing concern” only because Russia had nukes. For the current war, their interventions have gone far beyond blocking some bank cards. Large amounts of material support for the war don’t seem like “no military intervention” to me. Also, nobody believes the Ukraine war represents an existential threat to humanity; if they did, I think you’d see quite a lot more intervention.
II: Different payoff structures “as most people perceive it”? Most people thinking about AI and national security only see it as an issue of getting a military edge via autonomous weapons and mass surveillance. They do not actually think AI progress could lead to something that can function as a successor species. If they did, I think they would be acting very differently. Getting to the point where people believe that seems like a major precursor to any treaty. Also, even if they are AGI-pilled, decision makers may not have a tendency towards “And then we will take over the world with AI! Or kill everyone trying!” in their thinking, compared to the worst fears of rationalists.
III: A political consensus is different from a scientific consensus. National security types may have a considerably different reaction to possibilities of doom than most ML researchers.
IV: I don’t think TACO is a sufficient reason to argue that we are actually incapable of enforcing treaties. Treaties still exist, and often are still being at least partially enforced. Iran sanctions happened for decades and continue to happen. But I do agree that a treaty between nuclear powers with adversarial incentives to defect is a hard problem.
V: Are you sure it’s actually replaceable by geographically diffuse CPU clusters? Isn’t part of the reason you need datacenters in the first place that training requires low latency? What % of global CPUs would you need to replicate frontier training efforts? If it’s even close to 1%, that seems like a very detectable datacenter to me.
Even if it is replaceable, somehow—OK—should that be a thoughtstopper? Is there no way to gain traction on the problem?
Though I do think limiting algorithmic efficiency improvement by treaty is something we should be concerned about. Are there bottlenecks on that? Do those require large amounts of compute to obtain, or validate? Could a cultural norm against algorithmic improvement be inculcated among scientists? Again, just because something seems hard doesn’t mean it’s impossible.
There’s also the danger that one of the various other AGI efforts outside of LLMs that don’t need large amounts of resources might pay off. That seems pretty scary.
VI: I don’t think a flawed treaty only buys us two years? Given how long it would take China to catch up to us if we stopped, it buys us at least a decade. Unless a different state (Russia, India, UAE?) is willing to buy up all our scientists, buy up a whole lot of compute, and start and sustain their own program for a decade, while no one does anything about it?
This also seems to assume that such programs would be undetectable. But the whole point of the treaty proposal is that massive quantities of compute would be traceable. I suppose if a state-sponsored AGI research program were to look for methods that were much less compute-intensive, that would be pretty scary, yes.
It also depends on what you think the relative outcomes are. What if a treaty increases the chance of s-risk from autocracies by 1%, but decreases the chance of s-risk from unaligned ASI from the current AI race by 5%?
Re: your “better way”—this basically rounds out to “massively increase technical safety research.” Yet is safety research safe? This is a whole separate issue, but most safety research tends to scale capabilities as much as or more than it scales safety. RLHF was safety research, and we got ChatGPT out of it. If we’re giving compute and money to anyone who can put the word “safety” in a grant proposal, I don’t think we’re going to actually get differentially safe safety research.
Meanwhile, policy and advocacy have gotten 10-50x less funding than technical safety research has. If you think the solutions we have available for governance are so unworkable, maybe we just have not tried hard enough?
That’s a really compelling comparison between 2007 finance and the AI industry! I’ve never thought of that before.
Don’t AIs regularly do misaligned things where they Move Fast and Break Things and the thing they broke was actually really important? I can easily see an AI deciding to do things that actually turn out to be e.g. white collar crimes. The only reason we haven’t seen AIs’ irresponsibility have effects on the physical world is because they don’t actually affect it very much.
Supposing this was true, what would Anthropic do with it, exactly? As soon as they get caught trying to do anything shady, some nice men with guns are going to come over and tell them no.