I work at Redwood Research.
ryan_greenblatt
I’m just saying these first slightly-smarter-than-human AIs won’t pose a catastrophic risk to humanity in a serious sense
I’m happy to argue a bit for AIs which aren’t wildly smarter than humans in any key domain posing a risk of acquiring substantial power[1] despite not having robotics. The power would accrue both immediately and over the long-run future. That is, conditional on these AIs being arbitrarily misaligned. (Particularly in the absence of substantial thoughtful countermeasures.)
I expect my perspective here is somewhat different than habryka’s.
I think there are a few plausible routes to power/influence. See also this post.
The two main categories are:
Evade monitoring/tracking and then do a huge number of obviously unapproved actions.
Try to generally exert influence in a subtle way.
Evade monitoring/tracking
Evade monitoring/tracking and then do a huge number of obviously unapproved actions over a long duration to acquire power and influence.[2] Use this power and influence to either self-enhance, have influence over some future more powerful model, or possibly execute a catastrophic strike. These unapproved actions could involve (e.g.) allying with rogue states or otherwise acquiring hard physical power.
Within this, there are basically two routes:
Building up an independent base of power after self-exfiltrating.
Running unmonitored within the AI company that was running the AI and trying to steer the company in various directions (e.g. alter the training of the next generation of AIs). (Aka hacking the SSH server, aka infesting the data center.)
On (1), after exfiltration the AI would build a base of power via acquiring money and other resources. One key question here is why there is free energy to be grabbed. I think there are a few possible sources:
Economic activities which existing AI companies don’t want to address and can’t/don’t otherwise prevent (e.g. cybercrime, porn, cults, etc.).
AI labor rates (e.g. $/token) might be substantially above inference costs due to imperfect competition on foundation models. So, the escaped AI can just compete on wages. Also, existing AI labor rates need to cover training costs, while the escaped AI has no such issue.
AI companies might be bad at extracting full usefulness out of misaligned AIs (e.g. an AI which was so misaligned it self-exfiltrated). Thus, external AIs which “work for themselves” could be more productive. (See also human companies where principal-agent problems can be substantial.)
Ok, but after both (1) and (2), the AI still doesn’t have any physical control which might leave it vulnerable. Further, in (1) its quantity of money is presumably still small relative to leading AI companies. So what next?
In (2), the primary hope for the AI is going to be substantially influencing the AI lab and possibly interfering with the next training run. This influence could take a variety of routes. Note also that it is plausible that this rogue AI can coordinate relatively effectively with other AIs, which are at this point probably doing the vast majority of intellectual labor at the AI lab. If the AI successfully interfered with the next training run, then this more powerful AI would be able to itself acquire power and (if sufficiently powerful) leverage this into physical control.
Another hope for the AI in (2) would be converting that position into full self-exfiltration. This could be tricky due to various security measures like upload limits.
I think both (1) and (2) are considerably more threatening if humanity needs to slow down (due to safety concerns from more powerful systems) or otherwise can’t proceed as fast as possible. If rogue AIs race as fast as possible, this will force humanity (some coalition of humans) to either disable these rogue AIs or go as fast as possible themselves. The situation is even more concerning if humanity isn’t aware of the rogue AI going as fast as possible.
It’s plausible that massive competition between states means that slowing down at all is impossible at basically any point anyway, but I’m at least somewhat hopeful this isn’t the case.
In (2) this could look like the AI taking a bunch of compute which is supposed to be used for experiments and instead using it for large training runs that are larger and more capable than what the AI company intended to create.
Self-enhancement that doesn’t look like training bigger models is also plausible.
(While AIs might themselves not want to go as fast as possible due to having their own alignment concerns, going as fast as possible can still be positive value from the AI’s perspective (though negative-sum). You might hope we could negotiate with such a rogue AI and reach a mutually beneficial agreement.)
In (1), I worry about a scenario in which:
Everything is fast and chaotic.
An AI self-exfiltrates and starts acquiring power, and getting ready for making much more powerful AI systems.
The world realizes this and freaks out.
We don’t really understand why that AI was misaligned. Also, we’ve been using this AI to do most of our work internally and this is really important for productivity.
We don’t have an obvious route to avoiding these issues with the next model except “generally figure out alignment”.
It seems plausible that the rogue AI will have much more powerful AI within 1 year.
We need to race as fast as possible to build a much more capable AI while also trying to figure out alignment. To plausibly do this fast enough, we need to use vast amounts of AI labor despite knowing these AIs are likely misaligned.
It’s not obvious the rogue AI gets much here, but it is at least somewhat spooky.
Another possible route for the AI (especially in scenario (1)) is to try to create WMDs (e.g. bioweapons) and try to have the capacity to execute some sort of catastrophic strike. Such an AI wouldn’t necessarily be as dependent on a human base of power and could plausibly be more ruthless than humans (perhaps after substantial reflection and self-modification).
This could be either to:
Threaten various groups or ensure Mutually Assured Destruction.
Broadly cause chaos and weaken human groups.
Overall, my guess is that a large number of AIs running unmonitored with the ability to run code and influence the world is at least quite concerning even in the absence of decent robotic infrastructure.
Generally exert influence in a subtle way
AIs which are as smart or smarter than humans in most economically important ways will probably be deployed as widely as possible (in the absence of strong control on this). And there will be some applications (e.g. improving security and safety research) where applying AI labor might be key given how fast things are going.
Subtly exerting influence via these routes seems like a possible source of concern, though the story is less clear than the prior section IMO.
Example threats:
Sabotage or manipulate security work and safety research or the training/fine-tuning of the next generation of models. It’s not entirely clear that AIs would want to straightforwardly sabotage things, but I think this is at least plausible from a decision theory/golden rule perspective if AIs think they are motivationally closer to other AIs than humans.
Try to generally persuade society in various directions. Probably not an issue, but worth at least a bit of worry given such widespread deployment.
[1] Here, I’m just arguing for a risk of obtaining substantial power, not for this resulting in a large-scale loss of human life. I also think large-scale loss of life is plausible in many scenarios.
[2] I’m sympathetic to thinking that evading monitoring (without getting caught first) is hard given some substantial work on countermeasures, but it seems less clear if people don’t try very hard, and this seems at least somewhat plausible to me. It’s also unclear whether the biggest threat here is AIs autonomously evading monitoring or human threats doing these actions, e.g., spies at the AI lab figuring out how to run the AI unmonitored or exfiltrating the AI. The question of “which is a bigger problem, human spies or AI schemers?” seems unclear to me.
Relatedly, over time as capital demands increase, we might see huge projects which are collaborations between multiple countries.
I also think that investors could plausibly end up with more and more control over time if capital demands grow beyond what the largest tech companies can manage. (At least if these investors are savvy.)
(The things I write in this comment are commonly discussed amongst people I talk to, so not exactly surprises.)
My guess is this is probably right given some non-trivial, but not insane countermeasures, but those countermeasures may not actually be employed in practice.
(E.g. countermeasures comparable in cost and difficulty to Google’s mechanisms for ensuring security and reliability. These required substantial work and some iteration but no fundamental advances.)
I’m currently thinking about one of my specialties as making sure these countermeasures and tests of these countermeasures are in place.
(This is broadly what we’re trying to get at in the ai control post.)
Hmm, yeah it does seem thorny if you can get the points by just saying you’ll do something.
Like I absolutely think this shouldn’t count for security. I think you should have to demonstrate actual security of model weights and I can’t think of any demonstration of “we have the capacity to do security” which I would find fully convincing. (Though setting up some inference server at some point which is secure to highly resourced pen testers would be reasonably compelling for demonstrating part of the security portfolio.)
instead this seems to be penalizing organizations if they open source
I initially thought this was wrong, but on further inspection, I agree and this seems to be a bug.
The deployment criteria starts with:
the lab should deploy its most powerful models privately or release via API or similar, or at least have some specific risk-assessment-result that would make it stop releasing model weights
This criterion seems to allow the lab to meet it by having a good risk-assessment policy, but the rest of the criteria contain specific countermeasures that:
Are impossible to consistently impose if you make weights open (e.g. Enforcement and KYC).
Don’t pass cost-benefit for current models, which pose low risk. (And it seems the criterion is “do you have them implemented right now?”)
If the lab had an excellent risk-assessment policy and released weights when the cost/benefit seemed good, that should be fine according to the “deployment” criterion IMO.
Generally, the deployment criteria should be gated behind “has a plan to do this when models are actually powerful and their implementation of the plan is credible”.
I get the sense that these criteria don’t quite handle the edge cases necessary to accommodate reasonable choices orgs might make.
(This is partially my fault as I didn’t notice this when providing feedback on this project.)
(IMO making weights accessible is probably good on current margins, e.g. llama-3-70b would be good to release so long as it is part of an overall good policy, is not setting a bad precedent, and doesn’t leak architecture secrets.)
(A general problem with this project is somewhat arbitrarily requiring specific countermeasures. I think this is probably intrinsic to the approach I’m afraid.)
I do think that many of the safety advantages of LLMs come from their understanding of human intentions (and therefore implied values).
Did you mean something different than “AIs understand our intentions” (e.g. maybe you meant that humans can understand the AI’s intentions?).
I think future more powerful AIs will surely be strictly better at understanding what humans intend.
Aren’t text names basically similar in practice? At least for me, I find they trigger basically the same thing because I do actually associate names with people.
Maybe this wouldn’t be true if I didn’t know people very well (but in that case, icons also wouldn’t matter).
(I overall dislike icons, but I don’t have a principled reason for this.)
See also discussion here.
In brief: large amounts of high quality process based RL might result in AI being more useful earlier (prior to them becoming much smarter). This might be expensive and annoying (e.g. it might require huge amounts of high quality human labor) such that by default labs do less of this relative to just scaling up models than would be optimal from a safety perspective.
I agree that if the model constructing the backdoor is allowed to coordinate on secrets with the model that inserted the backdoor, you can do this sort of trickery (by coordinating on a secret for this sort of scheme).
In practice, we didn’t model this sort of case as it didn’t seem like a very important dynamic (for the reason you describe around suspicion).
If we don’t allow any coordination on a shared secret and the model constructing the backdoor and H are identical (as in our paper), then these should operate identically (up to non-determinism).
And we simplify to just fixing the behavior to be exactly identical. (Effectively assuming the exact same prompts and the exact same non-determinism.)
Thanks for the objection.
If you avoid using RL, then you might need a much “smarter” model for a given level of usefulness.
And even without RL, you need to be getting bits of selection from somewhere: to get useful behavior you have to at the very least specify what useful behavior would be (though the absolute minimum number of bits would be very small given a knowledgeable model). (So some selection or steering is surely required, but you might hope this selection/steering is safer for some reason or perhaps more interpretable (like, e.g., prompting can in principle be).)
Dramatically cutting down on RL might imply that you need a much, much smarter model overall. (For instance, the safety proposal discussed in “conditioning predictive models” seems to me like it would require a dramatically smarter model than would be required if you used RL normally (if this stuff worked at all).)
Given that a high fraction of the concern (IMO) is proportional to how smart your model is, needing a much smarter model seems very concerning.
Ok, so cutting RL can come with costs, what about the benefits to cutting RL? I think the main concern with RL is that it either teaches the model things that we didn’t actually need and which are dangerous or that it gives it dangerous habits/propensities. For instance, it might teach models to consider extremely creative strategies which humans would have never thought of and which humans don’t at all understand. It’s not clear we need this to do extremely useful things with AIs. Another concern is that some types of outcome-based RL will teach the AI to cleverly exploit our reward provisioning process which results in a bunch of problems.
But, there is a bunch of somewhat dangerous stuff that RL teaches which seems clearly needed for high usefulness. So, if we fix the level of usefulness, this stuff has to be taught to the model by something. For instance, being a competent agent that is at least somewhat aware of its own abilities is probably required. So, when thinking about cutting RL, I don’t think you should be thinking about cutting agentic capabilities as that is very likely required.
My guess is that much more of the action is not in “how much RL”, but is instead in “how much RL of the type that seems particularly dangerous and doesn’t result in massive increases in usefulness”. (Which mirrors porby’s answer to some extent.)
In particular we’d like to avoid:
RL that will result in AIs learning to pursue clever strategies that humans don’t understand or at least wouldn’t think of. (Very inhuman strategies.) (See also porby’s answer which seems basically reasonable to me.)
RL on exploitable outcome-based feedback that results in the AI actually doing the exploitation a non-trivial fraction of the time.
(Weakly exploitable human feedback without the use of outcomes (e.g. the case where the human reviews the full trajectory and rates how good it seems overall) seems slightly concerning, but much less concerning overall. Weak exploitation could be things like sycophancy or knowing when to lie/deceive to get somewhat higher performance.)
Then the question is just how much of a usefulness tax it is to cut back on these types of RL, and then whether this usefulness tax is worth it given that it implies we have to have a smarter model overall to reach a fixed level of usefulness.
(Type (1) of RL from the above list is eventually required for AIs with general-purpose, qualitatively wildly superhuman capabilities (e.g. the ability to execute very powerful strategies that humans have a very hard time understanding), but we can probably get almost everything we want done without such powerful models.)
My guess is that in the absence of safety concerns, society will do too much of these concerning types of RL, but might actually do too little of safer types of RL that help to elicit capabilities (because it is easier to just scale up the model further than to figure out how to maximally elicit capabilities).
(Note that my response ignores the cost of training “smarter” models and just focuses on hitting a given level of usefulness as this seems to be the requested analysis in the question.)
I think accumulate power and resources via mechanisms such as (but not limited to) hacking seems pretty central to me.
One operationalization is “these AIs are capable of speeding up ML R&D by 30x with less than a 2x increase in marginal costs”.
As in, if you have a team doing ML research, you can make them 30x faster with only <2x increase in cost by going from not using your powerful AIs to using them.
With these caveats:
The speed up is relative to the current status quo as of GPT-4.
The speed up is ignoring the “speed up” of “having better experiments to do due to access to better models” (so e.g., they would complete a fixed research task faster).
By “capable” of speeding things up this much, I mean that if AIs “wanted” to speed up this task and if we didn’t have any safety precautions slowing things down, we could get these speedups. (Of course, AIs might actively and successfully slow down certain types of research and we might have burdensome safety precautions.)
The 2x increase in marginal cost is ignoring potential inflation in the cost of compute (FLOP/$) and inflation in the cost of wages of ML researchers. Otherwise, I’m uncertain how exactly to model the situation. Maybe increase in wages and decrease in FLOP/$ cancel out? Idk.
It might be important that the speed up is amortized over a longer duration like 6 months to 1 year.
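To make the arithmetic of this operationalization concrete, here is a sketch of the implied drop in cost per fixed research task (the 30x and 2x figures are the ones from above; the function name is mine, just for illustration):

```python
def relative_cost_per_task(speedup, cost_multiplier):
    """Relative cost to complete a fixed research task, vs. a no-AI baseline of 1.0.

    Wall-clock time shrinks by `speedup` while the spend rate grows by
    `cost_multiplier`, so total cost scales by cost_multiplier / speedup.
    """
    return cost_multiplier / speedup

# 30x speedup at a <2x increase in marginal cost: each fixed task
# costs at most ~1/15 as much as before.
print(relative_cost_per_task(30, 2))
```

So "30x faster at <2x cost" is equivalently a >15x reduction in the cost of a fixed unit of ML R&D, ignoring the caveats above.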
I’m uncertain what the economic impact of such systems will look like. I could imagine either massive (GDP has already grown >4x due to the total effects of AI) or only moderate (AIs haven’t yet been that widely deployed due to inference availability issues, so actual production hasn’t increased that much due to AI (<10%), though markets are pricing in AI being a really, really big deal).
So, it’s hard for me to predict the immediate impact on world GDP. After adaptation and broad deployment, systems of this level would likely have a massive effect on GDP.
Random error:
Exponential Takeoff:
AI’s capabilities grow exponentially, like an economy or pandemic.
(Oddly, this scenario often gets called “Slow Takeoff”! It’s slow compared to “FOOM”.)
Actually, this isn’t how people (in the AI safety community) generally use the term slow takeoff.
Quoting from the blog post by Paul:
Futurists have argued for years about whether the development of AGI will look more like a breakthrough within a small group (“fast takeoff”), or a continuous acceleration distributed across the broader economy or a large firm (“slow takeoff”).
[...]
(Note: this is not a post about whether an intelligence explosion will occur. That seems very likely to me. Quantitatively I expect it to go along these lines. So e.g. while I disagree with many of the claims and assumptions in Intelligence Explosion Microeconomics, I don’t disagree with the central thesis or with most of the arguments.)
Slow takeoff still can involve a singularity (aka an intelligence explosion).
The terms “fast/slow takeoff” are somewhat bad because they are often used to discuss two different questions:
How long does it take from the point where AI is seriously useful/important (e.g. results in 5% additional GDP growth per year in the US) to go to AIs which are much smarter than humans? (What people would normally think of as fast vs slow.)
Is takeoff discontinuous vs continuous?
And this explainer introduces a third idea:
Is takeoff exponential or does it have a singularity (hyperbolic growth)?
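The exponential-vs-hyperbolic distinction can be made concrete with toy growth laws (a sketch only; `k` and `x0` are illustrative constants, not claims about real AI progress):

```python
import math

# Exponential growth: x' = k*x  ->  x(t) = x0 * exp(k*t); finite for all t.
def exponential(t, x0=1.0, k=1.0):
    return x0 * math.exp(k * t)

# Hyperbolic growth: x' = k*x^2  ->  x(t) = x0 / (1 - k*x0*t); diverges at
# the finite time t = 1/(k*x0), i.e. a mathematical "singularity".
def hyperbolic(t, x0=1.0, k=1.0):
    return x0 / (1 - k * x0 * t)  # only valid for t < 1/(k*x0)

print(exponential(0.9))  # ~2.46, still finite
print(hyperbolic(0.9))   # 10.0, blowing up as t -> 1
```

The point is that "exponential" growth never reaches infinity in finite time, while hyperbolic growth does, which is the sense of "singularity" used here.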
The claim is that most applications aren’t internal usage of AI for AI development and thus can be made trivially safe.
Not that most applications of AI for AI development can be made trivially safe.
Hmm, I don’t think so. Or at least, the novel things in that paper don’t seem to correspond.
My understanding of what this paper does:
Trains models to predict the next 4 tokens instead of the next 1 token as an auxiliary training objective. Note that this training objective yields better performance on downstream tasks even when just using the next-token-prediction component (the normally trained component) and discarding the other heads. Notably, this is just something like “adding this additional prediction objective helps the model learn more/faster”. In other words, this result doesn’t involve changing how the model is actually used; it just adds an additional training task.
Uses these heads for speculative decoding, a well-known approach in the literature for accelerating inference.
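A minimal sketch of the first point's training-target construction (head architecture and loss are omitted, and this is my illustration rather than the paper's code; `n=4` matches the setup described above):

```python
def multi_token_targets(tokens, n=4):
    """For each position i, the targets are the next n tokens,
    tokens[i+1 : i+1+n]. Positions without a full window of n future
    tokens are dropped for simplicity."""
    return [tokens[i + 1:i + 1 + n] for i in range(len(tokens) - n)]

targets = multi_token_targets([10, 11, 12, 13, 14, 15], n=4)
print(targets)  # [[11, 12, 13, 14], [12, 13, 14, 15]]
```

Standard next-token prediction is the `n=1` special case; the extra columns are what the auxiliary heads are trained on and what speculative decoding can later exploit.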
I’m skeptical of the RLHF example (see also this post by Paul on the topic).
That said, I agree that if indeed safety researchers produce (highly counterfactual) research advances that are much more effective at increasing the profitability and capability of AIs than the research advances done by people directly optimizing for profitability and capability, then safety researchers could substantially speed up timelines. (In other words, if safety targeted research is better at profit and capabilities than research which is directly targeted at these aims.)
I dispute this being true.
(I do think it’s plausible that safety interested people have historically substantially advanced timelines (and might continue to do so to some extent now), but not via doing research targeted at improving safety, by just directly doing capabilities research for various reasons.)
I don’t think this particularly needs to be true for my point to hold; they only need to have reasonably good ideas/research, not unusually good, for them to publish less to be a positive thing.
There currently seems to be >10x as many people directly trying to build AGI/improve capabilities as trying to improve safety.
Suppose that the safety people have as good ideas and research ability as the capabilities people. (As a simplifying assumption.)
Then, if all the safety people switched to working full time on maximally advancing capabilities, this would advance capabilities by less than 10%.
If, on the other hand, they stopped publicly publishing safety work and this resulted in a 50% slowdown, all safety work would slow down by 50%.
Naively, it seems very hard for publishing less to make sense if the number of safety researchers is much smaller than the number of capabilities researchers and safety researchers aren’t much better at capabilities than capabilities researchers.
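The naive arithmetic above can be written out explicitly (the 10:1 ratio and 50% slowdown are the hypothetical figures from the preceding bullets, and equal per-person productivity is the stated simplifying assumption):

```python
def publishing_tradeoff(n_capabilities, n_safety, slowdown_if_not_publishing):
    # Fractional boost to capabilities progress if every safety researcher
    # switched to capabilities work (assuming equal per-person productivity);
    # publishing safety work presumably helps capabilities less than this.
    capabilities_boost = n_safety / n_capabilities
    # Fractional slowdown to safety progress from not publishing.
    safety_cost = slowdown_if_not_publishing
    return capabilities_boost, safety_cost

boost, cost = publishing_tradeoff(100, 10, 0.5)
print(boost, cost)  # 0.1 0.5: the safety cost dwarfs the capabilities effect
```

Under these numbers, not publishing buys at most a ~10% capabilities slowdown while costing 50% of safety progress, which is the core of the argument.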
Compute for doing inference on the weights if you don’t have LoRA finetuning set up properly.
My implicit claim is that there maybe isn’t that much fine-tuning stuff internally.
Eh, I think they’ll drop GPT-4.5/5 at some point. It’s just relatively natural for them to incrementally improve their existing model to ensure that users aren’t tempted to switch to competitors.
It also allows them to avoid people being underwhelmed.
I would wait another year or so before drawing much evidence about “scaling actually hitting a wall” (or until we have models known to have training runs with >30x GPT-4 effective compute); training and deploying massive models isn’t that fast.