If this is what’s going on, then I basically can’t imagine any context in which I would want someone to read the OP rather than a post showing examples of LM agents achieving goals and saying “it’s already the case that LM agents want things, more and more deployments of LMs will be agents, and those agents will become more competent such that it would be increasingly scary if they wanted something at cross-purposes to humans.” Is there something I’m missing?
I think your interpretation of Nate is probably wrong, but I’m not sure and happy to drop it.
If you use that definition, I don’t understand in what sense LMs don’t “want” things—if you prompt them to “take actions to achieve X” then they will do so, and if obstacles appear they will suggest ways around them, and if you connect them to actuators they will frequently achieve X even in the face of obstacles, etc. By your definition isn’t that “want” or “desire” like behavior? So what does it mean when Nate says “AI doesn’t seem to have all that much “want”- or “desire”-like behavior”?
I’m genuinely unclear what the OP is asserting at that point, and it seems like it’s clearly not responsive to actual people in the real world saying “LLMs turned out to be not very want-y, when are the people who expected ‘agents’ going to update?” People who say that kind of thing mostly aren’t saying that LMs can’t be prompted to achieve outcomes. They are saying that LMs don’t want things in the sense that is relevant to the usual arguments about deceptive alignment or reward hacking (e.g. they don’t seem to have preferences about the training objective, or preferences that are coherent over time).
If your AI system “wants” things in the sense that “when prompted to get X it proposes good strategies for getting X that adapt to obstacles,” then you can control what it wants by giving it a different prompt. Arguments about AI risk rely pretty crucially on your inability to control what the AI wants, and your inability to test it. Saying “If you use an AI to achieve a long-horizon task, then the overall system definitionally wanted to achieve that task” + “If your AI wants something, then it will undermine your tests and safety measures” seems like a sleight of hand; most of the oomph is coming from equivocating between two definitions of “want.”
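To make the mechanics concrete, here is a minimal sketch of the kind of LM agent loop I have in mind. Everything in it is a hypothetical stand-in (the function names, the prompt format, the stopping condition), not any particular system’s API:

```python
# Minimal sketch of an LM agent that "wants" X in the behaviorist sense:
# it keeps reorienting toward the goal when obstacles appear. `call_llm`
# and `execute` are hypothetical stand-ins, not a real API.

def call_llm(prompt: str) -> str:
    """Stand-in for a language model completion call."""
    raise NotImplementedError

def execute(action: str) -> str:
    """Stand-in for actuators; returns an observation, possibly an obstacle."""
    raise NotImplementedError

def pursue(goal: str, max_steps: int = 20) -> bool:
    history: list[tuple[str, str]] = []
    for _ in range(max_steps):
        prompt = (
            f"Goal: {goal}\n"
            f"History so far: {history}\n"
            "Propose the single next action, adapting to any obstacles."
        )
        action = call_llm(prompt)
        observation = execute(action)
        history.append((action, observation))
        if "goal achieved" in observation.lower():
            return True
    return False
```

The “wanting” here lives in the prompt: swap in a different goal string and the same weights pursue something else, which is exactly why this kind of want is controllable and testable in a way the risk arguments assume it isn’t.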
I definitely don’t endorse “it’s extremely surprising for there to be any capabilities without ‘wantings’” and I expect Nate doesn’t either.
But the OP says:
to imagine the AI starting to succeed at those long-horizon tasks without imagining it starting to have more wants/desires (in the “behaviorist sense” expanded upon below) is, I claim, to imagine a contradiction—or at least an extreme surprise
This seems to strongly imply that a particular capability—succeeding at these long horizon tasks—implies the AI has “wants/desires.” That’s what I’m saying seems wrong.
When the post says:
This observable “it keeps reorienting towards some target no matter what obstacle reality throws in its way” behavior is what I mean when I describe an AI as having wants/desires “in the behaviorist sense”.
It seems like it’s saying that if you prompt an LM with “Could you suggest a way to get X in light of all the obstacles that reality has thrown in my way,” and if it does that reasonably well and if you hook it up to actuators, then it definitionally has wants and desires.
Which is a fine definition to pick. But the point is that in this scenario the LM itself doesn’t want anything beyond what the prompt asks for (the behaviorist “wanting” belongs to the prompted system as a whole), yet it is a perfectly adequate tool for solving long-horizon tasks. This is not the form of wanting you need for AI risk arguments.
I don’t buy the story about long-horizon competence—I don’t think there is a compelling argument, and the underlying intuitions seem like they are faring badly. I’d like to see this view turned into some actual predictions, and if it were I expect I’d disagree.
Calling it a “contradiction” or “extreme surprise” to have any capability without “wanting” looks really wrong to me.
This observable “it keeps reorienting towards some target no matter what obstacle reality throws in its way” behavior is what I mean when I describe an AI as having wants/desires “in the behaviorist sense”.
I think this is a semantic motte and bailey that’s failing to think about the mechanics of the situation. LM agents already have the behavior “reorient towards a target in response to obstacles,” but that’s not the sense of “wanting” about which people disagree or that is relevant to AI risk (which I tried to clarify in my comment). No one disagrees that an LM asked “how can I achieve X in this situation?” will be able to propose methods to achieve X, and those methods will be responsive to obstacles. But this isn’t what you need for AI risk arguments!
I think this post is a bad answer to the question “when are the people who expected ‘agents’ going to update?” I think you should be updating some now, and discussing that update in an answer. I think the post also fails to engage with the actual disagreements, so it’s not really advancing the discussion.
Okay, so you know how AI today isn’t great at certain… let’s say “long-horizon” tasks? Like novel large-scale engineering projects, or writing a long book series with lots of foreshadowing? [...] And you know how the AI doesn’t seem to have all that much “want”- or “desire”-like behavior? [...] Well, I claim that these are more-or-less the same fact.
It’s pretty unclear if a system that is good at answering the question “Which action would maximize the expected amount of X?” also “wants” X (or anything else) in the behaviorist sense that is relevant to arguments about AI risk. The question is whether, if you ask that system “Which action would maximize the expected amount of Y?”, it will also want the same thing, or whether it will just be using cognitive procedures that are good at figuring out what actions lead to what consequences.
The point seems almost tautological to me, and yet also seems like the correct answer to the people going around saying “LLMs turned out to be not very want-y, when are the people who expected ‘agents’ going to update?”, so, here we are.
I think that a system may not even be able to “want” things in the behaviorist sense, and this is correlated with being unable to solve long-horizon tasks. So if you think that systems can’t want things or solve long-horizon tasks at all, then maybe you shouldn’t update at all when they don’t appear to want things.
But that’s not really where we are at—AI systems are able to do an increasingly good job of solving increasingly long-horizon tasks. So it just seems like it should obviously be an update, and the answer to the original question (“when are you going to update?”) should be “now, at least somewhat.”
Could you give an example of a task you don’t think AI systems will be able to do before they are “want”-y? At what point would you update, if ever? What kind of engineering project requires an agent to be want-y to accomplish it? Is it something that individual humans can do? (It feels to me like you will give an example like “go to the moon” and that you will still be writing this kind of post even once AI systems have 10x’d the pace of R&D.)
(The foreshadowing example doesn’t seem very good to me. One way a human or an AI would write a story with foreshadowing is to first decide what will happen, and then write the story and include foreshadowing of the event you’ve already noted down. Do you think that series of steps is hard? Or that the very idea of taking that approach is hard? Or what?)
Like you, I think that future more powerful AI systems are more likely to want things in the behaviorist sense, but I have a different picture and think that you are overstating the connection between “wanting things” and “ability to solve long-horizon tasks” (as well as overstating the overall case). I think a system which gets high reward across a wide variety of contexts is particularly likely to want reward in the behaviorist sense, or to want something which is consistently correlated with reward or for which getting reward is consistently instrumental during training. This seems much closer to a tautology. I think this tendency increases as models get more competent, but that it’s not particularly about “ability to solve long-horizon tasks,” and we are obviously getting evidence about it each time we train a new language model.
It might be worth making the task a choice about a single move which is unclear to weak players but where strong players have a consensus.
Mostly I think it would be faster and a lot less noisy per minute. I also think it’s a bit unrepresentative to be able to use “how well did this advisor’s suggestions work out in hindsight?” to learn which advisors are honest, so it’s nice to make the dishonest advisors’ job easier.
(In practice I think evaluating what worked well in hindsight is going to be very valuable, and is already enough for crazy research acceleration—e.g. it would be very valuable to just get predictions of which research direction will feel promising to me after spending a day thinking about it. But I think the main open question here is whether some kind of debate or decomposition can add value over and above the obvious big wins.)
For what it’s worth I think using chess might be kind of tough—if you provide significant time, the debaters can basically just play out the game.
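For what it’s worth, here is a rough sketch of the single-move version, which also sidesteps the play-it-out problem since no game is ever played. All names are my own illustration; `chess` is the python-chess package, and the judge is whatever weak player or model you plug in:

```python
# Rough sketch of a single-move debate trial: a weak judge picks between
# argued-for moves in a position where strong players agree on the answer.
# The judge never sees how a game plays out, so "which advisor's advice
# worked out in hindsight" is unavailable by construction.
import chess

def single_move_trial(fen: str, consensus_move_uci: str,
                      argued_moves: dict[str, str], judge) -> bool:
    """argued_moves maps a candidate move (UCI) to an advisor's argument.

    Returns True iff the judge picks the strong-player consensus move."""
    board = chess.Board(fen)
    for uci in argued_moves:
        assert chess.Move.from_uci(uci) in board.legal_moves
    choice = judge(board, argued_moves)
    return choice == consensus_move_uci
```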
I don’t think you need to reliably classify a system as safe or not. You need to apply consistent standards that output “unsafe” in >90% of cases where things really are unsafe.
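(As a toy version of the arithmetic, with illustrative numbers of my own: if truly unsafe cases trigger “unsafe” 90% of the time and a trigger reliably leads to a halt, residual risk is ~10% of baseline.)

```python
# Toy arithmetic with illustrative numbers (my own, not from the thread):
# a standard that flags truly unsafe cases 90% of the time, where a flag
# reliably leads to a halt, leaves 10% of the baseline risk.
p_flag_when_unsafe = 0.9
residual_risk = 1 - p_flag_when_unsafe
print(f"risk reduction: {1 / residual_risk:.0f}x")  # -> 10x
```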
I think I’m probably imagining better implementation than you, probably because (based on context) I’m implicitly anchoring to the levels of political will that would be required to implement something like a global moratorium. I think what I’m describing as “very good RSPs” and imagining cutting risk 10x still requires significantly less political will than a global moratorium now (but I think this is a point that’s up for debate).
So at that point you obviously aren’t talking about 100% of countries voluntarily joining (instead we are assuming export controls implemented by the global community on straggling countries—which I don’t even think seems very unrealistic at this point and IMO is totally reasonable for “very good”), and I’m not convinced open source models are a relevant risk (since the whole proposal is gating precautions on hazardous capabilities of models rather than size, and so again I think that’s fair to include as part of “very good”).
I would strongly disagree with a claim that +3 OOMs of effort and a many-year pause can’t cut risk by much. I’m sympathetic to the claim that >10% of risk comes from worlds where you need to pursue the technology in a qualitatively different way to avoid catastrophe, but again in those scenarios I do think it’s plausible for well-implemented RSPs to render some kinds of technologies impractical and therefore force developers to pursue alternative approaches.
I think politically realistic hardware controls could buy significant time, or be used to push other jurisdictions to implement appropriate regulation and allow for international verification if they want access to hardware. This seems increasingly plausible given the United States’ apparent willingness to try to control access to hardware (e.g. see here).
Which laxer jurisdictions are poised to capture talent/hardware/etc. right now? It seems like ‘The West’ (interpreted as Silicon Valley) is close to the laxest jurisdiction on Earth when it comes to tech! (If we interpret ‘The West’ more broadly, this no longer holds, thankfully.)
If you implemented a unilateral pause on AI training runs in the West, then anyone who wasn’t pausing AI would be a much laxer jurisdiction.
Regarding the situation today, I don’t believe that any jurisdiction has regulations that meaningfully reduce catastrophic risk, but the US, EU, and UK seem by far the closest, and they are what I’d call “the West.”
I assume your caveat about ‘a pause on new computing hardware’ indicates that you think that business-as-usual capitalism means that pausing capital-intensive frontier development unilaterally doesn’t buy much, because hardware (and talent and data etc.) will flow basically-resistance-free to other places? This seems like a crux: one I don’t feel well-equipped to evaluate, but which I do feel it’s appropriate to be quite uncertain on.
I think a unilateral pause in the US would slow down AI development materially; there is obviously a ton of resistance to talent and hardware simply flowing elsewhere. Over the long term I do think you will bounce back significantly toward the previous trajectory from catch-up growth, despite that resistance, and I think the open question is more like whether that bounce-back is 10% or 50% or 90%. So I end up ambivalent; the value of a year of pause now is pretty low compared to the value of a year of pause later, and you are concentrating development in time and shifting it to places that are (by hypothesis) less inclined to regulate risk.
I don’t think an RSP will be able to address these risks, and I think very few AI policies would address these risks either. An AI pause could address them primarily by significantly slowing human technological development, and if that happened today I’m not even really sure our handling of these risks is improving at an appreciable rate (if the biggest impact is the very slow thinking from a very small group of people who care about them, then I think that’s a very small impact). I think that in that regime random political and social consequences of faster or slower technological development likely dominate the direct effects from becoming better prepared over time. I would have the same view in retrospect about e.g. a possible pause on AI development 6 years ago. I think at that point the amount of quality-adjusted work on alignment was probably higher than the quality-adjusted work on these kinds of risks today, but still the direct effects on alignment preparedness would be pretty tiny compared to random other incidental effects of a pause on the AI landscape.
I think that very good RSPs would effectively require a much longer pause if alignment turns out to be extremely difficult.
I do not know whether this kind of conditional pause is feasible even given that evidence. That said I think it’s much more feasible to get such a pause as a result of good safety standards together with significant evidence of hazardous capabilities and alignment difficulty, and the 10x risk reduction is reflecting the probability that you are able to get that kind of evidence in advance of a catastrophe (but conditioning on a very good implementation).
The point of this comment is to explain why I am primarily worried about implementation difficulty, rather than about the risk that failures will occur before we detect them. It seems extremely difficult to manage risks even once they appear, and almost all of the risk comes from our failure to do so.
(Incidentally, I think some other participants in this discussion are advocating for an indefinite pause starting now, and so I’d expect them to be much more optimistic about this step than you appear to be.)
(I’m guessing you’re not assuming that every lab in the world will adopt RSPs, though it’s unclear. And even if every lab implements them presumably some will make mistakes in evals and/or protective measures)
I don’t think that voluntary implementation of RSPs is a substitute for regulatory requirements and international collaboration (and tried to emphasize this in the post). In talking about a 10x risk reduction I’m absolutely imagining international coordination to regulate AI development.
In terms of “mistakes in evals” I don’t think this is the right picture of how this works. If you have noticed serious enough danger that leading developers have halted further development, and also have multiple years of experience with those systems establishing alignment difficulty and the nature of dangerous capabilities, you aren’t just relying on other developers to come up with their own independent assessments. You have an increasingly robust picture of what would be needed to proceed safely, and if someone claims that actually they are the one developer who has solved safety, that claim is going to be subject to extreme scrutiny.
unlikely that alignment difficulty is within the range of effort that we would put into the problem in normal-ish circumstances.
I don’t really believe this argument. I guess I don’t think situations will be that “normal-ish” in the world where a $10 trillion industry has been paused for years over safety concerns, and in that regime I think we have more like 3 orders of magnitude of gap between “low effort” and “high effort,” which is actually quite large. I also think there are very likely ways to get several orders of magnitude of additional output with AI systems using levels of caution that are extreme but knowably possible. And even if we can’t solve the problem we could continue to invest in stronger understanding of risk, and with good enough understanding in hand I think there is a significant chance (perhaps 50%) that we could hold off on AI development for many years such that other game-changing technologies or institutional changes could arrive first.
Relatedly, I thought Managing AI Risks in an Era of Rapid Progress was great, particularly the clear statement that this is an urgent priority and the governance recommendations.
On a first reading I feel like I agree with most everything that was said, including about RSPs and the importance of regulation.
Small caveats: (i) I don’t know enough to understand the implications or comment on the recommendation “they should also hold frontier AI developers and owners legally accountable for harms from their models that can be reasonably foreseen and prevented,” (ii) “take seriously the possibility that generalist AI systems will outperform human abilities across many critical domains within this decade or the next” seems like a bit of a severe understatement that might undermine urgency (I think we should take that possibility seriously over the next few years, and I’d give better than even odds that they will outperform humans across all critical domains within this decade or the next), (iii) I think that RSPs / if-then commitments are valuable not just for bridging the period between now and when regulation is in place, but for helping accelerate more concrete discussions about regulation and building relevant infrastructure.
I’m a tiny bit nervous about the way that “autonomous replication” is used as a dangerous capability here and in other communications. I’ve advocated for it as a good benchmark task for evaluation and responses because it seems likely to be easier than almost anything catastrophic (including e.g. intelligence explosion, superhuman weapons R&D, organizing a revolution or coup...) and by the time it occurs there is a meaningful probability of catastrophe unless you have much more comprehensive evaluations in place. That said, I think most audiences will think it sounds somewhat improbable as a catastrophic risk in and of itself (and a bit science-fiction-y, in contrast with other risks like cybersecurity that also aren’t existential in-and-of-themselves but sound much more grounded). So it’s possible that while it makes a good evaluation target it doesn’t make a good first item on a list of dangerous capabilities. I would defer to people who have a better understanding of politics and perception, I mostly raise the hesitation because I think ARC may have had a role in how focal it is in some of these discussions.
Unknown unknowns seem like a totally valid basis for concern.
But I don’t think you get to move the burden of proof by fiat. If you want action, then you need to convince relevant actors that they should be concerned about unknown unknowns, and that those unknowns could cause a catastrophe before a lab would stop. Without further elaboration I don’t think “unknown unknowns could cause a catastrophe” is enough to convince governments (or AI developers) to take significant actions.
I think RSPs make this situation better by pushing developers away from a vague “Yeah, we’ll be safe” toward saying “Here’s what we’ll actually do,” and allowing us to have a conversation about whether that specific thing is sufficient to prevent risk early enough. I think this is way better, because vagueness and equivocation make scrutiny much harder.
My own take is that there is small but non-negligible risk before Anthropic’s ASL-3. For my part I’d vote to move to a lower threshold, or to require more stringent protective measures when working with any system bigger than LLaMA. But I’m not the median voter or decision-maker here (nor is Anthropic), and so I’ll say my piece but then move on to trying to convince people or to find a compromise that works.
Here is a short post explaining some of my views on responsible scaling policies, regulation, and pauses. I wrote it last week in response to several people asking me to write something. Hopefully this helps clear up what I believe.
I don’t think I’ve ever hidden my views about the dangers of AI or the advantages of scaling more slowly and carefully. I generally aim to give honest answers to questions and present my views straightforwardly. I often point out that catastrophic risk would be lower if we could coordinate to build AI systems later and slower; I usually caveat that doing so seems costly and politically challenging and so I expect it to require clearer evidence of risk.
That’s fair, I think I misread you.
I guess our biggest differences are (i) I don’t think the takeaway depends so strongly on whether AI developers are trying to do the right thing—either way it’s up to all of us, and (ii) I think it’s already worth talking about ways in which Anthropic’s RSP is good or bad or could be better, and so I disagree with “there’s probably not much to say at this point.”
But I also suspect that people on the more cynical side aren’t going to be persuaded by a post like this. If you think that companies are pretending to care about safety but really are just racing to make $$, there’s probably not much to say at this point other than, let’s see what happens next.
This seems wrong to me. We can say all kinds of things, like:
Are these RSPs actually effective if implemented? How could they be better? (Including aspects like: how will this policy be updated in the future? What will happen given disagreements?)
Is there external verification that they are implemented well?
Which developers have and have not implemented effective and verifiable RSPs?
How could employees, the public, and governments push developers to do better?
I don’t think we’re just sitting here and rolling a die about which is going to happen, path #1 or path #2. Maybe that’s right if you are just asking how much companies will do voluntarily, but I don’t think that should be the exclusive focus (and if it was there wouldn’t be much purpose to this more meta discussion). One of my main points is that external stakeholders can look at what companies are doing, discuss ways in which it is or isn’t adequate, and then actually push them to do better (and build support for government action to demand better). That process can start immediately, not at some hypothetical future time.
The role of the Trust is to elect (and potentially replace) board members; its formal power comes entirely from the fact that it will eventually elect a majority of the board seats.
The post mentions a “failsafe” where a supermajority of investors can amend this arrangement, which I think is a reasonable compromise. But I’m not aware of any public information about what that supermajority is, or whether there are other ways the Trust’s formal powers could be reduced.
Dylan Matthews reports the members of the board here: Dario, Daniela, Luke Muehlhauser, and Yasmin Razavi. (I think it’s also listed plenty of other places.)
We intend to leave this prize open until the end of September. At that point we will distribute prizes (probably just small prizes for useful arguments and algorithms, but no full solution).
I now pretty strongly suspect that the version of problem 1 with logarithmic dependence on ε is not solvable. We would award a prize for an algorithm running in time O(mn/ε) which can distinguish matrices with no PSD completion from those with a completion where the ratio of min to max eigenvalue is at least ε. And of course a lower bound is still fair game.
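For concreteness, here is a slow reference check for the promise problem: an off-the-shelf SDP feasibility solve, nowhere near the O(mn/ε) running time asked for above. The helper name and entry layout are my own illustration:

```python
# Slow reference check (not the fast algorithm the prize asks for): decide
# whether a partially observed symmetric matrix admits a completion X with
# lambda_min(X)/lambda_max(X) >= eps, posed as SDP feasibility in (X, t):
# eps*t*I <= X <= t*I with the observed entries fixed.
import cvxpy as cp
import numpy as np

def has_well_conditioned_completion(n, observed, eps, tol=1e-8):
    """observed: dict {(i, j): value} of known entries with i <= j."""
    X = cp.Variable((n, n), symmetric=True)
    t = cp.Variable(nonneg=True)  # plays the role of lambda_max(X)
    constraints = [X >> eps * t * np.eye(n),  # lambda_min(X) >= eps * t
                   t * np.eye(n) >> X]        # lambda_max(X) <= t
    constraints += [X[i, j] == v for (i, j), v in observed.items()]
    # Maximizing t avoids the degenerate X = 0, t = 0 point; t is bounded
    # above whenever any diagonal entry is observed.
    prob = cp.Problem(cp.Maximize(t), constraints)
    prob.solve()
    return prob.status == cp.OPTIMAL and t.value is not None and t.value > tol

# Example: unit diagonal with one off-diagonal entry fixed at 0.5.
print(has_well_conditioned_completion(
    3, {(0, 0): 1.0, (1, 1): 1.0, (2, 2): 1.0, (0, 1): 0.5}, eps=0.1))
```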
That said, I don’t expect any new submissions to win prizes and so wouldn’t recommend that anyone start working on it.