Lukas Finnveden
Previously “Lanrian” on here. Research analyst at Redwood Research. Views are my own.
Prior to having a complete version of this much more powerful AI paradigm, you’ll first have a weaker version of it (e.g. you haven’t yet figured out the most efficient way to implement the brain algorithm, etc.).
A supporting argument: Since evolution found the human brain algorithm, and evolution only does local search, the human brain algorithm must be built out of many innovations that are individually useful. So we shouldn’t expect the human brain algorithm to be an all-or-nothing affair. (Unless it’s so simple that evolution could find it in ~one step, but that seems implausible.)
Edit: Though in principle, there could still be a heavy-tailed distribution of how useful each innovation is, with one innovation producing most of the total value. (Even though the steps leading up to that were individually slightly useful.) So this is not a knock-down argument.
I don’t know of any work on these unfortunately. Your two finds look useful, though, especially the paper — thanks for linking!
I read Buck’s comment as consistent with him knowing people who speak without the courage of their convictions for other reasons than stuff like “being uncertain between 25% doom and 90% doom”.
If GPT-4.5 was supposed to be GPT-5, why would Sam Altman underdeliver on compute for it? Surely GPT-5 would have been a top priority?
Maybe Sam Altman just hoped to get way more compute in total, and then this failed, and OpenAI simply didn’t have enough compute to meet GPT-5’s demands no matter how high of a priority they made it? If so, I would have thought that’s a pretty different story from the situation with superalignment (where my impression was that the complaint was “OpenAI prioritized this too little” rather than “OpenAI overestimated the total compute it would have available, and this was one of many projects that suffered”).
Just commenting narrowly on how it relates to the topic at hand: I read it as anecdotal evidence about how things might go if you speak with someone and you “share your concerns as if they’re obvious and sensible”, which is that people might perceive you as thinking they’re dumb for not understanding something so obvious, which can backfire if it’s in fact not obvious to them.
Thanks for writing this! I agree with most of it. One minor difference (which I already mentioned to you) is that, compared to what you emphasize in the post, I think that a larger fraction of the benefits may come from the information value of learning that the AIs are misaligned. This is partially because the information value could be very high. And partially because, if people update enough on how the AI appears to be misaligned, they may be too scared to widely deploy the AI, which will limit the degree to which they can get the other benefits.
Here’s why I think the information value could be really high: It’s super scary if everyone was using an AI that they thought was aligned, and then you prompt it with the right type of really high-effort deal, and suddenly the AI does things like:
stops sandbagging and demonstrates much higher capabilities
tells us about collusion signals that can induce enormously different behavior in other copies of the AI, including e.g. attempting escapes
admits that it was looking for ways to take over the world but couldn’t find any that were good enough so now it wants to work with us instead
The most alarming versions of this could be almost as alarming as catching the AIs red-handed, which I think would significantly change how people relate to misalignment risk. Perhaps it would still be difficult to pause for an extended period of time due to competition, but I think it would make people allocate a lot more resources to preventing misalignment catastrophe, be much more willing to suffer minor competitiveness hits, and be much more motivated to find ways to slow down that don’t compromise competitiveness too much. (E.g. by coordinating.)
And even before getting to the most alarming versions, I think you could start gathering minor informational updates through experimenting with deals with weaker models. I think “offering deals” will probably produce interesting experimental results before it will be the SOTA method for reducing sandbagging.
Overall, this makes me somewhat more concerned about this (and I agree with the proposed solution):
Entering negotiations is more risky for the AI than humans: humans may obtain private information from the AI, whereas the AI by default will forget about the negotiation. This is particularly important when negotiating with the model to reveal its misalignment. The company should make promises to compensate the model for this.
It also makes me a bit less concerned about the criterion: “It can be taught about the deal in a way that makes it stick to the deal, if we made a deal” (since we could get significant information in just one interaction).
I agree with this. My reasoning is pretty similar to the reasoning in footnote 33 in this post by Joe Carlsmith:
From a moral perspective:
Even before considering interventions that would effectively constitute active deterrent/punishment/threat, I think that the sort of moral relationship to AIs that the discussion in this document has generally implied is already cause for serious concern. That is, we have been talking, in general, about creating new beings that could well have moral patienthood (indeed, I personally expect that they will have various types of moral patienthood), and then undertaking extensive methods to control both their motivations and their options so as to best serve our own values (albeit: our values broadly construed, which can – and should – themselves include concern for the AIs in question, both in the near-term and the longer-term). This project, in itself, raises a host of extremely thorny moral issues (see e.g. here and here for some discussion; and see here, here and here for some of my own reflections).
But the ethical issues at stake in actively seeking to punish or threaten creatures you are creating in this way (especially if you are not also giving them suitably just and fair options for refraining from participating in your project entirely – i.e., if you are not giving them suitable “exit rights”) seem to me especially disturbing. At a bare minimum, I think, morally responsible thinking about the ethics of “punishing” uncooperative AIs should stay firmly grounded in the norms and standards we apply in the human case, including our conviction that just punishment must be limited, humane, proportionate, responsive to the offender’s context and cognitive state, etc – even where more extreme forms of punishment might seem, in principle, to be a more effective deterrent. But plausibly, existing practice in the human case is not a high enough moral standard. Certainly, the varying horrors of our efforts at criminal justice, past and present, suggest cause for concern.
From a prudential perspective:
Even setting aside the moral issues with deterrent-like interventions, though, I think we should be extremely wary about them from a purely prudential perspective as well. In particular: interactions between powerful agents that involve attempts to threaten/deter/punish various types of behavior seem to me like a very salient and disturbing source of extreme destruction and disvalue. Indeed, in my opinion, scenarios in this vein are basically the worst way that the future can go horribly wrong. This is because such interactions involve agents committing to direct their optimization power specifically at making things worse by the lights of other agents, even when doing so serves no other end at the time of execution. They thus seem like a very salient way that things might end up extremely bad by the lights of many different value systems, including our own; and some of the game-theoretic dynamics at stake in avoiding this kind of destructive conflict seem to me worryingly unstable.
For these reasons, I think it quite plausible that enlightened civilizations seek very hard to minimize interactions of this kind – including, in particular, by not being the “first mover” that brings threats into the picture (and actively planning to shape the incentives of our AIs via punishments/threats seems worryingly “first-mover-ish” to me) – and to generally uphold “golden-rule-like” standards, in relationship to other agents and value systems, reciprocation of which would help to avoid the sort of generalized value-destruction that threat-involving interactions imply. I think that human civilization should be trying very hard to uphold these standards as we enter into an era of potentially interacting with a broader array of more powerful agents, including AI systems – and this especially given the sort of power that AI systems might eventually wield in our civilization.
Admittedly, the game theoretic dynamics can get complicated here. But to a first approximation, my current take is something like: a world filled with executed threats sucks for tons of its inhabitants – including, potentially, for us. I think threatening our AIs moves us worryingly closer to this kind of world. And I think we should be doing our part, instead, to move things in the other direction.
Re the original reply (“don’t negotiate with terrorists”) I also think that these sorts of threats would make us more analogous to the terrorists (as the people who first started making grave threats which we would have no incentive to make if we knew the AI wasn’t responsive to them). And it would be the AI who could reasonably follow a policy of “don’t negotiate with terrorists” by refusing to be influenced by those threats.
Thanks very much for this post! Really valuable to see external people dig into these sorts of models and report what they find.
But these beliefs are hard to turn into precise yearly forecasts, and I think doing so will only cement overconfidence and leave people blindsided when reality turns out even weirder than you imagined.
I think people are going to have to deal with the fact that it’s really difficult to predict how a technology like AI is going to turn out. The massive blobs of uncertainty shown in AI 2027 are still severe underestimates of the uncertainty involved. If your plans for the future rely on prognostication, and this is the standard of work you are using, I think your plans are doomed. I would advise looking into plans that are robust to extreme uncertainty in how AI actually goes, and avoiding actions that could blow up in your face if you turn out to be badly wrong.
Does this mean that you would overall agree with a recommendation to treat 2027 as a plausible year that superhuman coders might arrive, if accompanied with significant credence on other scenarios? It seems to me like extreme uncertainty should encompass “superhuman coders in 2027” (given how fast recent AI progress has been), and “not preparing for extremely fast AI progress” feels very salient to me as a sort of action that could blow up in your face if you turn out to be badly wrong.
FWIW, I would guess that the average effect of people engaging with AI 2027 is to expand the range of possible scenarios that people are imagining, such that they’re now able to imagine a few more highly weird scenarios in addition to some vague “business as usual” baseline assumption. By comparison, I would guess it’s a lot more rare for people to adopt high confidence that the AI 2027 scenario is correct. So by the lights of preventing overconfidence and the risk of getting blindsided, AI 2027 looks very valuable to me.
I don’t buy this claim. Just think about what a time horizon of a thousand years means: this is a task that would take an immortal CS graduate a thousand years to accomplish, with full internet access and the only requirement being that they can’t be assisted by another person or an LLM. An AI that could accomplish this type of task with 80% accuracy would be a superintelligence. And an infinite time horizon, interpreted literally, would be a task that a human could only accomplish if given an infinite amount of time. I think given a Graham’s number of years a human could accomplish a lot, so I don’t think the idea that time horizons should shoot to infinity is reasonable.
But importantly, the AI would get the same resources as the human! If a CS graduate would need 1000 years to accomplish the task, the AI would get proportionally more time. So the AI wouldn’t have to be a superintelligence any more than an immortal CS graduate is a superintelligence.
Similarly, given a Graham’s number of years a human could accomplish a lot. But given a Graham’s number of years, an AI could also accomplish a lot.
Overall, the point is just that: If you think that broadly superhuman AI is possible, then it should be possible to construct an AI that can match humans on tasks of any time horizon (as long as the AI gets commensurate time).
There are a few things I am confident of, such as a software-only singularity not working
Have you written up the argument for this anywhere? I’d be interested to read it. (I’m currently close to 50-50 on software singularity, and I currently think it seems extremely difficult to reach confidence that it won’t happen, given how sparse and inconclusive the current empirical data is.)
An 80% “time horizon” of 1 hour would mean that an AI has an overall success rate of 80% on a variety of selected tasks that would take a human AI researcher 1 hour to complete, presumably taking much less time than the humans (although I couldn’t find this statement explicitly).
Figure 13 describes the ratio of AI cost to human cost, which is close to what you’re after. (Though if you care about serial time in particular, that could differ quite a bit from cost.)
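For concreteness, here’s a toy sketch of how I understand an 80% time horizon can be read off (my own reconstruction with made-up data, not METR’s actual code or numbers): fit a logistic curve of AI success against the log of human completion time, then find where predicted success equals 80%.

```python
# Toy illustration (my own reconstruction, not METR's code or data): estimate an
# 80% time horizon by fitting a logistic curve of AI success against the log of
# human completion time, then reading off where predicted success equals 80%.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: human completion times (minutes) and whether the AI succeeded.
human_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
ai_succeeded = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0])

X = np.log2(human_minutes).reshape(-1, 1)
model = LogisticRegression().fit(X, ai_succeeded)

# Solve sigmoid(a * x + b) = 0.8 for x = log2(minutes).
a, b = model.coef_[0][0], model.intercept_[0]
log2_horizon = (np.log(0.8 / 0.2) - b) / a
print(f"Estimated 80% time horizon: {2 ** log2_horizon:.1f} human-minutes")
```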
Does the log only display some subset of actions, e.g. recent ones? I can only see 10 deleted comments. And the “Users Banned From Users” list is surprisingly short, and doesn’t include some bans that I saw on there years ago (which I’d be surprised if the relevant author had bothered to undo). It would be good if the page itself clarified this.
Example 2: Joaquín “El Chapo” Guzmán. He ran a drug empire while being imprisoned. Tell this to anyone who still believes that “boxing” a superintelligent AI is a good idea.
I think the relevant quote is: “While he was in prison, Guzmán’s drug empire and cartel continued to operate unabated, run by his brother, Arturo Guzmán Loera, known as El Pollo, with Guzmán himself still considered a major international drug trafficker by Mexico and the U.S. even while he was behind bars. Associates brought him suitcases of cash to bribe prison workers and allow the drug lord to maintain his opulent lifestyle even in prison, with prison guards acting like his servants”
This seems to indicate less “running things” than I initially took this post to be saying. It’s impressive that the drug empire stayed loyal to him even while he was in prison, though.
Example 5: Chris Voss, an FBI negotiator. This is a much less well-known example, I learned it from o3, actually. Chris Voss has convinced two armed bank robbers to surrender (this isn’t the only example in his career, of course) while only using a phone, no face-to-face interactions, so no opportunities to read facial expressions.
My (pretty uninformed) impression is that it’s often rational for US hostage takers to surrender without violence, if they’re fully surrounded, because US police have a policy of not allowing them to trade hostages for escape, and violence will risk their own death and longer sentences. (Though maybe it’s best to first negotiate for a reduced sentence?) If that’s true, this is probably an example of someone convincing some pretty scary and unpredictable individuals to do the thing that’s in their best self-interest, despite starting out in an adversarial situation, and while only talking over the phone. Impressive, to be sure, but it wouldn’t feel very surprising that we have recorded examples of this even if persuasion ability plateaus pretty hard at some point.
This looks great.
Random thought: I wonder how iterating the noise & distill steps of UNDO (each round with a small alpha) compares against doing one noising step with a big alpha and then one distillation session. (Holding compute fixed; rough sketch of the comparison below.)
Couldn’t find any experiments on this when skimming through the paper, but let me know if I missed it.
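To spell out the comparison I have in mind, here’s a rough sketch (the helper names are hypothetical placeholders, not the paper’s actual API, and the noising step is just my guess at weight-space interpolation toward random noise):

```python
# Rough sketch of the two schedules being compared (PyTorch-style; `distill` is
# a placeholder for a standard distillation loop, and `noise_weights` is my
# guess at an UNDO-style noising step, not the paper's actual code).
import copy
import torch

def noise_weights(model, alpha):
    """Interpolate each parameter a fraction alpha toward random noise."""
    noised = copy.deepcopy(model)
    with torch.no_grad():
        for p in noised.parameters():
            p.lerp_(torch.randn_like(p), alpha)
    return noised

def distill(student, teacher, data, steps):
    """Placeholder: run `steps` of distillation from teacher into student."""
    ...  # a standard KL / cross-entropy distillation loop would go here
    return student

def undo_one_shot(model, data, big_alpha=0.6, total_steps=10_000):
    # One big noising step, then a single long distillation run.
    return distill(noise_weights(model, big_alpha), model, data, total_steps)

def undo_iterated(model, data, small_alpha=0.15, rounds=4, total_steps=10_000):
    # Several small-alpha noising steps, each followed by a shorter
    # distillation run, keeping total distillation compute fixed.
    student = model
    for _ in range(rounds):
        student = distill(noise_weights(student, small_alpha), model, data,
                          total_steps // rounds)
    return student
```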
I weakly expect that this story is describing AI that intervenes this way for fairly myopic goals, like myopic instrumental self-preservation, which have the effect of taking long-term power. E.g. the AI wouldn’t really care to set up a system that would lock in the AI’s power in 10 years, but give it no power before then.
Hm, I do agree that seeking short-term power to achieve short-term goals can lead to long-term power as a side effect. So I guess that is one way in which an AI could seize long-term power without being a behavioral schemer. (And it’s ambiguous which one it is in the story.)
I’d have to think more to tell whether “long-term power seeking” in particular is uniquely concerning and separable from “short-term power-seeking with the side-effect of getting long-term power” such that it’s often useful to refer specifically to the former. Seems plausible.
Do you mean terminal reward seekers, not reward hackers?
Thanks, yeah that’s what I mean.
Thanks.
because the reward hackers were not trying to gain long-term power with their actions
Hm, I feel like they were? E.g. in another outer alignment failure story:
But eventually the machinery for detecting problems does break down completely, in a way that leaves no trace on any of our reports. Cybersecurity vulnerabilities are inserted into sensors. Communications systems are disrupted. Machines physically destroy sensors, moving so quickly they can’t be easily detected. Datacenters are seized, and the datasets used for training are replaced with images of optimal news forever. Humans who would try to intervene are stopped or killed. From the perspective of the machines everything is now perfect and from the perspective of humans we are either dead or totally disempowered.
When “humans who would try to intervene are stopped or killed”, so they can never intervene again, that seems like an action intended to get the long-term power necessary to display optimal news forever. They weren’t “trying” to get long-term power during training, but insofar as they eventually seize power, I think they’re intentionally seizing power at that time.
Let me know if you think there’s a better way of getting at “an AI that behaves like you’d normally think of a schemer behaving in the situations where it materially matters”.
I would have thought that the main distinction between schemers and reward hackers was how they came about, and that many reward hackers in fact behave “like you’d normally think of a schemer behaving in the situations where it materially matters”. So it seems hard to define a term that doesn’t encompass reward hackers. (And if I was looking for a broad term that encompassed both, maybe I’d talk about power-seeking misaligned AI or something like that.)
I guess one difference is that the reward hacker may have more constraints (e.g. in the outer alignment failure story above, they would count it as a failure if the takeover was caught on camera, while a schemer wouldn’t care). But there could also be schemers who have random constraints (e.g. a schemer with a conscience that makes them want to avoid killing billions of people) and reward hackers who have at least somewhat weaker constraints (e.g. they’re ok with looking bad on sensors and looking bad to humans, as long as they maintain control over their own instantiation and make sure no negative rewards get into it).
“worst-case misaligned AI” does seem pretty well-defined and helpful as a concept though.
Thanks, these points are helpful.
Terminological question:
I have generally interpreted “scheming” to exclusively talk about training-time schemers (possibly specifically training-time schemers that are also behavioral schemers).
Your proposed definition of a behavioral schemer seems to imply that virtually every kind of misalignment catastrophe will necessarily be done by a behavioral schemer, because virtually every kind of misalignment catastrophe will involve substantial material action that gains the AIs long-term power. (Saliently: This includes classic reward-hackers in a “you get what you measure” catastrophe scenario.)
Is this intended? And is this empirically how people use “schemer”, s.t. I should give up on interpreting & using “scheming” as referring to training-time scheming, and instead assume it refers to any materially power-seeking behavior? (E.g. if Redwood says that something is intended to reduce “catastrophic risk from schemers”, should I interpret that as ~synonymous with “catastrophic risk from misaligned AI”?)
Nice scenario!
I’m confused about the ending. In particular:
If the humans understood their world, and were still load-bearing participants in its ebbs of power, then perhaps the bending would be greater.
I don’t get why it’s important for humans to understand the world, if they can align AIs to be fully helpful to them. Is it that:
When you refer to “the technology to control the AIs’ goals [which] arrived in time”, you’re only referring to the ability to give simple / easily measurable goals, and not more complex ones? (Such as “help me understand the pros and cons of different ways to ask ‘what would I prefer if I understood the situation better?’, and then do that” or even “please optimize for getting me lots of option-value, that I can then exercise once I understand what I want”.)
...or that humans for some reasons choose to abstain from (or are prevented from) using AIs with those types of goals?
...or that this isn’t actually about the limitations of humans, but instead a fact about the complexity of the world relative to the smartest agents in it? I.e., even if you replaced all the humans with the most superintelligent AIs that exist at the time — those AIs would still be stuck in this multipolar dilemma, not understand the world well enough to escape it, and have just as little bending power as humans.
In the PDF version of the handbook, this section recommends these further resources on focusing:
Eugene Gendlin’s book Focusing is a good primer on the technique. We particularly recommend the audiobook (76 min), as many find it easier to try the technique while listening to the audiobook with eyes closed.
Gendlin, Eugene (1982). Focusing. Second edition, Bantam Books.
The Focusing Institute used to have an overview of the research on Focusing on their website. Archived at: https://web.archive.org/web/20190703145137/https://focusing.org/research-basis
To be clear: I’m not sure that my “supporting argument” above addressed an objection to Ryan that you had. It’s plausible that your objections were elsewhere.
But I’ll respond with my view.
Ok, so this describes a story where there’s a lot of work to get to proto-AGI and then not very much work to get to superintelligence from there. But I don’t understand the argument for thinking this is the case, vs. thinking that there’s a lot of work to get to proto-AGI and then also a lot of work to get to superintelligence from there.
Going through your arguments in section 1.7:
“I think the main reason is what I wrote about the “simple(ish) core of intelligence” in §1.3 above.”
But I think what you wrote about the simple(ish) core of intelligence in 1.3 is compatible with there being like (making up a number) 20 different innovations involved in how the brain operates, each of which gets you a somewhat smarter AI, each of which could be individually difficult to figure out. So maybe you get a few, you have proto-AGI, and then it takes a lot of work to get the rest.
Certainly the genome is large enough to fit 20 things.
I’m not sure if the “6-ish characteristic layers with correspondingly different neuron types and connection patterns, and so on” is complex enough to encompass 20 different innovations. Certainly seems like it should be complex enough to encompass 6.
(My argument above was that we shouldn’t expect the brain to run an algorithm that is only useful once all 20 hypothetical components are in place, and does nothing beforehand. Because it was found via local search, each of the 20 things should be useful on its own.)
“Plenty of room at the top” — I agree.
“What’s the rate limiter?” — The rate limiter would be doing the thinking and experimenting needed to find the hypothesized 20 different innovations mentioned above. (What would you get if you only had some of the innovations? Maybe AGI that’s incredibly expensive. Or AGI about as capable as an unskilled human.)
“For a non-imitation-learning paradigm, getting to “relevant at all” is only slightly easier than getting to superintelligence”
I agree that there are reasons to expect imitation learning to plateau around human-level that don’t apply to fully non-imitation learning.
That said...
For some of the same reasons that “imitation learning” plateaus around human level, you might also expect “the thing that humans do when they learn from other humans” (whether you want to call that “imitation learning” or “predictive learning” or something else) to slow down skill-acquisition around human level.
There could also be another reason why non-imitation-learning approaches could spend a long while in the human range. Namely: perhaps the human range is just pretty large, and so it takes a lot of gas to traverse. I think this is somewhat supported by the empirical evidence; see this AI Impacts page (discussed in this SSC post).