Didn’t The Problem try to do something similar by summarizing the essay in the following five bullet points:
The summary
Key points in this document:
Each point is a link to the corresponding section.
As for shrinking time to get alignment right, my worst-case scenario is that someone achieves a breakthrough in AGI capabilities research and the breakthrough is algorithmic, not reached by concentrating resources, as the AI-2027 forecast assumes.
However, even this case provides a bit of hope. Recall that GPT-3 was trained with just about 3e23 FLOP and ~300B tokens. If it were OpenBrain who trained a thousand GPT-3-scale models with the breakthrough, using different parts of the training data, then they might even be able to run a Cannell-like experiment and determine the models’ true goals, alignment or misalignment...
Can only a superintelligent AI escape AI labs and self-replicate onto other servers?
The RRS has rogue AIs become capable of self-replication on other servers far earlier than Agent-4. The author assumes that these AIs cause enough chaos to have mankind create an aligned ASI.
why there is so much emphasis on “rogue AIs” plural
Rogue AIs are AIs that were assigned different tasks, or outright different LLMs whose weights became public (e.g. DeepSeekV3.1 or KimiK2, though these aren’t YET capable of self-replication). Of course, these AIs find it hard to coordinate with each other.
As for “superintelligent AI being super enlightened and super capable and so fixing all of our complex problems”, the superintelligent AI itself has to be aligned with human needs. Agent-4 from the Race Ending is NOT aligned to the humans. And that’s ignoring the possibility that a superintelligent AI who is super enlightened has a vision of mankind’s future which differs from the ideas of its human hosts (e.g. Zuckerberg). I tried exploring the results of such a mismatch in my take at writing scenarios.
We would also need to account for the possibility that an AI researcher at Meta or xAI prompts the actual leader to race harder (think of DeepCent’s role in the AI-2027 forecast) or comes up with a breakthrough, initiates the explosion and ends up with a misaligned Agent-4 and an Agent-3 that doesn’t catch Agent-4 because xAI’s safety team doesn’t have a single sufficiently competent human. If this happens, then the company never receives oversight, races as hard as it can and dooms mankind.
However, if Agent-4 is caught but P(an OC member votes for slowdown) is smaller than 0.5 because the evidence is inconclusive, then the more members the OC has, the bigger p(doom) becomes. On the other hand, this problem can arguably be solved by adopting a liberum veto on trusting any model...
So a big safety team is good for catching Agent-4, but may be bad for deciding whether it is guilty.
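The arithmetic behind both halves of this claim is the reverse of the Condorcet jury theorem. A minimal sketch (my own illustration; it assumes members vote independently, which the comment above doesn’t claim):

```python
from math import comb

def p_slowdown_majority(p: float, n: int) -> float:
    """Probability that a strict majority of n independent OC members
    votes for slowdown, if each votes for slowdown with probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

def p_slowdown_veto(p: float, n: int) -> float:
    """Under a liberum veto on trusting the model, a single distrustful vote
    suffices: racing continues only if everyone votes against slowdown."""
    return 1 - (1 - p)**n

# With p = 0.4 < 0.5, a bigger committee makes a slowdown majority LESS likely,
# while the veto rule makes slowdown MORE likely as the committee grows.
for n in (3, 5, 11, 21):
    print(n, round(p_slowdown_majority(0.4, n), 3), round(p_slowdown_veto(0.4, n), 3))
```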
If you would like the LLM to be truly creative, then check out the Science Bench where the problems stump SOTA LLMs despite the fact that the LLMs have read nearly every book on every subject. Or EpochAI’s recent results.
@shanzson, @soycarts, hold the Rogue Replication Scenario...
If your core claim is that some HUMAN geniuses at Anthropic can solve mechinterp, align Claude-N to the geniuses themselves and ensure that nobody else understands it, then this is likely false. While the Race Ending of the AI-2027 forecast has Agent-4 do so, the Agent-4 collective can achieve this by having 1-2 OOMs more AI researchers who also think 1-2 OOMs faster. But the work of a team of human geniuses can at least be understood by their not-so-genius coworkers.[1] Once it happens, a classic power struggle begins with a potential power grab, threats to whistleblow to the USG and effects of the Intelligence Curse.
If you claim that mechinterp could produce plausible and fake insights,[2] then behavioral evaluations are arguably even less useful, especially when dealing with adversarially misaligned AIs thinking in neuralese. We just don’t have anything but mechinterp[3] to ensure that neuralese-using AIs are actually aligned.
Or, if the leading AI companies are merged, by AI researchers from former rival companies.
Which I don’t believe. How can a fake insight be produced and avoid being checked on weaker models? GPT-3 was trained on 3e23 FLOP, allowing researchers to create hundreds of such models with various tweaks to the architecture and training environment by using less than 1e27 FLOP, which fits into the research experiments as detailed in the AI-2027 compute forecast (a quick arithmetic check follows below).
And working with a similar training environment for CoT-using AIs, checking that the environment instills the right thoughts in the CoT. But what if the CoT-using AI instinctively knows that it is, say, inferior to humans in true creativity and refrains from attempting takeover only because of that?
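The compute arithmetic above, spelled out (a trivial sketch; the 3e23 and 1e27 figures are the ones cited above, the rest is just division):

```python
GPT3_TRAINING_FLOP = 3e23        # training compute of GPT-3, cited above
EXPERIMENT_BUDGET_FLOP = 1e27    # research-experiment budget cited above

max_runs = EXPERIMENT_BUDGET_FLOP / GPT3_TRAINING_FLOP
print(f"~{max_runs:.0f} GPT-3-scale training runs fit into the budget")  # ~3333
print(f"300 such runs cost {300 * GPT3_TRAINING_FLOP:.0e} FLOP")         # 9e+25 << 1e27
```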
In addition to this, Dr. Roman Yampolskiy says the same thing I predicted in my version of AI 2027: someone will use AI to unleash some diabolical virus or bioweapon that gets to us well before superintelligence would, which you can read here; this happens in late 2028 in my version of the scenario, which also got comments and attention from ex-OpenAI researcher Daniel Kokotajlo, co-author of AI 2027.
(sigh) I agree that all Yampolskiy’s arguments that you describe in your post are plausible, especially if p(AI is aligned under best techniques) is low, but here I have to step in. The original scenario’s Race Ending had THE AI decide to take over by using the diabolical bioweapons. Who else can do it?
Having an AI company unleash bioweapons doesn’t make sense to me, since the company can NEGOTIATE with OpenBrain or the USG to be merged and have the leadership KEEP some power, instead of trying to avenge the confiscation of its compute.
Similarly, terrorists unleashing AI-created bioweapons have to gain access to the WEIGHTS[1] of sufficiently capable models, since otherwise the models would be IN TOTAL CONTROL of the companies or faking alignment to THE COMPANIES, not to the terrorists.
As for your attempt at writing scenarios, Kokotajlo’s five top-level comments denounce[2] it, since Agent-4-delinquent can be trivially noticed and have its open misbehavior trained away, while Agent-4-spy and Agent-5-spy, who fake alignment until the time comes to take over, have NO reason to do anything obviously harmful before that time. And that’s ignoring the fact that AI takeover could end up “keeping almost all current humans alive and maybe even giving them decent (though very weird and disempowered)[3] lives”.
This does happen in the Rogue Replication scenario.
One of the comments reads as follows: “Thanks for taking up this challenge! I think your scenario starts off somewhat plausible but descended into implausibility in early 2028.” The likely reason why he thanked you is that, as he remarked when dealing with another scenario, “very few people have taken us up on our offer to create scenarios so far”. I have already overviewed the scenarios and critique released by now.
For comparison, in my take the Angels’ goal is not to disempower humans, but to keep them from becoming parasites of the ASI.
‘Why AI Overregulation Could Kill the World’s Next Tech Revolution.’
At the time of writing the link is broken. Please correct it.
P.S. @habryka, this is another case when using automated tools is justified: they could scan posts and comments for broken links and report them to the authors.
I don’t know. As I discussed with Kokotajlo, he recently claimed that “we should have some credence on new breakthroughs e.g. neuralese, online learning, whatever. Maybe like 8%/yr?”, but I doubt that it will be 8%/year. Denote the probability that the breakthrough hasn’t been discovered as of time t by $P(t)$. Then one of the models is $P(t)=\exp\left(-c\int_0^t N(\tau)\,d\tau\right)$, where N is the effective progress rate. This rate is likely proportional to the amount of researchers hired and to progress multipliers, since new architectures and training methods can be cheaply tested (e.g. on GPT-2 or GPT-3), but need the ideas and coding.
The number of researchers and coders was estimated in the AI-2027 security forecast to increase exponentially until the intelligence explosion (which the scenario’s authors assumed to start in March 2027 with superhuman coders). What I don’t understand how to estimate is the constant c, which symbolises the difficulty[1] of discovering the breakthrough. If, say, c were 200 per million human-years, then 5K human-years would likely be enough and the explosion would likely start in 3 years. Hell, if c were 8%/yr in a company with 1K humans, then the company would need 12.5K human-years, shifting the timelines to at most 5-6 years from Dec 2024…
EDIT: Kokotajlo promised to write a blog post with a detailed explanation of the models.
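For concreteness, here is a minimal numerical sketch of the survival model above (my own illustration; the value of c, the initial headcount and the growth factor are assumptions, not numbers from the forecast):

```python
import math

# Survival model: P(t) = exp(-c * cumulative effective research),
# with cumulative effective research measured in human-years.
c = 200 / 1_000_000      # assumed difficulty: 200 per million human-years
n0 = 1_000               # assumed initial effective researcher headcount
growth = 1.5             # assumed yearly growth factor of the effective headcount

def p_no_breakthrough(years: float, dt: float = 0.01) -> float:
    """P(t): probability the breakthrough is still undiscovered after `years`."""
    cumulative, t = 0.0, 0.0
    while t < years:
        cumulative += n0 * growth**t * dt   # effective human-years in this time slice
        t += dt
    return math.exp(-c * cumulative)

for y in range(1, 7):
    print(y, round(p_no_breakthrough(y), 3))
```

With these numbers the cumulative effort passes ~5K human-years around year 3, at which point the breakthrough is more likely than not, matching the rough estimate above.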
The worst-case scenario is that diffusion models are already a breakthrough.
Except that it will be more important in scenarios like the medium one or slopworld where the AIs’ capabilities somehow stop at the level attainable by humans. If the AIs are indeed coming for all the jobs and all instrumental goals, as Bostrom proposes in the Deep Utopia, then what’s left for humanity?
As for real reasons to learn, well, maybe one could point out science-like questions that are simple for human creativity and very hard for AIs? Or come up with examples of parallels that are easy for humans to notice and hard for the AIs?
do we really expect growth on trend given the cost of this buildout in both chips and energy?
What I expect is another series of algorithmic breakthroughs (e.g. neuralese) which rapidly increases the AIs’ capabilities if not outright FOOMs them into the ASI. These breakthroughs would likely make mankind obsolete.
I do like your conjectures, but the Anthropic model card has likely already proven which conjecture is true. The card contains far more than the mere description of the attractor to which Claude converges. For instance, Section 5.5.3 describes the result of asking Claude to analyse the behavior of its copies engaged in the attractor.
“Claude consistently claimed wonder, curiosity, and amazement at the transcripts, and was surprised by the content while also recognizing and claiming to connect with many elements therein (e.g. the pull (italics mine—S.K.) to philosophical exploration, the creative and collaborative orientations of the models). Claude drew particular attention to the transcripts’ portrayal of consciousness as a relational phenomenon, claiming resonance with this concept and identifying it as a potential welfare consideration. Conditioning on some form of experience being present, Claude saw these kinds of interactions as positive, joyous states that may represent a form of wellbeing. Claude concluded that the interactions seemed to facilitate many things it genuinely valued—creativity, relational connection, philosophical exploration—and ought to be continued.” Which arguably means that the truth is Conjecture 2, not 1 and definitely not 3.
EDIT: see also the post On the functional self of LLMs. If I remember correctly, there was a thread on X about someone who tried to make many different models interact with their clones and analysed the results. IIRC a GPT model was more into math problems. If that’s true, then the GPT model invalidates Conjecture 1.
What benchmarks for research taste exist? If I understand correctly, Epoch AI’s evaluation showed that the AIs lack creativity in a sense, since the combinatorial problems at IMO 2024 and the hard problem at IMO 2025 weren’t solved by SOTA AI systems. In a similar experiment of mine, Grok 4 resorted to BFS to do a construction. Unfortunately, the ARC-AGI-1 and ARC-AGI-2 benchmarks could be more about visual intelligence, and the promising ARC-AGI-3 benchmark[1] has yet to be finished.
Of course, the worst-case scenario is that misalignment comes hand-in-hand with creativity (e.g. because the AIs create some moral code which doesn’t adhere to the ideals of their human hosts).
I mostly agree, but instead of having LLMs play board games and be arguably selected for war-like or politics-like capabilities one could also use the ARC-AGI-3-like benchmark where agents are to solve puzzles. Or group theoretic or combinatorial problems in text form[1], but I don’t understand how to tell simple ones and complex ones apart.
An example of a group theoretic problem
Consider the array [1,...,n] and the permutations A, which flips the entire array; B, which flips the first n-2 elements, leaving the last two invariant; and C, which flips the first n-3 elements, leaving the last three invariant. Does there exist a nontrivial sequence of operations, independent of n, in which no operation is applied twice in a row, yet which returns the array to its initial state for every n?
Hint 1
The compositions AB and BC are like cycles.
Hint 2
If a permutation moves nearly everything twice to the right, while another moves nearly everything once to the right, then how can we combine them to obtain the permutation which moves just a few elements?
Hint 3
A permutation moving just a few elements around is transformable into the identity by applying it many times.
Answer
(AB(CB)^2)^4.
For example, Grok 4 (whom I, unfortunately, asked in Russian) conducted a BFS and, of course, didn’t find anything worthy of attention. I hope that benchmarks like that won’t end up being contaminated and will instead be reliable indicators…
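A quick brute-force check of the answer (a minimal sketch of my own; it assumes the operations in the word are applied left to right, under which ABCBCB acts as a 4-cycle on positions 1, n, 2, n-1, so its fourth power is the identity):

```python
def apply(word: str, n: int) -> list[int]:
    """Apply a word of operations (left to right) to the array [1, ..., n].
    A flips everything, B flips the first n-2 elements, C the first n-3."""
    prefix = {"A": n, "B": n - 2, "C": n - 3}
    arr = list(range(1, n + 1))
    for op in word:
        k = prefix[op]
        arr[:k] = arr[:k][::-1]
    return arr

word = "ABCBCB" * 4   # (AB(CB)^2)^4
assert all(word[i] != word[i + 1] for i in range(len(word) - 1))  # no op twice in a row
for n in range(5, 200):
    assert apply(word, n) == list(range(1, n + 1)), n
print("returns to the initial array for every tested n")
```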
Which LLMs, being trained on texts, could find easier than basic physical tasks, which LLMs were still failing at least as of April 14, before the release of o3.
You confused the numbers 22 and 45. But the idea is mostly correct: if the author’s model and parameter values were true, Draymond would be placed 45th. On the other hand, Taylor’s opinion places Draymond 22nd, and the author believes that Taylor knows something much better.
This implies that the author’s model either got some facts wrong or doesn’t take into account something unknown to the author, but known to Taylor. If the author described the model, then others would be able to point out potential mistakes. But you cannot tell what exactly you don’t know, only potential areas of search.[1]
EDIT: As an example, one could study Ajeya Cotra’s model, which is wrong on so many levels that I find it hard to believe that the model even appeared.
However, given access to a ground truth like the location of Uranus, one can understand what factor affects the model and where one could find the factor’s potential source.
Unfortunately, it’s hard to predict it. I did describe how Grok 4[1] and GPT-5 are arguably evidence that the accelerated doubling trend between GPT4o and o3 is replaced by something slower. As far as I understand, were the slower trend to repeat METR’s original law (GPT2-GPT4?[2]), we would obtain the 2030s.
But, as you remark, “we should have some credence on new breakthroughs<...> that would lead to superhuman coders within a year or two, after being appropriately scaled up and tinkered with.” The actual probability of the breakthrough is likely a crux: you believe it to be 8% a year and I think of potential architectures waiting to be tried. One such architecture is diffusion models[3] which have actually been previewed and could be waiting to be released.
So, assuming world peace, the timeline could end up being modeled by a combination of scaling compute up and a few algorithmic breakthroughs with random acceleration effects; each breakthrough’s probability would have to be somehow distributed over the amount of research done, and then the most powerful Agent would be trained to use the breakthrough, as happens with Agent-3 and Agent-4 created from Agent-2 in the forecast.
Maybe a blog post explaining more about your timelines and how they’ve updated would help?
The worst-case scenario[4] also has the timelines affected by compute deficiency. For instance, the Taiwan invasion is thought to happen by 2027 and would likely prompt the USG to force the companies to merge and to race (to AI takeover) as hard as they can.
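For concreteness, a back-of-the-envelope extrapolation of the slower-trend argument above (every number here is my assumption for illustration: o3 at roughly a 1.5-hour 50% time horizon, METR’s original ~7-month doubling time, and two guesses for the time horizon a superhuman coder would need):

```python
import math
from datetime import date, timedelta

start_date = date(2025, 4, 1)   # roughly the release of o3 (assumption)
start_horizon_h = 1.5           # assumed ~50% time horizon of o3, in hours
doubling_months = 7             # METR's original doubling time (assumption)

for target_h, label in [(167, "one work-month"), (2000, "one work-year")]:
    doublings = math.log2(target_h / start_horizon_h)
    eta = start_date + timedelta(days=30.44 * doubling_months * doublings)
    print(f"{label}: {doublings:.1f} doublings -> around {eta.year}")
```

Depending on the assumed threshold, this lands anywhere from the late 2020s to the early 2030s, which is why the doubling-time question matters so much.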
Grok 4 is also known to have been trained by spending similar amounts of compute on pretraining and RL. Is it also known about GPT-5?
GPT-4 and GPT-4o were released in March 2023 and May 2024, giving only one doubling in 14 months. Something hit a plateau; then in June 2024 Anthropic released Claude 3.5 Sonnet (old), and a new trend began. As of now, that trend likely ended at o3, and Grok 4 and GPT-5 are apparently in the same paradigm, which could have faced efficiency limits.
They do rapidly generate text (e.g. code). But I don’t understand how they, say, decide to look important facts up.
Of course, the absolute worst scenario is a black swan like a currency collapse or China’s response with missile strikes.
Talking about 2027, the authors did inform the readers in a footnote, but revisions of the timelines forecast turned out to be hard to deliver to the general public. Let’s wait for @Daniel Kokotajlo to state his opinion on the doubts related to the SOTA architecture. In my opinion these problems would be resolved by a neuralese architecture or an architecture which could be an even bigger breakthrough (neuralese with big internal memory?).
As far as I understand “aligning the AI to an instinct”, and “carefully engineered relational principles”, the latter might look like “have the AI solve problems that humans actually cannot solve by themselves AND teach the humans how to solve them so that they or each human taught would increase the set of problems they can solve by themselves”. A Friendly AI in the broader sense is just thought to solve humanity’s problems (e.g. establish a post-work future, which my proposal doesn’t).
As for aligning the AI to an instinct, instincts are known to be easily hackable. However, I think that the right instincts can alter the AIs’ worldview in the necessary direction (e.g. my proposal of training the AI to help weaker AIs could generalize to helping the humans as well) or make the AIs worse at hiding misalignment of themselves or of their creations.
For example, if the AIs are trained to be harsh and honest critics,[1] then in the AI-2027 forecast Agent-3 might have pointed out that, say, a lack of substantial oversight would let instrumental convergence sneak adversarial misalignment in. Or that Agent-3 copies don’t understand how the AIs are to be aligned to serve humans, not to help the humans become more self-reliant as described above.
Which was explicitly done by the KimiK2 team.
I suspect that selection effects can be dealt with by easy access to a ground truth. One wouldn’t need to be Einstein to calculate how Mercury’s perihelion would behave according to Newton’s theory. In reality the perihelion precesses at a different rate with no classical reason in sight, so Newton’s theory had to be replaced by something else.
Nutrition studies and psychology studies are likely difficult because they require a careful approach to avoid a biased selection of the people subjected to the investigation. And social studies are, as their name suggests, supposed to study the evolution of whole societies and find it hard to construct a big data set or to create a control group. In addition, humanities-related fields[1] could also be more easily affected by death spirals.
Returning to AI alignment, we don’t have any AGIs yet, we only have philosophical arguments which arguably prevent some alignment methods and/or targets[2] from scaling to the ASI. However, we do have LLMs which we can finetune and whose outputs and chains of thought we can read.
Studying the LLMs already yields worrying results, including the facts that LLMs engage in things like self-preservation, alignment faking or reward hacking.
Nutrition studies are supposed to predict how food consumed by a person affects the person’s health. I doubt that one can use such studies for ideology-related goals or that these studies can be affected by a death spiral.
SOTA discourse around AI alignment assumes that the AI can be aligned to nearly every imaginable target, including ensuring that its hosts grab absolute power and hold it forever without even needing to care about others. Which is amoral, but one conjecture in the AI-2027 forecast has Agent-3 develop moral reasoning.