Why I Think the Current Trajectory of AI Research has Low P(doom) - LLMs

[This is mostly posted as a some thoughts I wanted to check. I apologize that its messy and not complete. I needed to get something out there.]

This post explains the reasons why I think the probability of AGI killing everyone in the next few decades is very low, at least compared to what Yudkowsky argues.

1.1 By Far, Most Progress Toward AGI comes from LLMs

LLMs have offered such a huge boost towards AGI because they bootstrap the ability to reason by mimicing human reasoning.

Achieving this level of intelligence or even general knowledge through RL seems hardly more plausible now than at the inception of RL. Impressive progress has been made (Deepmind’s AdA learning new RL tasks within the XLand 2.0 environment at human timescales) but this is not going to reproduce even BERT-level general knowledge anytime soon—which is essential for the ability to make impactful decisions in the real world.

I think we can also look at self driving to show this problem—even with huge compute budgets and huge dataset and competition in the space, self driving is still not solved. I think this field shows the progress of AI in general since it has high incentives to take advantage of every AI advance—I suspect it has received much more funding than LLMs and between real and simulated datasets, I would imagine the amount of data collected is comparable to internet text. Self-driving seems very simple compared to general intelligence, and yet somehow LLMs are probably(?) more able describe how to act on the road in a wide range of situations that a self driving car. e.g. that example of driving behind a truck transporting stoplights, or using theory-of-mind-adjacent-abilities to anticipate other drivers’ actions.

I would argue that the closest progress to AGI outside LLMs is AI art generation, since text is a subset of images (you could imagine training diffusion on images of LLM datasets and doing in-painting for next-token-prediction). Text-image diffusion models already have general knowledge about physics, lighting, some amount of language, etc.. We’ve even seen that training diffusion models on audio spectrograms has produced SOTA music/voice generation models. However, this is all also based on learning from internet text/images using a denoising loss, so it has a very similar cap to LLMs, since both text and images cover similar ground (text can describe images). Since they are so similar in training and data and LLMs are becoming multimodal with vision anyway I basically count this as part of the LLM paradigm (which at this point I should refer to with an appropriately more umbrella term—self-supervised/compression-loss/denoising-loss internet-data models? ‘Generative models’ seems too wide an umbrella. I’ll just keep calling them LLMs since language is likely to remain the core foundation for AGI-type intelligence, as we’ve seen with current multimodal models like PaLM-e, Bing Chat, Flamingo, is probably the case with GPT-4...)

1.2 LLM Intelligence is Limitated Below X-Risk-level

The Language Modelling loss caps, at best, somewhere around to human intelligence

With enough compute and larger datasets we will see diminishing returns on next-token-prediction loss.

I think that this limits the amount of intelligence learnable from this method to a safe enough level to not kill everyone. To theorise the limits of what intelligence a certain loss could create, I think it’s important to discus what the loss incentivises rather than what it seems likely to result in. That said, I think it’s not implausible that it caps at higher-than-average-human-level AGI, because it incentivizes A) learning broader and deeper knowledge in every field than any individual human and B) gaining an ability to extrapolate (understand in some sense) any patterns which contain causally left-to-right patterns which humans express in plaintext but may not understand e.g. if there were no formula for computing the Nth digit in Pi in the dataset, an LLM would still be incentivised to learn this formula because Pi occurs in the dataset, but if the pattern has no causally left-to-right patterns then the next token can be predicted from the context, so there is no incentivise to understand of the pattern. However, it does not incentivise learning things that humans cannot/do not express in this way e.g. it does not seem to incentivise the ability to solve unsolved Millenium problems, but does incentivise learning the best reasoning that we have for apporoaching them. The limitations seems especially fenced by the inability of NNs (including LLMs) to generalise far out of training distribution, e.g. Chollet-ARC-like challenges and the inability to reason well enough to self-replicate as tested by the (unrelated) ARC institution on the GPT-4 base model.

Techniques like reflection/hindsight and chain-of-thought, do improve performance, but only in rephrasing our tasks to better align with the language modelling loss—the intelligence limits are still imposed by the original language modelling loss.

Because the LLM paradigm has this cap, and because no other AI field seems to have another towards AGI even close to BERT, then we cannot expect progress toward higher-levels of AGI to continue exponentially as LLMs have let us—it simply stops at LLMs until we can think of something else.

Roughly human-level intelligence is probably not enough to kill us all

Being capped somewhere between human level and humanity level AGI is still pretty dangerous, but it does seem to offer some barriers to especially dangerous stuff: it’s not going to be able to discover ‘magic’ like its nothing: it will have as much difficulty finding way to write better AI as we are, so FOOM is not a concern; it will likely remain with some kind of human-like intelligence, and would not be able to do tasks completely impossible to humans like simulating protein folding within its neurons—it will have to use and build tools like AlphaFold, which makes it much more interprettable as to what its doing; the langauge modelling loss pretty strictly only enforces following one-to-a fewish streams of though (e.g. in a conversation), it doesnt seem like.

It also would still reason, as current models do, in human language. This make interprettability extremely easy, you can either check the outputs yourself for any concerning chains-of-thought or potentially dangerous tool invocations, or if it is running too quickly for that you can run sentiment analysis on it. It seems unlikely that it will write a new language to avoid this, although human do construct artificial languages so its not inconceivable, but it seems likely this could be detected with language classifiers or anomaly detection. At any rate, to design this new language it would probably need to reason about how to do this in its current language, which we could detect with semantic analysis.

2.1 LLM Dangers

Regardless of how strong an LLMs abilities to kill everyone is, it would be nice to know it is not incentivised to do so.

The language modelling loss does not incentivise goal seeking, self-preservation or a host of other issues. However, as we’ve seen with ARC’s tests, it does incentivise the ability to simulate text written by somebody with these goals, sub-goals and abilities, and this text can have external tools hooked in to execute these textual actions, e.g. code execution, TaskRabbit assignments. This is kind of indirect instrumental convergence—to meet the LM loss, it must be able to undertstand evil text on the internet and how to replicate it, although its not incentivised to actually do those things, just write about how they might be done. Changing the prompt such that the expected action descriptions are corrigibile or avoid power-seeking will make it avoid these things. These are just observations on the default LLM alignment.

With very small amounts of finetuning/RL and drawing from ‘simulacra’ personalities seen in pretraining, LLMs can be made to act with agentic goals and behaviours more explicitly (I don’t think RL specifically introduces a unique danger because other methods exist to get similar behaviours e.g. supervised finetuning, distillation).

This is where the dangers of LLMs come in. They may be neutral in terms, but when given instructions or finetuned to produce agentic text, etc.. they become perfectly . Its kind of the perfect demonstation of the orthogonality of capabilities and alignment

2.2 LLMs are Especially Alignable

As of yet, finetuning/RL has only been used to steer these models so they produce desired outputs without needing complex prompts, and does not significantly improve their intelligence, often actually reducing it in favour of ease of use. However, this does not stop the fact that this permits taking an LLM which could have average-human-to-humanity-level-AGI and getting it to produce text for nefarious purposes*. However, the chances of this occurring in reality seem unlikely, based on the fact that this model would be extremely expensive to train and would likely be kept under wraps for economic and safety purposes. It would likely be finetuned to do exactly the opposite as we are seeing with GPT-4, which is notably more aligned with the desired behaviours than GPT-3.5, which is of note: it is more capable (I believe 20% improvement on average across all released benchmarks, which are likely generous) but significantly more aligned—it produces 40% less hallucinations.

There are further reasons why I believe LLMs are especially aligneable aside from the fact that RLHF has shown much success in improving alignment to desired behaviours.

LLMs are neutral—Expanding from earlier, the only action LLMs have is generating text (regardless of if this is used downstream for something else e.g. tools) and the evilness or goodnes of the text they produce is completely dependant on how they are prompted. They do not inherently produce text which describes goal oriented actions, they simply can be prompted to do so. In this sense, it is very easy to get them to act how we want by iterating on simple alignment techniques like prompts and RL/finetuning.

RLHF from scratch—This is unexplored and research conjecture, but it seems to me that the majority cause for misalignment in LLMs with RLHF comes from something shown in the shoggoth meme—it is pretrained with the ability to simulate many capabilities and ideas and goal-pursuits which are harmful and cannot be removed with a small amount of RLHF, although this RLHF is a good enough alignment technique to incentivise the behaviours we want, (with the exception of very subtle bad behaviours that we cannot easily identify on scale for an RLHF dataset, such as truthfulness, but as of yet only identify during more thorough evaluation). However, with enough compute and human preference data, and assuming you can train/finetune/prompt a standard LLM to accurately classify text as safe or unsafe, it seems plausible that an LLM could be trained from scratch using RLHF/RLAIF to follow human preferences more more strongly, possibly even making later danger of finetunability-for-nefarious-purposes much less fruitful. Since there is a finite set of text that an LLM could output that is in human preferences, and NNs generalise, we only need to get enough data to train both the RLAIF reward model and target LLM to learn thi distinctio before outer alignment is converged on in a good-enough way. As of yet, inner alignment does not seem to be an issue with these models due to the density of reward. The only pace this seems to break down is where it proposes text which seems aligne dbut is actually not. Luckily this seems to end at the extreme of deeption for subtle facts, but does not incentivise lying about everything. It also seems unlikely that this incentivises finding deceptive ways to write actions which might result in seriously harmful actions—if we can’t tell if an action is harmful or not it seems unlikely that it will be harmful, at least as far as text goes. Another great nicety is, training on RLHF from scratch means we can define everything about how we want it to behave, since alignment is now the sole objective function—be corrigible, ask before you do anything world-changing, always question if you are following human values and ask humanity if you are.

OOD—AFAIK This is the cause for much alignemnt issues. In the case of LLMs, this is the question of getting LLMs to deal with adversarial prompts. A good thing about LLMs is that they operate only in text space (and even those that are made multi-modal can use frozen LMs, preserving alignment) for which it is easy to simulate a wide variety of situations to train against—even fictional scenarios. The text received from real world sources is therefore actually a subset of all possible text, which is a very unique advantage to the alignment issue for LLMs. We can actually have a simulated training training and testing in a wider distribution than the reality it will be put in, which still does not solve the OOD problem since training/testing text is also limited by our imaginations and reality may be outside of our imaginations and therefore be OOD (as is likely the case for OpenAIs ChatGPT and limited RLHF data). However, if that becomes the case and we see the world is chaning significantly, we can simply freeze the model before before the world becomes significantly OOD, add the new world data back into training data, and finetune to it as an OOD precaution.

Interpettability—The fact that we can read LLMs chain-of-thought reasoning as during ARC’s tests of GPT-4, and can get an unsupervised measure of output truthfulness from them[1] seem good enough to catch deceptive behaviour. We can simply ask its plans, ideals, alignment and check if its lying to us. It can’t control its own weights anymore than we can—it has no actuator to affect them.

FOOM—Several possible lines of defense:

From the above, it seems that an LLM could be aligned via RLH/AIF such that it always asks before making FOOM or even avoid it altogether, especially because more performant LLMS are respond better to instruction-following finetuning
For similar reasons, it seems possible to align LLMs good enough that if they code FOOM it will attempt to align it close enough to its values which are close enough to human values. e.g. that LangChain Twitter post where somone asked an LLM to make a papercliip maximiser and it discussses the dangers, concludes its too dangerous and refuses to do it
The whole process of FOOM-by-LLM would be interprettable—while this LLM may theoretically be as smart as all of humanity, it still thinks in our language in chain-of-thought. It also would program this FOOM like how LLMs code now—simply writing code in existing languages using existing techniques, usually with lots of comments
FOOM, imo, is unlikely to be possible in the first place. As discussed earlier, I think humanity is capped at making humanity-level AI via LLMs for a very long time. If we as a human collective have this limit, and assuming the LLM is capped at humanity-level-intelligence, it will not find this much easier. The logic of ‘if we make AI which is smarter than us, then it can do what we do better, including write AI better’ hits a wall if the action cannot be made. An example simple FOOM would be to use itself as the reward model for a successor LLM, basically just training it on a cleaner dataset, which wouldn’t be an insignificant improvement (we know just dataset cleaning can actually beat scaling laws[2]), but doesn’t sound like it would scale to a type of disaster FOOM, and the whole process seems like it would be about as interpretable as it was the first time.

Considering the above we get: LLMs seem incapable of killing everyone because the intelligence ceiling of langauge modelling/self supervised learning on internet data doesnt seem high enough (for similar reasons, are also probably not smart enough to thwart all conceivable stop buttons); we have interpretability tools to detect if they are at least trying to, allowing us to rollback or press the stop button; we can probably align them well enough to human values to have them not attempt this; AI research seems mostly in the hands of people who would attempt to align it this way and it is unlikely a nefarious lab has the abilities to produce these; if FOOM is possible, it seems likely we can RLHF LLMs to do it in a human interprettable, supervised way, and can use basic inteprettability (read the chain of thought) to check if it ever tries to do this on its own.

How I imagine this AI research trajectory planning out is that we continue making larger LLMs able to use more tools, inlcuding multimodal input/output, and we will see some amount of economic turmoil both in terms of workplace displacement and as a result of their wide adoption despite their brittleness, though this will probably improve. Large labs will continue to be at the forefront of training the bulk of reasoning capabilities into them via self-supervised losses nn human and environmental data (text, vision, audio..). Eventually, one day one of the large labs, having achieved GPT-N, will send the base model to ARC for evaluation and they will find that it can self-replicate and can coordinate with copies of itself to do actually dangerous things. It seems likely that this will concern them and they will invest in better RLHF techniques. Will they still release it if its still imperfectly aligned? I think its very unlikely to be released if its not thoroughly aligned and tested given the discourse atm, but its not unimaginable. Will this kill everyone? Seems very unlikely it would be released if it were that badly aligned. But I don’t think so due to the aforementioned ~human intelligence cap. Having it actually do unaligned things in the real world, once released, also requires someone intentionally adversarially prompt it to do this, which also reduces that chance. This is becoming more and more speculative so I’m going to end it here, but, my point is assuming we don’t find some AI breakthrough that leads to human-like intelligence other than LLMs, which uses some loss thats much more agentic (e.g. simulate reality and do evolution until its smarter than us all) or leave much less ability to know whats in the dataset (in the previous example thats true because in RL whats in its training it partially determined by itself), etc.. this timeline seems fairly likely to be OK to me.

[1] Discovering Latent Knowledge in Language Models Without Supervision, https://openreview.net/forum?id=ETKGuby0hcs

[2] Beyond neural scaling laws: beating power law scaling via data pruning
https://arxiv.org/abs/2206.14486