LLMs seem (relatively) safe

Link post

Post for a somewhat more general audience than the modal LessWrong reader, but gets at my actual thoughts on the topic.

In 2019 OpenAI defeated the reigning world champions of Dota 2, a major esports game. That came a few years after DeepMind’s AlphaGo victory over Lee Sedol in 2016, which achieved superhuman Go performance way before anyone thought that might happen. AI benchmarks were being cleared at a pace which felt breathtaking at the time, papers were proudly published, and ML tools like TensorFlow (released in 2015) were coming online. To people already interested in AI, it was an exciting era. To everyone else, the world was unchanged.

Now Saturday Night Live sketches use sober discussions of AI risk as the backdrop for their actual jokes, there are hundreds of AI bills moving through the world’s legislatures, and Eliezer Yudkowsky is featured in Time Magazine.

For people who have been predicting, since well before AI was cool (and now passe), that it could spell doom for humanity, this explosion of mainstream attention is a dark portent. Billion dollar AI companies keep springing up and allying with the largest tech companies in the world, and bottlenecks like money, energy, and talent are widening considerably. If current approaches can get us to superhuman AI in principle, it seems like they will in practice, and soon.

But what if large language models, the vanguard of the AI movement, are actually safer than what came before? What if the path we’re on is less perilous than what we might have hoped for, back in 2017? It seems that way to me.

LLMs are self-limiting

To train a large language model, you need an absolutely massive amount of data. The core thing these models are doing is predicting the next few letters of text, over and over again, and they need to be trained on billions and billions of words of human-generated text to get good at it.
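
To make that concrete, here’s a toy sketch of the objective in Python. Real LLMs do this with enormous neural networks over tokens rather than by counting characters, so treat it purely as an illustration of the shape of the task: see some text, guess what comes next, repeat.

```python
# A toy illustration of the objective: read text, learn which character tends
# to follow which, then generate by repeatedly guessing the next character.
# Real LLMs use giant neural networks over tokens, not counts over characters.
from collections import Counter, defaultdict

training_text = "the cat sat on the mat. the dog sat on the log."

# "Training": tally how often each character follows each character.
follows = defaultdict(Counter)
for current, nxt in zip(training_text, training_text[1:]):
    follows[current][nxt] += 1

def predict_next(text: str) -> str:
    """Guess the most likely next character, given only the last one seen."""
    last = text[-1]
    if last not in follows:
        return " "  # unseen context: fall back to a space
    return follows[last].most_common(1)[0][0]

# "Generation": append the model's best guess, over and over.
generated = "th"
for _ in range(20):
    generated += predict_next(generated)
print(generated)  # a model this tiny mostly babbles "the the the..."
```

A model this small mostly babbles, but the same guess-what-comes-next objective, scaled up enormously, is what today’s chatbots are built on.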

Compare this training process to AlphaZero, DeepMind’s algorithm that superhumanly masters Chess, Go, and Shogi. AlphaZero trains by playing against itself. While its predecessor AlphaGo bootstrapped itself on records of countless human games, AlphaZero simply learns by doing. That means the only bottleneck for training it is computation: given enough energy, it can just play itself forever and keep generating new data. Not so with LLMs: their source of data is human-produced text, and human-produced text is a finite resource.

The precise datasets used to train cutting-edge LLMs are secret, but let’s suppose they include a fair bit of the low-hanging fruit: maybe 5% of the text that is in principle available and not garbage. In that case you can schlep your way to a 20x bigger dataset, though you’ll hit diminishing returns as you have to, for example, generate transcripts of random videos and filter old mailing list threads for metadata and spam. But nothing you do is going to get you 1,000x the training data, at least not in the short run.

Scaling laws are among the watershed discoveries of ML research in the last decade; basically, these are equations that project how much oomph you get out of increasing a model’s size, training time, and dataset. And as it turns out, the amount of high-quality data is extremely important, and often becomes the bottleneck. It’s easy to take this fact for granted now, but it wasn’t always obvious! If computational power or model size were usually the bottleneck, we could just build bigger and bigger computers and reliably get smarter and smarter AIs. But that only works up to a point, because it turns out we need high-quality data too, and high-quality data is finite (and, as the political apparatus wakes up to what’s going on, legally fraught).
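
For a feel of what these equations look like, here’s a rough sketch using the parametric form from DeepMind’s Chinchilla paper (Hoffmann et al., 2022), with constants close to the published fit. The exact numbers matter less than the shape: once the data term dominates, adding parameters barely helps.

```python
# A rough sketch of a scaling law: predicted loss as a function of model size
# (N parameters) and dataset size (D training tokens). The functional form and
# constants are approximately the fit reported in the Chinchilla paper; treat
# the numbers as illustrative, not authoritative.
def predicted_loss(n_params: float, n_tokens: float) -> float:
    E = 1.69                 # irreducible loss of natural text
    A, alpha = 406.4, 0.34   # how loss falls as the model grows
    B, beta = 410.7, 0.28    # how loss falls as the dataset grows
    return E + A / n_params**alpha + B / n_tokens**beta

# Hold data fixed at a trillion tokens and keep growing the model: the formula
# says the data term soon dominates, so extra parameters buy less and less.
for n_params in [7e9, 70e9, 700e9]:
    print(f"{n_params:.0e} params, 1e12 tokens -> loss {predicted_loss(n_params, 1e12):.3f}")
```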

There are rumblings about synthetic data: the idea that a strong LLM can generate a bunch of text that’s as good as human text, which can then be fed back in to train future models. And while it’s possible this will work, or has even already been proven to work behind closed doors somewhere, I’m currently skeptical; the whole point of using human-derived data is that human-produced text describes the actual world, and if you slurp up enough of it you end up understanding the world by proxy. Synthetic data would reinforce whatever issues already exist in the model, producing text with the same blind spots over and over again and thereby deepening them. There could be technical solutions to this; again, maybe they’re already underway. But to my nose, as a person not in those private rooms, the notion smells like hype.
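
Here’s a toy caricature of that worry, not a claim about how any real lab uses synthetic data: fit a model to some data, sample from the model, refit on the samples, and repeat. Whatever the model under-samples drifts toward zero, and once a generation sees none of it, it’s gone for good.

```python
# A toy caricature of training on your own outputs: estimate a distribution,
# sample a finite "dataset" from the estimate, re-estimate from those samples,
# and repeat. Whatever the model under-samples drifts toward zero, and once a
# generation sees none of it, it can never come back.
import random

random.seed(0)
true_probs = {"common": 0.900, "uncommon": 0.098, "rare": 0.002}
probs = dict(true_probs)  # generation 0: the human-written data

for generation in range(1, 9):
    sample = random.choices(list(probs), weights=list(probs.values()), k=400)
    probs = {k: sample.count(k) / len(sample) for k in true_probs}
    print(f"gen {generation}: rare category at {probs['rare']:.4f}")
```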

And putting aside synthetic data, we may be in for a slog to get superintelligent LLMs. We can improve model architectures, clean datasets, fine-tune, and do all the other little things to wring value out of this already valuable technology. But the limiting reactant in LLMs is data, and data is finite.

LLMs are decent at human values

Until now, I’ve been talking about AI capabilities—reasons why modern AI systems may fizzle out before they are powerful enough to immanentize the eschaton. Now I’m going to shift to AI alignment, which I consider much thornier. AI alignment is, basically, AI systems doing what human beings want them to do.

Without getting into the details of why AI alignment is hard and complicated (and boy are there details), one desideratum is an AI system understanding human values in the first place. There are lots of reasons it’s hard to be sure an empowered AI system has our best interests at heart, but with LLMs, at least, it seems plausible such an AI could make a decent guess at what those interests are.

For example, here’s what Claude 3 Opus, a current leading LLM, said when asked to give a single nuanced paragraph about surge pricing:

Surge pricing, a practice commonly used by ride-sharing services and other businesses, raises complex moral questions. On one hand, it can be seen as a fair and efficient way to match supply with demand, encouraging more drivers to work during peak times and ensuring that customers who most need a ride can get one. It could be argued that surge pricing is a voluntary exchange that benefits both parties. However, critics argue that it disproportionately affects low-income individuals who may have fewer transportation alternatives and be more vulnerable to price exploitation. Surge pricing can make essential services unaffordable for some, exacerbating existing inequalities. There are also concerns about the transparency and predictability of surge pricing algorithms. Ultimately, the morality of surge pricing depends on how one weighs these competing considerations of economic efficiency, fairness, and access in the context of essential services that people rely on.

Is it the greatest moral philosophy I’ve ever read? No. But clearly this is a system that has some idea what ethical considerations are, and how they are deployed. I think that, like data as a bottleneck, this is something that’s easy to take for granted in the modern moment. But taking a step back, it’s kind of remarkable: if anything, modern AI is too concerned with following ethical guidelines, with people all over the internet making fun of it for refusing benign requests on ethical grounds.

Now it’s totally possible to train models with no ethical compunctions, or even models (generally with scaffolding) that actively seek to do harm. Furthermore, it’s dangerous to confuse the role a model seems to play through its text with the actual underlying mechanism. Technically, Claude’s paragraph about surge pricing is the result of a system being told it’s about to read a helpful assistant’s answer to a question about surge pricing, and then that system predicting what comes next. So we shouldn’t read too much into the fact that our chatbots can wax poetic on ethics. But nobody expected chatbots that waxed poetic on ethics six years ago! We were still trying to get AI to kick our asses at games! We’re clearly moving in the right direction.

LLMs being able to produce serviceable ethical analyses (sometimes) is also a good sign if the first superhuman AI systems are a bunch of scaffolding around an LLM core. Because in that case, you could have an “ethics module” where the underlying LLM produces text which then feeds into other parts of the system to help guide behavior. I fully understand that AI safety experts, including the one that lives in my heart, are screaming at the top of their lungs right now. But remember, I’m thinking of the counterfactual here: compared to the sorts of things we were worried about ten years ago, the fact that leading AI products could pass a pop quiz on human morality is a clear positive update.
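
For concreteness, here’s a hypothetical sketch of what such an ethics module could look like. Here ask_llm is a stand-in for whatever model call the scaffold would actually make, and nothing in it describes how any real product works; it only illustrates the idea of LLM text feeding into other parts of a system.

```python
# A hypothetical sketch of an "ethics module" inside LLM scaffolding.
# `ask_llm` is a stand-in for whatever model call the scaffold would actually
# make; nothing here describes how any real product works.
from typing import Callable

def ethics_check(ask_llm: Callable[[str], str], proposed_action: str) -> bool:
    """Ask the LLM to approve or veto an action before the agent runs it."""
    verdict = ask_llm(
        "You are the ethics reviewer for an autonomous system.\n"
        f"Proposed action: {proposed_action}\n"
        "Considering harm, consent, and legality, answer APPROVE or REJECT."
    )
    return verdict.strip().upper().startswith("APPROVE")

def run_agent(ask_llm: Callable[[str], str], planned_actions: list[str]) -> None:
    for action in planned_actions:
        if ethics_check(ask_llm, action):
            print(f"executing: {action}")  # hand off to the rest of the scaffold
        else:
            print(f"vetoed: {action}")

# A stubbed model call so the sketch runs on its own.
def fake_llm(prompt: str) -> str:
    return "REJECT" if "delete" in prompt.lower() else "APPROVE"

run_agent(fake_llm, ["summarize the quarterly report", "delete the audit logs"])
```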

Playing human roles is pretty human

Going back to AlphaGo again, one feature of that era was that AI outputs were commonly called alien. We’d get some system that achieved superhuman performance, but it would succeed in weird and unnerving ways. Strategies that humans had ruled out long ago turned out to dominate, as the machine’s tactical sensibility transcended our understanding.

I can imagine a world where AI continued along something like this paradigm, with game-playing AIs gradually expanding into more and more modalities. Progress would likely be much slower without the gigantic vein of powerful world-modelling data that is predicting human text, but I can imagine, for example, bots that play chess evolving into bots that play Go, evolving into bots with cameras and sensors that play Jenga, and so on, until finally you have bots that engage in goal-directed behavior in the real world in all its generality.

Instead, with LLMs, we show them through our text how the world works, and they express that understanding by impersonating that text. It’s no coincidence that one of the best small LLMs was created for roleplay (including erotic roleplay—take heart, Aella); roleplay is the fundamental thing that LLMs do.

Now, LLMs are still alien minds. They are the first minds we’ve created that can produce human-like text without residing in human bodies, and they arrive at their utterances in very different ways than we do. But trying to think marginally, an alien mental structure that is built specifically to play human roles seems less threatening than an alien mental structure that is built to achieve some other goal, such as scoring a bunch of points or maximizing paperclips.

And So

I think there’s too much meta-level discourse about people’s secret motivations and hypocrisies in AI discussion, so I don’t want to contribute to that. But I am sometimes flummoxed by the reaction of old-school AI safety types to LLMs.

It’s not that there’s nothing to be scared of. LLMs are totally AI, various AI alignment problems do apply to them, and their commercial success has poured tons of gas on the raging fire of AI progress. That’s fair on all counts. But I also find myself thinking, pretty often, that conditional on AI blowing up right now, this path seems pretty good! That LLMs do have a head start when it comes to incorporating human morals, that their mechanism of action is less alien than what came before, and that they’re less prone, relative to self-play agents, to becoming godlike overnight.

Am I personally more or less worried about AI than I was 5 years ago? More. There are a lot of contingent reasons for that, and it’s a story for another time. But I don’t think recent advances are all bad. In fact, when I think about the properties that LLMs have, it seems to me like things could be much worse.