LLMs and computational complexity

Epistemic status: Speculative. I’ve built many large AI systems in my previous HFT career but have never worked with generative AIs. I am leveling up in LLMs by working things out from base principles and observations. All feedback is very welcome.

Tl;dr: An LLM cannot solve computationally hard problems. Its ability to write code is probably its skill of greatest potential. I think this reduces p(near term doom).

An LLM takes the same amount of computation for each generated token, regardless of how hard it is to predict. This limits the complexity of any problem an LLM is trying to solve.

Consider two statements:

  1. “The richest country in North America is the United States of ______”

  2. “The SHA1 of ‘abc123’, iterated 500 times, is _______”

An LLM’s goal is to predict the best token to fill in the blank given its training and the previous context. Completing statement 1 requires knowledge about the world but is computationally trivial. Statement 2 requires a lot of computation. Regardless, the LLM performs the same amount of work for either statement.
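To make the asymmetry concrete, here is roughly what statement 2 asks for, under one natural reading of “iterated 500 times” (the exact convention doesn’t matter for the argument):

```python
import hashlib

# One reading of statement 2: start from "abc123" and re-hash the hex digest 500 times.
value = "abc123"
for _ in range(500):
    value = hashlib.sha1(value.encode()).hexdigest()

print(value)  # 500 sequential hash evaluations, each depending on the last
```

Each round depends on the previous one, so as far as anyone knows there is no shortcut: you simply have to do all 500 hashes.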

It cannot correctly complete computationally hard statements like #2. Period. If it could, that would imply that all problems can be solved in constant time, which is provably (and obviously) false.

Why does this matter? It puts some bounds on what an LLM can do.

Zvi writes:

Eliezer Yudkowsky does not see any of this as remotely plausible. He points out that in order to predict all the next word in all the text on the internet and all similar text, you need to be able to model the processes that are generating that text. And that predicting what you would say is actually a good bit harder than it is to be a being that says things—predicting that someone else would say is tricker and requires more understanding and intelligence than the someone else required to say it, the problem is more constrained.

And then he points out that the internet contains text whose prediction outright requires superhuman capabilities, like figuring out hashes, or predicting the results of scientific experiments, or generating the result of many iterations of refinement. A perfect predictor of the internet would be a superintelligence, it won’t ‘max out’ anywhere near human.

I interpret this the opposite way. Being a perfect predictor of the internet would indeed require a superintelligence, but it cannot be done by an LLM.

How does an LLM compute?

What kinds of problems fall into category 2 (i.e., clearly unanswerable by an LLM)? Let’s dig into how an LLM computes.

For each generated token, it reviews every token already in its context window at least once; call each of those lookups O(1) work. Producing the k-th token therefore costs O(k), so producing n tokens costs O(n^2) in total. Without being too precise about the details, this roughly means it can’t solve problems that are more complex than O(n^2). [1]
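As a toy way to see where the n^2 comes from, count one unit of work per (generated token, context token) pair and ignore the model’s constant per-token overhead:

```python
def attention_ops(n_generated: int, prompt_len: int = 0) -> int:
    """Count one unit of work for every context token each new token attends to."""
    total = 0
    for k in range(1, n_generated + 1):
        total += prompt_len + k  # everything produced so far gets re-read
    return total

print(attention_ops(10))   # 55, i.e. n(n+1)/2
print(attention_ops(100))  # 5050: total work grows quadratically with output length
```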

Consider some examples (all tested with GPT-4):

Addition, O(1)

It’s not always accurate, but it’s usually able to do addition correctly.

Sorting, O(n log n)

I asked it to sort 100 random integers that I’d generated, and it got it right.

My guess is that it doesn’t have the internal machinery to do a quicksort, and was probably doing something more like O(n^2), but either way the problem is within its powers, and it got it right.
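By “something more like O(n^2)” I have in mind a procedure in the spirit of insertion sort, which only ever compares and swaps neighbouring elements; this is a guess about the flavor of computation, not a claim about the model’s internals:

```python
def insertion_sort(xs: list[int]) -> list[int]:
    """Classic O(n^2) sort: repeatedly swap each element backwards into place."""
    out = list(xs)
    for i in range(1, len(out)):
        j = i
        while j > 0 and out[j - 1] > out[j]:
            out[j - 1], out[j] = out[j], out[j - 1]
            j -= 1
    return out

assert insertion_sort([5, 2, 9, 1]) == [1, 2, 5, 9]
```

For 100 integers that is on the order of 10,000 elementary steps, comfortably within a quadratic budget.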

Matrix multiplication, O(n^3)

I generated a 3x3 matrix called A and told it to compute A*A. This was interesting; let’s look at what it did:

[Image: GPT-4 working through A*A step by step]

Pretty cool! It executed the naive matrix multiplication algorithm by using O(n^3) tokens to do it step-by-step. If I ask it to do it without showing its work, it hallucinates an incorrect answer:

[Image: GPT-4’s answer when asked not to show its work]

The result was the right shape, and the elements had approximately the right number of digits. Slight problem: the elements are all incorrect. Whoops. This makes sense though. These numbers are random, so it’s unlikely to have memorized the answer to this specific problem. Absent that shortcut, it didn’t do O(n^3) work, so it could not have generated the correct answer.
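For reference, the step-by-step procedure it walked through above is essentially the textbook triple loop, which does n^3 multiply-adds:

```python
def matmul(A, B):
    """Naive O(n^3) matrix multiplication for square n x n matrices."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # stand-in values; my actual matrix was random
print(matmul(A, A))
```

When it narrates each of those 27 multiply-adds it is spending tokens in place of loop iterations; when it answers in one shot, that work simply never happens.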

Four-sum problem, O(n^4)

[Image: GPT-4’s direct answer to the four-sum problem, without showing its work]

It gave me four numbers that added up to my target, but the fourth wasn’t in my input. It hallucinated it to meet the constraint. Same as with matrix multiplication, it gave an answer that looks vaguely correct, but it didn’t do O(n^4) work so it couldn’t have been right.

But watch what happens when I let it show its work:

[Image: GPT-4 writing and running code to solve the four-sum problem]

Cool! It wrote and executed code to solve the problem, and it got it right.
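Its code isn’t reproduced here, but it doesn’t need to be clever; even the brute-force O(n^4) version below does the work that single-pass token prediction can’t (and there are well-known faster variants):

```python
from itertools import combinations

def four_sum(nums: list[int], target: int):
    """Brute-force O(n^4) search for four entries of nums that sum to target."""
    for quad in combinations(range(len(nums)), 4):
        if sum(nums[i] for i in quad) == target:
            return tuple(nums[i] for i in quad)
    return None

# Hypothetical example; my actual input list and target aren't shown here.
print(four_sum([2, 7, 11, 15, 1, 8, 3], 19))  # (7, 1, 8, 3)
```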

What did I learn?

This matches my expectation that without showing its work it caps out at roughly O(n^2).

It can do better if it’s allowed to run algorithms by “thinking out loud”. It’s really slow, though, and it’s a good way to fill up its context window. The slowness is a real problem: if it outputs ~10 tokens/sec, it will take forever to solve any problem that is actually both big and hard. This is a neat trick, but it doesn’t seem like an important improvement to its capabilities.
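A rough back-of-envelope makes the slowness concrete; the tokens-per-step figure is just an assumption for illustration:

```python
# Illustrative assumptions, not measurements:
tokens_per_sec = 10    # output speed
tokens_per_step = 20   # tokens of narrated work per elementary arithmetic step

n = 100                # e.g. multiplying two 100x100 matrices the naive way
steps = n ** 3         # n^3 multiply-adds
seconds = steps * tokens_per_step / tokens_per_sec
print(f"{seconds / 86400:.0f} days")  # about 23 days of nonstop generation
```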

The most interesting bit is when it writes its own code. The ceiling on the types of problems that you can solve with code is arbitrarily high. This is obviously the most uncertain path as well. It can solve toy problems, but it remains to be seen whether it can write useful code in complex systems. The difference is like acing a technical interview versus being a 10x programmer. (If you think this distinction is obvious, you’ve probably been a hiring manager for a technical role.)

Additional thoughts

Many people have noticed that asking ChatGPT/​Bing to show its work can result in smarter answers. I believe this isn’t just a trick of prompt engineering. It has no internal monologue, so showing its work actually allows it to think harder about a question than it otherwise could, and allows it to solve harder classes of problems.

This makes me more pessimistic about the potential of AutoGPT- or BabyAGI-like approaches. Quick summary: they tell GPT to break problems down into subproblems, and loop over a prioritized task list to create agency. But “show your work” feels mismatched for truly hard problems—executing an algorithm one linguistic token at a time is just so computationally inefficient.

Predictions

This gives us clues about the future direction and limitations of LLMs and other NN-based models.

  • Can an LLM directly “solve the internet”, as described by Zvi and Eliezer above? No way. It can never do enough computation per token to predict the hardest ones.[2]

  • Can an LLM beat Stockfish at chess? As a commenter on that prediction market notes, chess is combinatorial, so an LLM has no chance. I agree. An LLM lacks the ability to search the game tree to sufficient depth to contend with Stockfish. FLOP for FLOP, Stockfish is going to be so much more efficient at searching chess positions that an LLM cannot compete.

  • Can an LLM defeat the best human at chess? If the LLM is allowed to think out loud, it might be able to play in a human-like manner: it could search a few positions per second using an excellent “intuitive” evaluation function. It might be able to explore the game tree to a reasonable depth in this fashion, enough to compete with humans (see the sketch after this list). I still think it’s unlikely, but it’s at least plausible. If it is only allowed to write moves with no intermediate output, then it has no chance.

  • Can an LLM reach a world-dominating level of superintelligence? If it gets good at coding, then possibly. But I believe that dominating the world will require solving some really hard problems. Being a master strategist, predicting markets, doing nanotech (e.g., protein folding), or influencing the interactions of many diverse economic and political agents – these all take a lot of computation. I just don’t see how LLMs have any path towards solving these directly. “Show your work”, prompt engineering, and scaling the model size won’t provide enough computational power to become superhuman in important problems. I expect that the major AI labs are already aware of this, and that the people trying to make ASIs are working on alternatives rather than hoping that the current approaches can scale far enough.

  • How does this affect my estimate of near-term doom? It mildly lowers it. ASI is plausible as a concept, but I don’t think simple deep nets-based methods will get us there. That’s not to say that LLMs won’t impact the world. Facebook and TikTok are examples of plain old AI that influences people, so the potential for disruption and havoc is still there. But an agentic, superhuman AI, where LLMs are doing the heavy lifting? That seems less likely. This isn’t the easiest thing to work around, either. Deep nets have been the dominant approach for the last 10-20 years, so moving past them would be a major advance, not a trivial adaptation.
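As promised in the chess-versus-human bullet above, here is the kind of “think out loud” loop I have in mind: a shallow minimax search where the leaf evaluation is delegated to the model’s intuition. The material counter below is just a runnable stand-in for that hypothetical LLM call, and it assumes the python-chess package:

```python
import chess  # pip install chess (python-chess)

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def llm_evaluate(board: chess.Board) -> float:
    """Stand-in for the LLM's 'intuitive' score of a position.
    Here it is plain material count; the idea is that a strong model,
    prompted to judge the position, would replace this function."""
    score = 0
    for piece in board.piece_map().values():
        value = PIECE_VALUES[piece.piece_type]
        score += value if piece.color == chess.WHITE else -value
    return score

def search(board: chess.Board, depth: int) -> float:
    """Tiny minimax: the tree walk is the 'thinking out loud' part,
    and the leaf evaluation is where the intuition plugs in."""
    if depth == 0 or board.is_game_over():
        return llm_evaluate(board)
    scores = []
    for move in list(board.legal_moves):
        board.push(move)
        scores.append(search(board, depth - 1))
        board.pop()
    return max(scores) if board.turn == chess.WHITE else min(scores)

print(search(chess.Board(), depth=2))  # ~400 leaf evaluations from the start position
```

At a few positions per second, even depth 2 or 3 means a lot of generated text per move, which is why I only rate this approach “plausible” against humans and hopeless against Stockfish.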

Thanks for reading. This is my first LW post, so please be kind. I’ll try to reply to all feedback, and I’d love for you all to poke holes in this.

  1. ^

    There are a lot of annoying details when I try to be precise about computational complexity of an LLM. For example, we’re used to thinking that addition is O(1), because a CPU takes constant time to add two numbers. But an LLM acts on the textual representation of a number, which has length log n. Precisely defining the size of our inputs does weird things to the complexity of known algorithms, so I’m not thinking too hard about it. I don’t think it changes any of my results.

  2. ^

    An LLM could get around this limitation by “going meta”. When generating the next token, it could say “I can’t predict this, let me spawn some sub-agents to think about this”. Those agents would be able to write code, call external APIs, run computationally expensive subroutines, and return an answer. This would be very hard to train, and SGD doesn’t really apply. This would require a major rearchitecture of the system, enough that the result probably wouldn’t be an LLM anymore.