rif a. saurous

Karma: 74

rif a. saurous 9 Apr 2026 3:40 UTC
1 point
0
on: AIs can now often do massive easy-to-verify SWE tasks and I’ve updated towards shorter timelines
I’m wondering if we need to sharpen the “story” of the ESNI distribution to fit your observation that the 50% and 90% horizons are so far apart. From a first read, my mental model was that an ESNI task is “a bunch of easy steps, each verifiable against tests,” with a low per-step failure rate, with failures generally being catchable by tests, leading to retries. Under this model, I’d expect subsets of the task to be approximately independent, so I’d expect the 50% time horizon to be roughly log(0.5)/log(0.9) ≈ 6.58 times the 90% time horizon. You are claiming a much larger ratio (at least years to days, possibly decades to hours), so something else is going on here.
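A minimal sketch of the independence model above, assuming a constant failure hazard per unit of task time, so that success probability decays as exp(-hazard * t). The function names and the hazard values are illustrative assumptions, not taken from the post. Under this assumption the ratio of the 50% horizon to the 90% horizon is the same for any hazard:

```python
import math


def horizon(success_prob: float, hazard_per_hour: float) -> float:
    """Task length (hours) at which success probability equals success_prob,
    assuming p(t) = exp(-hazard_per_hour * t)."""
    return -math.log(success_prob) / hazard_per_hour


def horizon_ratio(hazard_per_hour: float) -> float:
    """Ratio of the 50% horizon to the 90% horizon under a constant hazard."""
    return horizon(0.5, hazard_per_hour) / horizon(0.9, hazard_per_hour)


# The hazard cancels, leaving log(0.5) / log(0.9).
print(round(horizon_ratio(0.01), 2))  # 6.58
```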

The simplest underlying model I can come up with is that in your actual ESNI distribution, any given task has some probability of having a “trap”: something that requires ideation or taste (and therefore maybe makes it not really an ESNI task per the original definition), which causes the LLM to fail. Perhaps the probability that a supposedly ESNI task contains a trap grows very slowly (e.g., logarithmically) in the time it would take a human to do the task?
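To make the “trap” story concrete, here is a toy sketch in which the LLM fails exactly when a task contains a trap, and the trap probability grows linearly in log10 of the human task time. The `base` and `slope` values are made up purely for illustration, as are the function names. Even with these arbitrary numbers, the 50% and 90% horizons end up many orders of magnitude apart:

```python
import math


def trap_probability(hours: float, base: float = 0.10, slope: float = 0.05) -> float:
    """Hypothetical P(trap): linear in log10(hours), clipped to [0, 1]."""
    return min(1.0, max(0.0, base + slope * math.log10(hours)))


def success_probability(hours: float, base: float = 0.10, slope: float = 0.05) -> float:
    """Assumes the LLM succeeds if and only if the task has no trap."""
    return 1.0 - trap_probability(hours, base, slope)


def horizon(success_prob: float, base: float = 0.10, slope: float = 0.05) -> float:
    """Task length (hours) at which success_probability equals success_prob."""
    return 10 ** ((1.0 - success_prob - base) / slope)


# With these made-up parameters: roughly 1 hour at 90%, roughly 1e8 hours at 50%.
print(horizon(0.9), horizon(0.5))
```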

rif a. saurous 8 Dec 2025 6:19 UTC
5 points
1
in reply to: Cole Wyeth’s comment on: Embedded Universal Predictive Intelligence
Author here. We were heavily inspired by multiple things, including Demski and Garrabrant, the 1990s work of Kalai and Lehrer, empirical work in our group inspired by neuroscience pointing towards systems that predict their own actions, and the earlier work on reflective oracles by Leike. We were not aware of @Cole Wyeth et al.’s excellent 2025 paper, which puts the reflective oracle work on firmer theoretical footing, as our work was (largely but not entirely) done before this paper appeared.

rif a. saurous 9 Oct 2025 11:03 UTC
1 point
0
on: What, if not agency?
“Formalization of informal ideas will not be the hard part. AI will enable not just automated proofs, not just automated conjectures, but also automated formalization of informal intuitions.”

This seems both surprising and extremely crux-y to me. I’m curious if you can offer pointers (beyond “read all of Sahil’s work”) to the best arguments for this.

rif a. saurous 14 Jul 2024 16:29 UTC
28 points
14
in reply to: sunwillrise’s comment on: A simple case for extreme inner misalignment
I’m generally confused by the argument here.
As we examine successively more intelligent agents and their representations, the representation of any particular thing will perhaps be more compressed, but also and importantly, more intelligent agents represent things that less intelligent agents don’t represent at all. I’m more intelligent than a mouse, but I wouldn’t say I have a more compressed representation of differential calculus than a mouse does. Terry Tao is likely more intelligent than I am, likely has a more compressed representation of differential calculus than I do, but he also has representations of a bunch of other mathematics I can’t represent at all, so the overall complexity of his representations in total is plausibly higher.

Why wouldn’t the same thing happen for goals? I’m perfectly willing to say I’m smarter than a dog and a dog is smarter than a paramecium, but it sure seems like the dog’s goals are more complex than the paramecium’s, and mine are more complex than the dog’s. Any given fixed goal might have a more compressed representation in the more intelligent animal (I’m not sure it does, but that’s the premise so let’s accept it), but the set of things being represented is also increasing in complexity across organisms. Driving the point home, Terry Tao seems to have goals of proving theorems I don’t even understand the statement of, and these seem like complex goals to me.

So overall I’m not following from the premises to the conclusions. I wish I could make this sharper. Help welcome.
What links here?
- sunwillrise's comment on A more systematic case for inner misalignment by Richard_Ngo (20 Jul 2024 15:09 UTC; 11 points)

rif a. saurous 23 Jun 2024 15:38 UTC
14 points
4
in reply to: Algon’s comment on: My AI Model Delta Compared To Christiano
I feel like a lot of the difficulty here is a punning of the word “problem.”
In complexity theory, when we talk about “problems”, we generally refer to a formal mathematical question that can be posed as a computational task. Maybe in these kinds of discussions we should start calling these problems_C (for “complexity”). There are plenty of problems_C that are (almost definitely) not in NP, like #SAT (“count the number of satisfying assignments of this Boolean formula”), and it’s generally believed that verification is hard for these problems. A problem_C like #SAT that is (believed to be) in #P but not NP will often have a short easy-to-understand algorithm that will be very slow (“try every assignment and count up the ones that satisfy the formula”).
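The “try every assignment and count” algorithm for #SAT can be sketched in a few lines. This is only an illustration: the clause encoding (signed integers, where `k` means variable k is true and `-k` means it is false) and the function name are my choices. The running time is exponential in the number of variables:

```python
from itertools import product


def count_satisfying(clauses, num_vars):
    """Count assignments satisfying a CNF formula by brute force.

    clauses: list of clauses, each a list of nonzero ints (k or -k).
    """
    count = 0
    for assignment in product([False, True], repeat=num_vars):
        # A clause holds if at least one of its literals matches the assignment.
        if all(
            any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
            for clause in clauses
        ):
            count += 1
    return count


# (x1 or x2) and (not x1 or x3)
print(count_satisfying([[1, 2], [-1, 3]], num_vars=3))  # 4
```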
On the other hand, “suppose I am shopping for a new fridge, and I want to know which option is best for me (according to my own long-term values)” is a very different sort of beast. I agree it’s not in NP in that I can’t easily verify a solution, but the issue is that it’s not a problem_C, rather than it being a problem_C that’s (almost definitely) not in NP. With #SAT, I can easily describe how to solve the task using exponential amounts of compute; for “choose a refrigerator”, I can’t describe any computational process that will solve it at all. If I could (for instance, if I could write down an evaluation function f : fridge → R, where f was computable in P), then the problem would be not only in NP but in P (evaluate each fridge, pick the best one).
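If such an evaluation function did exist, “evaluate each fridge, pick the best one” would be a single linear scan. In the sketch below, the fridge records and `toy_score` are made-up stand-ins. Writing down the real scoring function is exactly the part that cannot be done:

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def pick_best(options: Iterable[T], score: Callable[[T], float]) -> T:
    """Evaluate each option once and return the highest-scoring one."""
    return max(options, key=score)


# Hypothetical data and a hypothetical stand-in for "my long-term values".
fridges = [
    {"name": "A", "price": 900, "liters": 400},
    {"name": "B", "price": 700, "liters": 350},
    {"name": "C", "price": 1200, "liters": 450},
]


def toy_score(fridge: dict) -> float:
    return fridge["liters"] - 0.3 * fridge["price"]


print(pick_best(fridges, toy_score)["name"])  # B
```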
So it’s not wrong to say that “choose a refrigerator” is not (known to be) in NP, but it’s important to foreground that that’s because the task isn’t written as a problem_C, rather than because it needs a lot of compute. So discussions about complexity classes and relative ease of generation and verification seem not especially relevant.
I don’t think I’m saying anything non-obvious, but I also think I’m seeing a lot of discussions that don’t seem to fully internalize this?

rif a. saurous 10 May 2023 9:17 UTC
5 points
0
on: CHAT Diplomacy
I feel like this piece is pretty expansive in the specific claims it makes relative to the references given.
- I don’t think the small, specific trial in [3] supports the general claim that “Current LLMs reduce the human labor and cognitive costs of programming by about 2x.”
- I don’t think [10] says anything substantive about the claim “Fine tuning pushes LLMs to superhuman expertise in well-defined fields that use machine readable data sets.”
- I don’t think [11] strongly supports a general claim that (today’s) LLMs can “Recognize complex patterns”, and [12] feels like very weak evidence for general claims that today’s LLMs can “Recursive troubleshoot to solve problems”.
The above are the result of spot-checking and are not meant to be exhaustive.

rif a. saurous 22 Feb 2023 1:57 UTC
4 points
5
in reply to: JBlack’s comment on: The idea that ChatGPT is simply “predicting” the next word is, at best, misleading
Thank you, this is helpful.
I think the realization I’m coming to is that folks on this thread have a shared understanding of the basic mechanics (we seem to be agreed on what computations are occurring, we don’t seem to be making any different predictions), and we are unsure about interpretation. Do you agree?
For myself, I continue to maintain that viewing the system as a next-word sampler is not misleading, and that saying it has a “plan” is misleading, but I try to err strongly on the side of not anthropomorphizing / not taking an intentional stance (I also try to avoid saying the system “knows” or “understands” anything). I do agree that the system’s activation cache contains a lot of information that collectively biases the next word predictor towards producing the output it produces; I see how someone might reasonably call that a “plan” although I choose not to.

rif a. saurous 22 Feb 2023 0:46 UTC
4 points
3
in reply to: JBlack’s comment on: The idea that ChatGPT is simply “predicting” the next word is, at best, misleading
Suppose we modify the thought experiment so that we ask the LLM to simulate both sides of the “pick a number between 1 and 100” / “ask yes/no questions about the number” exchange. Now there is no new variable input from the user, but the yes/no questions still depend on random sampling. Would you now say that the LLM has chosen a number immediately after it prints out “Ready”?

rif a. saurous 22 Feb 2023 0:07 UTC
1 point
0
in reply to: Bill Benzon’s comment on: The idea that ChatGPT is simply “predicting” the next word is, at best, misleading
Then wouldn’t you believe that in the case of my thought experiment, the number is also smeared through the parameter weights? Or maybe it’s merely the intent to pick a number later that’s smeared through the parameter weights?

rif a. saurous 21 Feb 2023 23:09 UTC
2 points
1
in reply to: Bill Benzon’s comment on: The idea that ChatGPT is simply “predicting” the next word is, at best, misleading
But if I am right and ChatGPT isn’t choosing a number before it says “Ready,” why do you think that ChatGPT “has a plan?” Is the story situation crucially different in some way?

rif a. saurous 21 Feb 2023 21:58 UTC
16 points
17
on: The idea that ChatGPT is simply “predicting” the next word is, at best, misleading
@Bill Benzon: A thought experiment. Suppose you say to ChatGPT “Think of a number between 1 and 100, but don’t tell me what it is. When you’ve done so, say ‘Ready’ and nothing else. After that, I will ask you yes / no questions about the number, which you will answer truthfully.”
After ChatGPT says “Ready”, do you believe a number has been chosen? If so, do you also believe that whatever “yes / no” sequence of questions you ask, they will always be answered consistently with that choice? Put differently, you do not believe that the particular choice of questions you ask can influence what number was chosen?
FWIW, I believe that no number gets chosen when ChatGPT says “Ready,” that the number gets chosen during the questions (hopefully consistently) and that, starting ChatGPT from the same random seed and otherwise assuming deterministic execution, different sequences of questions or different temperatures or different random modifications to the “post-Ready seed” (this is vague but I assume comprehensible) could lead to different “chosen numbers.”
(The experiment is nontrivial to run since it requires running your LLM multiple times with the same seed or otherwise completely copying the state after the LLM replies “Ready.”)
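Here is a toy version of the experiment, with a stand-in “answerer” rather than a real LLM. The stand-in is my own construction: it commits to nothing at “Ready”, and at each question it samples a number consistent with its previous answers and replies about that one. Forking from the same seed with the same questions in a different order shows what I mean by the number being chosen during the questions:

```python
import random


def bit_question(k):
    """Yes/no question: 'is bit k of (number - 1) set?'"""
    return lambda n: (((n - 1) >> k) & 1) == 1


def play(questions, seed):
    """Answer questions consistently without ever fixing a number up front."""
    rng = random.Random(seed)  # identical "post-Ready" state for every fork
    candidates = set(range(1, 101))
    for question in questions:
        imagined = rng.choice(sorted(candidates))  # sampled per question
        reply = question(imagined)
        # Keep only the numbers that agree with every answer given so far.
        candidates = {n for n in candidates if question(n) == reply}
    return candidates


low_to_high = [bit_question(k) for k in range(7)]
high_to_low = list(reversed(low_to_high))

# Same seed, same questions, different order: compare the implied numbers.
for seed in range(5):
    print(seed, play(low_to_high, seed), play(high_to_low, seed))
```

Every run is internally consistent and ends on a single number, but which number it is can depend on the order of the questions.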
What links here?
- Planning in LLMs: Insights from AlphaGo by jco (4 Dec 2023 18:48 UTC; 8 points)

rif a. saurous 21 Feb 2023 2:16 UTC
5 points
4
on: The idea that ChatGPT is simply “predicting” the next word is, at best, misleading
I’m not following the argument here.
“I maintain, for example, that when ChatGPT begins a story with the words “Once upon a time,” which it does fairly often, that it “knows” where it is going and that its choice of words is conditioned on that “knowledge” as well as upon the prior words in the stream. It has invoked a ‘story telling procedure’ and that procedure conditions its word choice.”
It feels like you’re asserting this, but I don’t see why it’s true and don’t think it is. I fully agree that it feels like it ought to be true: it is in some sense still shocking to me that a next-token predictor trained on trillions of tokens is so good at responding to such a wide variety of prompts. But if you look at the mechanics of how a transformer works, as @tgb and @Multicore describe, it sure looks like it’s doing next-token prediction, and that there isn’t a global plan. There is literally no latent state: we can always generate forward from any previous set of tokens, whether the LLM made them or not.
But I’d like to better understand.
You seem to be aware of Murray Shanahan’s “Talking About Large Language Models” paper. The commenter you quote, Nabeel Q, agrees with you, but offers no actual evidence; I don’t think analogies to humans are helpful here since LLMs work very differently from humans in this particular regard. I agree we should avoid confusing the training procedure with the model; however, what the model literally does is look at its context and predict a next token.
I’ll also note that your central paragraph seems somewhat reliant on anthropomorphisms like “it ‘knows’ where it is going”. Can you translate from anthropomorphic phrasings into a computational claim? Can we think of some experiment that might help us get at this better?
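As a starting point for the kind of computational claim I have in mind, here is a toy illustration of “no latent state”. It uses a tiny bigram sampler as a stand-in, with a vocabulary and table I made up. Unlike a transformer it looks only at the last token, not the whole context, but it shares the property I care about: the next token is sampled from a function of the tokens so far, and nothing else is carried over.

```python
import random

# Made-up bigram table: each token maps to its possible successors.
BIGRAMS = {
    "once": ["upon"],
    "upon": ["a"],
    "a": ["time", "dragon", "king"],
    "time": ["there", "a"],
    "there": ["was"],
    "was": ["a"],
    "dragon": ["was"],
    "king": ["was"],
}


def next_token(tokens, rng):
    """Sample the next token given only the tokens so far."""
    return rng.choice(BIGRAMS[tokens[-1]])


def continue_from(tokens, steps, seed):
    """Generate forward from any token list, whoever wrote it."""
    rng = random.Random(seed)
    out = list(tokens)
    for _ in range(steps):
        out.append(next_token(out, rng))
    return out


model_made = continue_from(["once"], 4, seed=1)
hand_typed = list(model_made)  # same tokens, as if typed in by a person

# The continuation cannot depend on where the prefix came from.
print(continue_from(model_made, 5, seed=7) == continue_from(hand_typed, 5, seed=7))  # True
```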