I don’t think (A) is inconsistent. If one believes that “freeing other people from work” is a very good thing, then the value of doing so is worth the personal sacrifice/effort/labor (even if there is no personal financial motive).
simulus
This is a great write-up and I found myself reading a few of your previous posts too. However, all of this leaves me with a disappointingly mundane vision of the future, one where technology doesn’t change much.
I would be curious to hear about some tech and ideas that you’re optimistic about or that you think *will* deliver.
The arguments here match a lot of my own intuitions. I want to add a few things:
1. Even benchmarks that supposedly measure sample efficiency on abstract problems fall victim to human priors: ARC Is a Vision Problem! - https://arxiv.org/abs/2511.14761
2. Human learning is largely performed on-policy, while pretraining is primarily off-policy. This means that humans can seek out the information that they specifically lack, and receive feedback to address their specific mistakes. I predict that the shift towards RL (and more recently on-policy distillation) is the first phase of a broader transition towards primarily on-policy training pipelines that will bring gains in both sample and parameter efficiency.
I like this idea.
Models with access to a python interpreter might be able to solve it trivially by calling a random function.
I wonder if there are examples of RL training on (more useful) tasks like this where reward is predicated on the distribution of the model’s outputs over multiple samples.
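As a rough illustration of what a distribution-level reward could look like, here is a minimal sketch. It assumes the task has a known target distribution over discrete outputs, and it scores a whole group of samples from the same prompt at once rather than scoring each sample individually. The function name `distribution_reward` and the coin-flip target are hypothetical choices for illustration, not taken from any existing training setup.

```python
from collections import Counter


def distribution_reward(samples, target):
    """Score a group of samples by how closely their empirical
    distribution matches `target` (a dict mapping output -> probability).

    Returns 1 minus the total variation distance, so the reward lies in
    [0, 1] and equals 1 only when the empirical frequencies match the
    target exactly.
    """
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    n = len(samples)
    # Include outputs the model produced that are outside the target support.
    support = set(counts) | set(target)
    tv = 0.5 * sum(abs(counts.get(x, 0) / n - target.get(x, 0.0)) for x in support)
    return 1.0 - tv


# Hypothetical task: "flip a fair coin".
target = {"heads": 0.5, "tails": 0.5}
balanced = distribution_reward(["heads", "tails", "heads", "tails"], target)
collapsed = distribution_reward(["heads"] * 4, target)
off_task = distribution_reward(["banana"] * 4, target)
```

Because the reward is a property of the group, every sample in the group would receive the same score, which is one reason credit assignment in a setup like this seems harder than with per-sample rewards.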
I think part of this is due to the choice of tested models: Sonnet, Sonnet, Opus, GPT-5, Sonnet. If the tested models were consistently flagship offerings (Opus, GPT-whatever), then the trend would be clearer.
Furthermore, the most recent model that they tested was Sonnet 4.5. Most people agree that there was a noticeable jump in quality around Opus 4.5/4.6, and that is missing from the graph. Opus 4.5/4.6 also seemed to kick off a trend of labs focusing heavily on agentic coding, which would change the slope of the trend.
In section 1 of your findings, the “Expected” column of the table is misleading. It assumes that probabilities within each bucket are uniformly distributed (or more generally, symmetrically distributed around the center of the bucket’s range).
A more faithful Expected value would be the average over each market’s probability within a given bucket. This is the true rate that perfectly calibrated markets would resolve at.
I suspect that this is the cause of the significant discrepancy between the Expected and Actual resolution rates near the 0% and 100% extremes.
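To make the difference concrete, here is a minimal sketch that computes both versions of the Expected column side by side. The data layout (a list of `(probability, resolved_yes)` pairs), the function name `calibration_table`, and the example markets are all assumptions made for illustration; they are not taken from the post's dataset.

```python
from collections import defaultdict


def calibration_table(markets, n_buckets=10):
    """Build one row per non-empty probability bucket.

    `markets` is an iterable of (probability, resolved_yes) pairs.
    Each row reports two candidate "Expected" values:
      - midpoint:  the center of the bucket's range
      - mean_prob: the average probability of the markets in the bucket
    along with the actual resolution rate.
    """
    buckets = defaultdict(list)
    for p, resolved in markets:
        # Clamp so that p == 1.0 falls into the last bucket.
        idx = min(int(p * n_buckets), n_buckets - 1)
        buckets[idx].append((p, resolved))

    rows = []
    for idx in sorted(buckets):
        entries = buckets[idx]
        n = len(entries)
        rows.append({
            "bucket": idx,
            "n": n,
            "midpoint": (idx + 0.5) / n_buckets,
            "mean_prob": sum(p for p, _ in entries) / n,
            "actual": sum(1 for _, r in entries if r) / n,
        })
    return rows


# Hypothetical markets clustered near 0%, as is common at the extremes.
markets = [(0.01, False), (0.02, False), (0.01, False), (0.09, True)]
rows = calibration_table(markets)
```

In this made-up example the 0-10% bucket has a midpoint of 5% but a mean market probability of 3.25%, so the midpoint overstates what calibrated markets would be expected to resolve at.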
Can you speak more on what the experimental results would have looked like if theories like VC dimension had been correct?
My point is that since so much research goes into building models that maximize advertising results, an ad click maximizer is the most likely case of blatant outer alignment failure. Ad clicks will be its end goal.
The paperclip maximizer will be an ad click maximizer.
Clarification: Since so much research goes into building models that maximize advertising results (even being the reason Google started building TPUs), an ad click maximizer is the most likely case of blatant outer alignment failure from a poorly chosen objective.
This knocks on the door of a principle that I have been playing with for a while: a good continual learning and/or sequence modelling algorithm should converge to some known behavior. Architectures like attention have undefined behavior once the end of their training context length is reached. SGD, on the other hand, can be run indefinitely, because we know that it will eventually converge to an interpolation of the data.
Someone I know claims to have found a way to directly pretrain neuralese models: https://aklein.bearblog.dev/zebra/
I’ve seen their prototype, and it definitely works (as far as producing reasonable text outputs while making non-trivial use of >100 continuous latents), but whether it actually amounts to anything remains to be seen.
This appears true at the academic scale, but not at the frontier scale where RL compute consumption is much higher (sometimes even higher than pretraining).
As a counter-example to your evidence, when Nvidia scaled up their RL they found:
Furthermore, Nemotron-Research-Reasoning-Qwen-1.5B offers surprising new insights —RL can indeed discover genuinely new solution pathways entirely absent in base models, when given sufficient training time and applied to novel reasoning tasks. Through comprehensive analysis, we show that our model generates novel insights and performs exceptionally well on tasks with increasingly difficult and out-of-domain tasks, suggesting a genuine expansion of reasoning capabilities beyond its initial training. Most strikingly, we identify many tasks where the base model fails to produce any correct solutions regardless of the amount of sampling, while our RL-trained model achieves 100% pass rates (Figure 4).
Steerling-8B: The First Inherently Interpretable Language Model
This is probably worth a deeper discussion, but Guide Labs is claiming that their new model is “the first interpretable model that can trace any token it generates to its input context, concepts a human can understand, and its training data”.
Reading the blog, Steerling is basically just a discrete diffusion model where the final hidden states are passed through a sparse autoencoder before the LM head.
They also appear to apply a loss that aligns the SAE’s activations with labelled concepts (correct me if I’m wrong). However, this seems like an obvious example of The Most Forbidden Technique, and could make the model appear interpretable without the attributed concepts actually having a causal effect on the model’s decisions.
Can we get some input from interpretability folks? I’m obviously bearish.
Link to release post: https://www.guidelabs.ai/post/steerling-8b-base-model-release/
...
I wonder if this is an artifact from the training data.
There are probably more edge-case bugs in published code (or even intermediate commits) than there are obvious bugs.
A recent paper probed LLMs and located both value features (representing the expected reward) and “dopamine” features (representing the reward prediction error). These features are embedded in sparse sets of neurons, and were found to be critical for reasoning performance.
Could these findings have any implications for model welfare?
If a model had mechanisms for “feeling good and bad”, I imagine they would look similar to this.
The paper in question: https://arxiv.org/abs/2602.00986
Yes, I am referring to the lack of learning-to-learn data during initial training.
Your point that humans have built-in mechanisms for continual learning is similar to what I’m saying about inductive biases: if we don’t have the data to train continual learning into models, we need to build it into the architecture. However, I think the ‘data’ from which humans learn during development (on-policy interactions with the environment with constant feedback and something like rewards) is much more aligned to continual learning than books and pdfs.
I believe that the biggest bottleneck for continual learning is data.
First, I am defining continual learning (CL) as extreme long-context modelling with particularly good in-context supervised and reinforcement learning. You seem to have a similar implied definition, given that Titans and Hope are technically sequence modelling architectures more than classical continual learning architectures.
Titans might already be capable of crudely performing CL as I defined it, but we wouldn’t know. The reason is that we haven’t trained it on data that looks like CL. The long-context data that we currently use looks like pdfs, books, and synthetically concatenated snippets. If you saw a model producing any of that data, you would not consider it to be CL. The data doesn’t contain failures, feedback, and an entity learning from them. If we just trained the architecture on (currently non-existent to the public) data that looks like CL, then I think we would have CL.
The obvious solution to this problem is to collect better data. This would be expensive, but the big players could probably afford it.
Another solution that I see is to bake a strong inductive bias into the architecture. If CL is an out-of-distribution behavior relative to the training data, then the best option is an architecture that “wants” to exhibit CL-like behavior. Taken to the extreme, such an architecture would exhibit CL-like behavior without any prior training at all. One example would be an “architecture” that just fine-tunes a sliding-window transformer on the stream of context. Of the current weight-based architectures, I think E2E-TTT is the closest to this vision, since it is essentially meta-learned fine-tuning.
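To illustrate the "fine-tune on the stream" extreme, here is a toy sketch. It swaps the sliding-window transformer for a linear next-value predictor so that it stays small and runnable; the class name `StreamFineTuner`, the window size, and the learning rate are all hypothetical. The point is only the loop structure: predict from the current window, observe the true value, take one gradient step, slide the window.

```python
from collections import deque


class StreamFineTuner:
    """Toy stand-in for a sliding-window model that is fine-tuned online.

    A linear predictor over the last `window` values is updated with one
    SGD step on squared error after every new observation, with no prior
    training at all.
    """

    def __init__(self, window=4, lr=0.05):
        self.window = window
        self.lr = lr
        self.weights = [0.0] * window
        self.bias = 0.0
        self.context = deque(maxlen=window)  # oldest value first

    def predict(self):
        """Predict the next value, or None until the window is full."""
        if len(self.context) < self.window:
            return None
        return self.bias + sum(w * x for w, x in zip(self.weights, self.context))

    def observe(self, value):
        """Consume one value from the stream; return the prediction error."""
        pred = self.predict()
        error = None
        if pred is not None:
            error = pred - value
            # Gradient of 0.5 * error**2 with respect to weights and bias.
            for i, x in enumerate(self.context):
                self.weights[i] -= self.lr * error * x
            self.bias -= self.lr * error
        self.context.append(value)  # slide the window forward
        return error


# Hypothetical stream with a simple repeating pattern.
stream = [1.0, -1.0] * 102
model = StreamFineTuner(window=4, lr=0.05)
errors = [e for e in (model.observe(v) for v in stream) if e is not None]
```

On this stream the prediction error shrinks as more of the sequence is consumed, even though the model starts from zero weights. That is the sense in which such an "architecture" exhibits learning-like behavior without any pretraining, though a linear toy obviously says nothing about how well the idea scales.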
The final solution is to use reinforcement learning instead of pretraining to get CL abilities. If getting high rewards necessitates CL, then we would expect RL to eventually bake in continual learning. The problem is that RL is just so costly and inefficient, and we lack open-ended environments with unhackable rewards.
Anecdotally, as someone who works on non-AGI-targeting AI research, I find pop-sci articles on AI research to be horribly misrepresentative.
A paper that introduces a new algorithm that guides drones around a simulator by creating sub-tasks might be presented as “AI researchers create a new kind of digital brain—and it has its own goals”. That’s obviously a click-bait headline, but the article itself usually does little to clean things up.
However, I would imagine that AI is currently among the worst fields for this kind of thing due to manufactured hype, culture wars, and the age-old anthropomorphization of AI algorithms.
ProRL is a pretty hard counter to the “RLVR only elicits existing capabilities” argument:
Conflicting results from other papers probably come from:
1. Not enough scale (RL is super inefficient, so it takes a long time to really work)
2. Poor training methods (continued improvement requires entropy regulation to prevent diversity collapse and stalled exploration)
Another counter-argument is classic RL experiments like AlphaZero and Atari gameplay. These obviously elicit new capabilities, since the models reach superhuman performance after starting with no knowledge at all. In theory, there’s no reason that RLVR couldn’t teach untrained LLMs to do math from scratch (but in practice sparse rewards mean that the sun would burn out before the model figures it out).