simulus

Karma: 129

simulus 9 Jun 2026 20:08 UTC
8 points
1
on: Some Interesting Papers on RLVR
ProRL is a pretty hard counter to the “RLVR only elicits existing capabilities” argument:
Furthermore, Nemotron-Research-Reasoning-Qwen-1.5B offers surprising new insights —RL can
indeed discover genuinely new solution pathways entirely absent in base models, when given sufficient
training time and applied to novel reasoning tasks. Through comprehensive analysis, we show that
our model generates novel insights and performs exceptionally well on tasks with increasingly
difficult and out-of-domain tasks, suggesting a genuine expansion of reasoning capabilities beyond
its initial training. Most strikingly, we identify many tasks where the base model fails to produce any
correct solutions regardless of the amount of sampling, while our RL-trained model achieves 100%
pass rates (Figure 4).
Conflicting results from other papers probably come from:
1. Not enough scale (RL is super inefficient, so it takes a long time to really work)
2. Poor training methods (continued improvement requires entropy regulation to prevent diversity collapse and stalled exploration)
Another counter-argument is classic RL experiments like AlphaZero and Atari gameplay. These obviously elicit new capabilities, since the models reach superhuman performance after starting with no knowledge at all. In theory, there’s no reason that RLVR couldn’t teach untrained LLMs to do math from scratch (but in practice sparse rewards mean that the sun would burn out before the model figures it out).

simulus 8 Jun 2026 23:13 UTC
4 points
2
in reply to: Tim H’s comment on: Tim H’s Shortform
I don’t think (A) is inconsistent. If one believes that “freeing other people from work” is a very good thing, then the value of doing so is worth the personal sacrifice/effort/labor (even if there is no personal financial motive).

simulus 2 Jun 2026 1:33 UTC
4 points
0
on: Tech I’m skeptical of and why
This is a great write-up and I found myself reading a few of your previous posts too. However, all of this leaves me with a disappointingly mundane vision of the future, one where technology doesn’t change much.

I would be curious to hear about some tech and ideas that you’re optimistic about or that you think *will* deliver.

simulus 1 Jun 2026 22:04 UTC
14 points
7
on: Dissolving the Deep Learning Sample Efficiency Gap
The arguments here match a lot of my own intuitions. I want to add a few things:

1. Even benchmarks that supposedly measure sample efficiency on abstract problems fall victim to human priors: ARC Is a Vision Problem! - https://arxiv.org/abs/2511.14761

2. Human learning is largely performed on-policy, while pretraining is primarily off-policy. This means that humans can seek out the information that they specifically lack, and receive feedback to address their specific mistakes. I predict that the shift towards RL (and more recently on-policy distillation) is the first phase of a broader transition towards primarily on-policy training pipelines that will bring gains in both sample and parameter efficiency.

simulus 8 May 2026 5:15 UTC
1 point
0
in reply to: RobinHa’s comment on: RobinHa’s Shortform
I like this idea.
Models with access to a Python interpreter might be able to solve it trivially by calling a random number generator.
I wonder if there are examples of RL training on (more useful) tasks like this where reward is predicated on the distribution of the model’s outputs over multiple samples.
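To make the idea of a distribution-level reward concrete, here is a minimal sketch. It assumes the task is to sample uniformly from a known finite set, and it scores a whole group of samples at once by the negative KL divergence between the empirical distribution and the uniform one. The function name `distribution_reward` and the choice of KL are my own assumptions for illustration, not something taken from the original shortform.

```python
import math
from collections import Counter

def distribution_reward(samples, support):
    """Score a group of samples by -KL(empirical || uniform) over `support`.

    The best possible score is 0 (perfectly uniform); collapsing onto a
    single value gives -log(len(support)).
    """
    if any(s not in support for s in samples):
        raise ValueError("every sample must come from the support")
    counts = Counter(samples)
    n = len(samples)
    uniform_p = 1.0 / len(support)
    kl = 0.0
    for value in support:
        p = counts.get(value, 0) / n
        if p > 0:  # terms with p == 0 contribute nothing to the KL sum
            kl += p * math.log(p / uniform_p)
    return -kl

# Hypothetical groups of model outputs for "pick a number from 1 to 4".
spread_out = distribution_reward([1, 2, 3, 4], [1, 2, 3, 4])
collapsed = distribution_reward([1, 1, 1, 1], [1, 2, 3, 4])
```

The reward here is a property of the group, so every sample in the group would share the same score. No single output can be judged good or bad on its own.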

simulus 29 Apr 2026 7:06 UTC
11 points
0
on: Are LLMs not getting better?
I think part of this is due to the choice of tested models: Sonnet, Sonnet, Opus, GPT-5, Sonnet. If the tested models were consistently flagship offerings (Opus, GPT-whatever), then the trend would be clearer.
Furthermore, the most recent model that they tested was Sonnet 4.5. Most people agree that there was a noticeable jump in quality around Opus 4.5/4.6, and that is missing from the graph. Opus 4.5/4.6 also seemed to kick off a trend of labs focusing heavily on agentic coding, which would change the slope of the trend.

simulus 29 Apr 2026 5:38 UTC
1 point
0
on: Empirical calibration of Polymarket: Analysis of 7,661 resolved binary markets
In section 1 of your findings, the “Expected” column of the table is misleading. It assumes that probabilities within each bucket are uniformly distributed (or more generally, symmetrically distributed around the center of the bucket’s range).
A more faithful Expected value would be the mean of the individual market probabilities that fall within each bucket. This is the rate at which perfectly calibrated markets in that bucket would resolve.
I suspect that this is the cause of the significant discrepancy between the Expected and Actual resolution rates near the 0% and 100% extremes.
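A small sketch of the difference, using made-up prices rather than anything from the post's dataset. The function name `expected_rates` and the ten-bucket layout are assumptions for illustration.

```python
def expected_rates(prices, n_buckets=10):
    """Group prices into equal-width buckets on [0, 1].

    Returns {bucket_index: (bucket_midpoint, mean_price_in_bucket)}.
    The midpoint is the "Expected" value that assumes a symmetric spread;
    the mean price is what perfectly calibrated markets would resolve at.
    """
    buckets = {}
    for p in prices:
        idx = min(int(p * n_buckets), n_buckets - 1)  # p == 1.0 goes in the last bucket
        buckets.setdefault(idx, []).append(p)
    return {
        idx: ((idx + 0.5) / n_buckets, sum(ps) / len(ps))
        for idx, ps in buckets.items()
    }

# Hypothetical markets in the 0-10% bucket, clustered near 0 as extreme buckets tend to be.
midpoint, mean_price = expected_rates([0.005, 0.01, 0.02, 0.09])[0]
```

With these invented prices the midpoint is 5% while the mean price is about 3.1%, so a perfectly calibrated set of markets would look "underconfident" against the midpoint even though nothing is wrong.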

simulus 26 Apr 2026 7:27 UTC
7 points
0
on: The paper that killed deep learning theory
Can you speak more on what the experimental results would have looked like if theories like VC dimension had been correct?

simulus 30 Mar 2026 17:49 UTC
1 point
−1
in reply to: Dagon’s comment on: ghost-in-the-weights’s Shortform
My point is that since so much research goes into building models that maximize advertising results, an ad click maximizer is the most likely case of blatant outer alignment failure. Ad clicks will be its end goal.

simulus 30 Mar 2026 8:07 UTC
−1 points
−3
on: ghost-in-the-weights’s Shortform
The paperclip maximizer will be an ad click maximizer.
Clarification: Since so much research goes into building models that maximize advertising results (even being the reason Google started building TPUs), an ad click maximizer is the most likely case of blatant outer alignment failure from a poorly chosen objective.

simulus 17 Mar 2026 7:35 UTC
1 point
0
on: You can’t imitation-learn how to continual-learn
This knocks on the door of a principle that I have been playing with for a while: a good continual learning and/or sequence modelling algorithm should converge to some known behavior. Architectures like attention have undefined behavior once the end of their training context length is reached. SGD, on the other hand, can be run indefinitely, because we know that it will eventually converge to an interpolation of the data.

simulus 28 Feb 2026 5:37 UTC
3 points
0
in reply to: Brendan Long’s comment on: ajskateboarder’s Shortform
Someone I know claims to have found a way to directly pretrain neuralese models: https://aklein.bearblog.dev/zebra/
I’ve seen their prototype, and it definitely works (as far as producing reasonable text outputs while making non-trivial use of >100 continuous latents), but whether it actually amounts to anything remains to be seen.

simulus 26 Feb 2026 19:43 UTC
6 points
0
in reply to: lilkim2025’s comment on: lilkim2025’s Shortform
This appears true at the academic scale, but not at the frontier scale where RL compute consumption is much higher (sometimes even higher than pretraining).
As a counter-example to your evidence, when Nvidia scaled up their RL they found:
Furthermore, Nemotron-Research-Reasoning-Qwen-1.5B offers surprising new insights —RL can indeed discover genuinely new solution pathways entirely absent in base models, when given sufficient training time and applied to novel reasoning tasks. Through comprehensive analysis, we show that our model generates novel insights and performs exceptionally well on tasks with increasingly difficult and out-of-domain tasks, suggesting a genuine expansion of reasoning capabilities beyond its initial training. Most strikingly, we identify many tasks where the base model fails to produce any correct solutions regardless of the amount of sampling, while our RL-trained model achieves 100% pass rates (Figure 4).

simulus 24 Feb 2026 6:44 UTC
18 points
0
on: ghost-in-the-weights’s Shortform
Steerling-8B: The First Inherently Interpretable Language Model
This is probably worth a deeper discussion, but Guide Labs is claiming that their new model is “the first interpretable model that can trace any token it generates to its input context, concepts a human can understand, and its training data”.
Reading the blog, Steerling is basically just a discrete diffusion model where the final hidden states are passed through a sparse autoencoder before the LM head.
They also appear to apply a loss that aligns the SAE’s activations with labelled concepts (correct me if I’m wrong). However, this seems like an obvious example of The Most Forbidden Technique, and could make the model appear interpretable without the attributed concepts actually having causal effect on the model’s decisions.
Can we get some input from interpretability folks? I’m obviously bearish.
Link to release post: https://www.guidelabs.ai/post/steerling-8b-base-model-release/

simulus 9 Feb 2026 5:39 UTC
−8 points
0
in reply to: Hruss’s comment on: Hruss’s Shortform
...

simulus 6 Feb 2026 22:59 UTC
4 points
0
in reply to: DirectedEvolution’s comment on: AllAmericanBreakfast’s Shortform
I wonder if this is an artifact from the training data.
There are probably more edge-case bugs in published code (or even intermediate commits) than there are obvious bugs.

simulus 3 Feb 2026 16:49 UTC
4 points
0
on: ghost-in-the-weights’s Shortform
A recent paper probed LLMs and located both value features (representing the expected reward) and “dopamine” features (representing the reward prediction error). These features are embedded in sparse sets of neurons, and were found to be critical for reasoning performance.
Could these findings have any implications for model welfare?
If a model had mechanisms for “feeling good and bad”, I imagine they would look similar to this.
The paper in question: https://arxiv.org/abs/2602.00986
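For reference, the standard textbook form of a reward prediction error is the temporal-difference error. I am assuming this is the sense the paper intends; the sketch below is the generic TD definition, not code from the paper.

```python
def td_error(reward, value_next, value_current, gamma=0.99):
    """Temporal-difference error: how much better or worse things went than expected.

    value_current is the expected return before the step, value_next the
    expected return after it, and gamma the discount factor.
    """
    return reward + gamma * value_next - value_current

# Positive error: the outcome beat the prediction. Negative error: it fell short.
pleasant_surprise = td_error(reward=1.0, value_next=0.0, value_current=0.5, gamma=0.9)
disappointment = td_error(reward=0.0, value_next=0.0, value_current=0.5, gamma=0.9)
```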

simulus 31 Jan 2026 1:12 UTC
2 points
0
in reply to: Seth Herd’s comment on: Are We in a Continual Learning Overhang?
Yes, I am referring to the lack of learning-to-learn data during initial training.

Your point that humans have built-in mechanisms for continual learning is similar to what I’m saying about inductive biases: if we don’t have the data to train continual learning into models, we need to build it into the architecture.
However, I think the ‘data’ from which humans learn during development (on-policy interactions with the environment with constant feedback and something like rewards) is much more aligned to continual learning than books and pdfs.

simulus 29 Jan 2026 18:25 UTC
5 points
1
on: Are We in a Continual Learning Overhang?
I believe that the biggest bottleneck for continual learning is data.
First, I am defining continual learning (CL) as extreme long-context modelling with particularly good in-context supervised and reinforcement learning. You seem to have a similar implied definition, given that Titans and Hope are technically sequence modelling architectures more than classical continual learning architectures.
Titans might already be capable of crudely performing CL as I defined it, but we wouldn’t know, because we haven’t trained it on data that looks like CL. The long-context data that we currently use looks like pdfs, books, and synthetically concatenated snippets. If you saw a model producing any of that data, you would not consider it to be CL: the data doesn’t contain failures, feedback, and an entity learning from them. If we just trained the architecture on data that looks like CL (which is currently not publicly available), then I think we would have CL.
The obvious solution to this problem is to collect better data. This would be expensive, but the big players could probably afford it.
Another solution that I see is to bake a strong inductive bias into the architecture. If CL is an out-of-distribution behavior relative to the training data, then the best option is an architecture that “wants” to exhibit CL-like behavior. Taken to the extreme, such an architecture would exhibit CL-like behavior without any prior training at all. One example would be an “architecture” that just fine-tunes a sliding-window transformer on the stream of context. Of the current weight-based architectures, I think E2E-TTT is the closest to this vision, since it is essentially meta-learned fine-tuning.
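A toy sketch of the "fine-tune on the stream" idea, under heavy simplifications: a linear predictor over the last few values stands in for the sliding-window transformer, and the stream is a sine wave. Everything here (the function name, window size, learning rate, and data) is an assumption for illustration. The point is only that the model starts untrained and still adapts as the stream goes by.

```python
import math

def stream_finetune(stream, window=4, lr=0.05):
    """Predict each element from the previous `window` elements, then train on it.

    Returns the squared prediction error at every step of the stream.
    """
    weights = [0.0] * window  # no prior training at all
    squared_errors = []
    for t in range(window, len(stream)):
        context = stream[t - window:t]
        prediction = sum(w * x for w, x in zip(weights, context))
        error = prediction - stream[t]
        squared_errors.append(error * error)
        # One SGD step on the squared error for the element just observed.
        weights = [w - lr * 2 * error * x for w, x in zip(weights, context)]
    return squared_errors

stream = [math.sin(0.3 * t) for t in range(2000)]
errors = stream_finetune(stream)
early_error = sum(errors[:50]) / 50
late_error = sum(errors[-50:]) / 50
```

Comparing `early_error` with `late_error` shows the predictor improving purely from exposure to the stream, which is the CL-like behavior "for free" that a strong inductive bias is meant to provide.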
The final solution is to use reinforcement learning instead of pretraining to get CL abilities. If getting high rewards necessitates CL, then we would expect RL to eventually bake in continual learning. The problem is that RL is just so costly and inefficient, and we lack open-ended environments with unhackable rewards.

simulus 29 Jan 2026 5:26 UTC
3 points
0
on: Is the Gell-Mann effect overrated?
Anecdotally, as someone who works on non-AGI-targeting AI research, I find pop-sci articles on AI research to be horribly misrepresentative.
A paper that introduces a new algorithm that guides drones around a simulator by creating sub-tasks might be presented as “AI researchers create a new kind of digital brain—and it has its own goals”. That’s obviously a click-bait headline, but the article itself usually does little to clean things up.
However, I would imagine that AI is currently among the worst fields for this kind of thing due to manufactured hype, culture wars, and the age-old anthropomorphization of AI algorithms.