Wei Dai comments on Wei Dai’s Shortform

Wei Dai 7 Mar 2026 6:06 UTC
LW: 22 AF: 8
3
AF
It seems that LLMs are not good enough at reasoning, even after being trained on ~all human output, such that you couldn’t amplify their capabilities to arbitrary levels through iterated amplification, so AI companies are mainly increasing AI capabilities via RLVR instead. Is this impression wrong, and how to update on it if not?
Aside from the potential implications on alignment (i.e., closing off one approach that seemed hopeful for some, at least for the foreseeable future), I wonder if this is a deficiency in LLMs (their architecture or how they’re trained), a deficiency in human reasoning (i.e., even HCH was never going to work), or a deficiency in reasoning itself (something like, there is no such thing as a reasoner that can be amplified to arbitrary capability levels purely through iteration)?
- the gears to ascension 7 Mar 2026 8:03 UTC
  3 points
  1
  Parent
  I think some mix of your options:
  1. there is definitely a fixable deficiency in LLMs. but, for HCH,
  2. HCH seems bottlenecked by 3:
  3. reasoning doesn’t compensate for lack of evidence (other than by making v-information converge to shannon information).
    3.1. also, it’s often massively compute-cheaper to just go get more evidence from reality than to figure it out by thinking.
    3.2. when chaos is involved, one often has less information than one might naively appear to; even to idealized shannon reasoning ie solomonoff induction, assuming we really live in a probabilistic universe (which we sure seem to), then measurement uncertainty is guaranteed to blow up if predicting the weather from finite samples.
  But also, like, a lot of capability gains are coming from “take a long time to figure this out, train to figure it out more quickly than that” as a core part of what’s going on inside RLVR. So I don’t think it’s a total wash either.
  - Mateusz Bagiński 7 Mar 2026 11:22 UTC
    2 points
    0
    Parent
    HCH seems bottlenecked by:
    Looks like you meant to write something more here? Or is the bottleneck what you wrote in point 3?
    But also, like, a lot of capability gains are coming from “take a long time to figure this out, train to figure it out more quickly than that” as a core part of what’s going on inside RLVR. So I don’t think it’s a total wash either.
    This is true. But also, in order for this to work in the case of a freaking big [space of things to be figured out], you need to start with some [good search heuristics]/[capacity to predict which search paths are promising to pursue]. It seems to me that tons of available [easily checkable-for-validity examples] on the internet of math, programming, etc, suffice to give you such heuristics/prediction/planning skills in the human range and somewhat extrapolate/enhance/refine them (via the “how could I have thought that faster?” method, as you say), but that this approach has its limits.
- StanislavKrym 7 Mar 2026 7:19 UTC
  1 point
  0
  Parent
  I suspect that this question is far too abstract.
  1. The “all human output” is such that, for example, the total amount of high-quality web text in the whole world is estimated to be at most ~160T tokens. For comparison, DeepSeek V3 was pretrained on 14.8 trillion tokens, KimiK2 was pretrained on 15.5 trillion tokens. As far as I understand, this fact alone likely prevents companies from training models with more than ~10T parameters.
  2. What I don’t understand is how IDA is to be done and how different the results of IDA and RLVR are. As far as I understand, pretraining on successful outputs increases P(next token of a successful output|previous tokens of the same output). I suspect that RLVR affects the model in a similar way, but I would rather see you describe a potential iterated amplification mechanism.
  3. I described some problems of CoT-based LLMs in response to a post by Seth Herd. Unlike the AIs which solve problems by stuffing tokens into the LLMs, watching the LLMs spit out the next token and come out none the wiser, human brains are wildly neuralese nets which are also capable of learning online. This suggests a deficiency in SOTA LLMs as compared to the humans.
  4. As far as I understand wholesale HCH, it has a brainlike AI give an answer, another brainlike AI read the answer, formulate another answer tossed to the third AI etc until the procedure converges. The lack of telepathic communication makes the procedure similar to the problems of LLMs. A human thinking for longer retains intuitions about the question and about similar things, increasing understanding and making the right answer more likely.
  5. As for reasoning itself, I would rather see the description of the iteration process.