J Bostock

Karma: 6,344

See https://jonathanbostock.github.io for a window into my soul.

J Bostock 13 Jul 2026 20:20 UTC
16 points
4
in reply to: Buck’s comment on: Buck’s Shortform
This is a shame. It would have been very entertaining to watch you two try and fill a half-hour slot in a packed conference session while not actually disagreeing about anything.

J Bostock 10 Jul 2026 18:11 UTC
2 points
0
on: Where Do LLM Values Come From?
Interesting results RE DPO. OLMo was trained on a mixture of off-policy and on-policy DPO. Off-policy DPO is well-known to be really bad. Based on how many Anthropic Fellows projects use DPO, I suspect that the internal Anthropic pipelines make at least some use of DPO. My tentative conclusion is that if you use entirely on-policy DPO (or at least entirely close-to-policy DPO) then it probably does work. I hold this belief pretty weakly, however.
By close-to-policy DPO, I mean things which break the literal meaning of on-policy, but not the spirit, like:
- Sampling a big DPO dataset from the SFT model, then using that dataset all together, rather than resampling mid-training
- Using system-prompted or hinted rollouts as the “accepted” rollouts rather than relying entirely on natural variation
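As a rough sketch of the first bullet (not anyone's actual pipeline): sample several completions per prompt from the frozen SFT model once, rank them, and keep the best and worst as a fixed preference pair. `sample_fn` and `score_fn` are hypothetical stand-ins for the SFT model's sampler and for whatever judge or reward signal labels the pairs; neither is specified above.

```python
from typing import Callable, Dict, List


def build_close_to_policy_pairs(
    prompts: List[str],
    sample_fn: Callable[[str], str],        # hypothetical: draws one completion from the SFT model
    score_fn: Callable[[str, str], float],  # hypothetical: judge/reward score for (prompt, completion)
    k: int = 4,
) -> List[Dict[str, str]]:
    """Build a fixed DPO dataset from SFT-model samples (no resampling mid-training)."""
    pairs = []
    for prompt in prompts:
        completions = [sample_fn(prompt) for _ in range(k)]
        scored = [(score_fn(prompt, c), c) for c in completions]
        best_score, best = max(scored, key=lambda sc: sc[0])
        worst_score, worst = min(scored, key=lambda sc: sc[0])
        # Skip prompts where every sample scored the same: no preference to learn.
        if best_score > worst_score:
            pairs.append({"prompt": prompt, "chosen": best, "rejected": worst})
    return pairs
```

The second bullet would correspond to drawing only the "chosen" side from a system-prompted or hinted sampler, with the rest unchanged.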

J Bostock 10 Jul 2026 15:03 UTC
10 points
0
on: Introducing the Epoch Capabilities Index (ECI)
Returned to this post today. When I first encountered this, I thought “this seems like just a more vague version of the METR graph”. But actually the ECI has a lot of advantages.
- The method is great because it only requires partial eval coverage. If I’m comparing a collection of models, I can just select some set of evals which cover them all and then merge them. I can’t compare time horizons for models in a meaningful way.
- The METR graph has started to fail now because the time horizons are too long. ECI is still going.
- There’s no vagueness over 50% vs 80% vs 99% time horizons.
Nice work, Epoch.

J Bostock 9 Jul 2026 12:21 UTC
4 points
0
in reply to: Elliott Thornley’s comment on: Tie training can make DPO/RLHF-trained AIs generalize better
This isn’t quite the threat model I normally think about when discussing these kinds of problems. I imagine that we have two variables: looks-good-low-effort and is-good, which are correlated at some pretty high level on a non-pathological dataset, say 0.8. The dataset is then labelled according to looks-good-low-effort. The AI learns to put ~all its weight on looks-good-low-effort because that’s the best possible predictor.
What you want is some set of data which ties on is-good but varies on looks-good-low-effort. Unfortunately, you don’t have access to is-good, but you can access looks-good-high-effort which is correlated at a higher level, say 0.95, with is-good, with the resulting error very strongly correlated with that of looks-good-low-effort. So you produce some items which tie on looks-good-high-effort, and so presumably approximately tie on is-good, but have more noise on looks-good-low-effort.
Cross-domain split also seems like a problem.
I think this maybe makes sense, but I’d like a real test of it before drawing conclusions.
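A toy version of the threat model in the first two paragraphs, which is not the real test asked for here and makes several simplifying assumptions of its own: a linear scorer over two features, scalar labels instead of pairwise preferences, ties constructed directly on is-good (the idealized case, not the looks-good-high-effort proxy), and a squared penalty on score differences within tied pairs as a stand-in for tie training.

```python
import random


def fit_two_weights(rows, targets):
    """Ordinary least squares for two features via the 2x2 normal equations."""
    a11 = sum(x[0] * x[0] for x in rows)
    a12 = sum(x[0] * x[1] for x in rows)
    a22 = sum(x[1] * x[1] for x in rows)
    b1 = sum(x[0] * y for x, y in zip(rows, targets))
    b2 = sum(x[1] * y for x, y in zip(rows, targets))
    det = a11 * a22 - a12 * a12
    return ((b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)


rng = random.Random(0)
n = 5000
rows, labels = [], []
for _ in range(n):
    is_good = rng.gauss(0, 1)
    # Correlation with is_good is 0.8 by construction (0.8^2 + 0.6^2 = 1).
    looks_low = 0.8 * is_good + 0.6 * rng.gauss(0, 1)
    rows.append((is_good, looks_low))
    labels.append(looks_low)  # dataset is labelled by the low-effort judgement

# Without ties: weights are (is_good, looks_low) ~ (0, 1).
w_plain = fit_two_weights(rows, labels)

# Tied pairs share is_good and differ only in looks_low, so the feature
# difference is (0, d) and the target score difference is 0.
tie_strength = 100.0  # assumed penalty weight
scale = tie_strength ** 0.5
tie_rows = [(0.0, scale * 0.6 * (rng.gauss(0, 1) - rng.gauss(0, 1))) for _ in range(n)]
w_tie = fit_two_weights(rows + tie_rows, labels + [0.0] * n)

print("no ties:  ", w_plain)
print("with ties:", w_tie)
```

In this setup the tie penalty moves the weight off looks-good-low-effort and onto is-good, but only because the model was handed is-good as a clean feature, which is exactly what the real setting lacks.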

J Bostock 7 Jul 2026 22:25 UTC
6 points
−6
on: Visioning: Concretely Imagining What You Want
Having thought about this some more, I now roughly believe:
- What Gretta and John are doing is very likely (85%) basically fine or net positive for them
- What Gretta and John are doing is most likely (70%) basically fine or net positive for a random person they teach
But
- Some small proportion of people (0.5-15% 90% CI, weighted towards the bottom) who use this based on this post, or similar online dissemination (not in-person) will have a net negative experience because of it
And
In a way I cannot really put my finger on, this smells adjacent to the tactics that rationalist-adjacent cults (Black Lotus, Leverage, Zizians) like to deploy. “You need to break out of what society wants you to want” + “this will energize you” trips this for me. If I imagine that in four years I’m reading an exposé of a cult that spun out in 2026, I can very easily imagine a line like “X made us all sit down and think about what we really wanted. We did this for hours a day. Sometimes some people would share it with the group and we’d be told we hadn’t really broken free of societal expectations. Some people claimed it made them more energetic, more powerful.” Obviously the way I’m describing this is very different from how you are doing it, but this seems like the kind of thing which spirals from 1 hour/day to 8 hours/day, when social groups hit the cult attractor. It feels particularly risky because, unlike other John Wentworth mental technique posts, it doesn’t really come with many gears or caveats. It seems much more “in itself” in a weird way.
Disclaimer 1: I have a very fuzzy picture of this technique and might have gotten it all wrong. This comment is really just explaining my immediate ick with the post, and might partly be rationalization.
Disclaimer 2: I know nothing about those cults I mentioned other than what is publicly available online. I don’t have personal experience with them or their members.

J Bostock 7 Jul 2026 18:40 UTC
2 points
0
in reply to: Gretta Duleba’s comment on: Visioning: Concretely Imagining What You Want
Given your comment, I will correspondingly downgrade my estimate that this is doing that, or maybe I’m pointing at something colloquially similar to hypomania but which actually doesn’t have those downsides. Perhaps I am reading too much into this; I don’t on reflection think HPMOR actually induced clinical hypomania in me. Maybe more like a kind of overconfidence in my own importance? I’m also not sure whether this kind of thing is actually bad or not (HPMOR definitely worsened my epistemics in the short term, in part due to this effect).

J Bostock 7 Jul 2026 14:49 UTC
3 points
−1
on: Tie training can make DPO/RLHF-trained AIs generalize better
So your method for preventing the model from generalizing from spurious correlations is to change the dataset so that the spurious correlation is no longer as strong? This seems somewhat like cheating to me. Is there some practical example you can give where this would deal with e.g. the spurious correlation between “This outcome looks good to a fallible human judge” and reward, which would nudge the generalization back towards “This outcome actually is good”?

J Bostock 7 Jul 2026 11:06 UTC
9 points
0
in reply to: Morpheus’s comment on: Visioning: Concretely Imagining What You Want
You’re right, this probably was somewhat predictable. I think what’s happened is that my woo-alarm is binarized in an unhelpful way. Wizard power seems obviously metaphorical in a way that doesn’t trip it. Fantasy → planning likewise seems like a mental exercise that builds up muscles for non-fantastical things.

J Bostock 6 Jul 2026 23:24 UTC
23 points
−1
on: Visioning: Concretely Imagining What You Want
I notice I am confused. This does not seem like a very John Wentworth-coded practice, from the limited view of John I’ve gotten through his writing. This seems much more woo-ish than that, and kind of ridiculous at a first pass? Like John spends some time sitting around imagining living in a volcano tower? And this makes him more effective? Huh?
My deflationary hypothesis is that this is inducing something like hypomania. I’ve gotten a similar-seeming feeling from reading the really ambition-inducing parts of ratfics like HPMOR.
This also feels like a technique running in the opposite direction to basically every rationalist technique that I’ve managed to get mileage out of. Almost everything I find useful feels like a process of mental cooling. This seems like the opposite? A mental heating?
Perhaps this is useful in the same way that bipolar people are (allegedly) the most creative, since repeated hypomania and depression lead to cycles of babble and prune/annealing/whatever you want to call it.

J Bostock 6 Jul 2026 22:38 UTC
2 points
0
on: Bounding eval awareness of ~human-level AI across the safe-to-dangerous shift
Does this solve the problem of deployment awareness though? Since the model only needs to attack once (in a high-stakes setting) it can, in theory, wait for a truly implausibly un-fakeable scenario to occur during deployment (e.g. being deployed in an environment far too complex to be a sandbox, with too many other agents, etc.). It seems statistically impossible to bound how deployment-like the most-deployment-like of any possible deployment is out of, say, 10^9 deployment instances, with much less than 10^9 evaluation instances.
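A back-of-envelope version of the counting argument, under the most evaluator-friendly assumption I can think of: every instance gets a "deployment-likeness" score, and evaluation and deployment scores are i.i.d. draws from the same continuous distribution. Then the single most deployment-like instance is equally likely to be any one of them, and each deployment instance beats every evaluation with probability 1/(n_eval + 1). The 10^6 evaluation count below is an illustrative guess, not a number from the post.

```python
def prob_most_deployment_like_is_deployment(n_deploy: int, n_eval: int) -> float:
    """P(the top-scoring instance overall is a deployment), i.i.d. scores assumed."""
    return n_deploy / (n_deploy + n_eval)


def expected_deployments_beyond_every_eval(n_deploy: int, n_eval: int) -> float:
    """Expected number of deployments scoring above all evaluations, i.i.d. scores assumed."""
    return n_deploy / (n_eval + 1)


n_deploy, n_eval = 10**9, 10**6  # n_eval is an assumed, illustrative value
print(prob_most_deployment_like_is_deployment(n_deploy, n_eval))
print(expected_deployments_beyond_every_eval(n_deploy, n_eval))
```

So even with evaluations that are perfectly matched in distribution, around a thousand deployment instances would look more deployment-like than anything in the evaluation set; any real distribution shift only makes this worse.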

J Bostock 6 Jul 2026 22:06 UTC
5 points
1
on: A Review of Anthropic’s Global Workspace Paper
This seems like a straightforward upgrade to the logit lens, and one which can be deployed very cheaply after the compute to train it has been amortized. I’m certainly going to train a suite of these on the models my team uses a lot; hopefully we can get a bunch of them on Neuronpedia or some other community interp repository, as well as Hugging Face.
Neuronpedia already did this!

J Bostock 5 Jul 2026 15:32 UTC
5 points
5
in reply to: DirectedEvolution’s comment on: Jemist’s Shortform
I agree with all of these, but most of these problems would be equally bad whether was or rather than . What I’m disputing is the argument from “Look at this large number of things which we cannot individually measure”.
(I was tempted to say that disappearing polymorphs are an example, because if we had times as many atoms, then it would take much less time for the first crystal of the stable state to appear, but then we just wouldn’t have discovered the meta-stable state of that particular molecule at all. Instead, in world we’d run into problems with a different polymorph which would take years to disappear in our universe.)

J Bostock 5 Jul 2026 14:02 UTC
10 points
6
in reply to: Thomas Kwa’s comment on: Conversation Among Cade Metz, Michael Vassar, Jessica Taylor, and Zack M. Davis
I don’t think Metz can “see past” Michael Vassar. I think Metz is executing an adaptive strategy which boils down to do not engage with this individual. Metz is not modelling what Vassar believes in any detail; he is essentially just nodding along. This is a locally smart strategy given what happens to (some subset of) people who do engage with Vassar’s ideas, but it’s a rock-like strategy rather than an intelligent one.

J Bostock 5 Jul 2026 13:43 UTC
2 points
0
in reply to: Steveot’s comment on: Jemist’s Shortform
This is true of most existing “classical” NTK work, yes. It is not true of all work: mean field theory is data-dependent and works for finite-sized networks. I can imagine some theory coming out of the natural latents work which makes data-dependent predictions which are somewhat size-independent, or at least interact with size in a predictable way.

J Bostock 5 Jul 2026 10:59 UTC
5 points
0
in reply to: Andrii Vasylenko’s comment on: Jemist’s Shortform
That is true, but I highly doubt that statistical mechanics-flavoured results—for example facts about the order of feature learning under the AdamW optimizer as opposed to the Muon optimizer—are sensitive to the underlying intelligence of the thing being implemented on the weights.
I’m not trying to globally claim that alignment of superintelligence will be easy, or even possible, I’m trying to locally claim that the argument from too many floating point numbers is not a compelling one. Most likely the fundamental objects that are worth studying are far, far above the level of individual floating point numbers in the weights in the same way that the fundamental objects of study in gas physics sit far above the level of individual gas molecules.

J Bostock 5 Jul 2026 10:29 UTC
11 points
1
in reply to: Stephen McAleese’s comment on: Stephen McAleese’s Shortform
This is not actually the finding in the paper! The relevant figure is Figure 4, in which up-sampling misalignment discussion causes a slight decrease in misaligned behaviour after DPO has been applied. The real relationship seems weird and U-shaped (no discussion is slightly better than some discussion, but lots of discussion is better still).
The robust result is that discussion of aligned behaviour definitely does reduce misalignment. Discussion of misaligned behaviour makes it a more salient concept, which actually gives a better handle for DPO to grab onto to reduce rates of misalignment. Same for discussion of aligned behaviour.

J Bostock 4 Jul 2026 23:19 UTC
30 points
7
on: Jemist’s Shortform
“Humans cannot secure matrices of 175B floating point numbers” seems kind of isomorphic to the statement “Humans cannot secure engines containing 10^23 gas molecules”. There are often laws which govern large ensembles of things which are simpler than counting up every single thing in the ensemble.
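A toy illustration of why ensembles can be easier to reason about than their parts (this is just the law of large numbers, and says nothing about neural network weights specifically): draw N independent per-molecule quantities from a standard normal and look at how much the ensemble average varies between trials. The function name and parameters are made up for the example.

```python
import random
import statistics


def spread_of_ensemble_mean(n_particles: int, n_trials: int = 100, seed: int = 0) -> float:
    """Standard deviation, across trials, of the mean of n_particles i.i.d. N(0, 1) draws."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(n_particles))
        for _ in range(n_trials)
    ]
    return statistics.stdev(means)


small = spread_of_ensemble_mean(10)
large = spread_of_ensemble_mean(1000)
print(small, large)  # spread shrinks roughly like 1 / sqrt(n_particles)
```

Each individual draw is as unpredictable as ever, but the aggregate gets more predictable as the ensemble grows, which is the sense in which 10^23 molecules are easier to handle than 10.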

J Bostock 4 Jul 2026 18:15 UTC
12 points
−2
in reply to: Jason R Brown’s comment on: Jemist’s Shortform
I have a horrible recurring thought that in the small number of worlds where we actually figure out some physics-of-LLMs persona-based alignment strategy which does some kinda leveraging of the pretrained prior into an alignment solution, there’s an even smaller set of worlds where we die anyway because a research engineer flips a config from --adamw to --muon to squeeze an extra 2% performance out of the GPUs.

J Bostock 4 Jul 2026 17:59 UTC
2 points
0
in reply to: Linch’s comment on: Linch’s Shortform
The skill of conceptual thinking is fairly mysterious to me. I think I’m differentially good at it, based on feedback from colleagues, but that’s not super clear.
Conceptual thinking kind of feels like annealing from the inside. I start with a bunch of ideas (e.g. natural latents, mean field theory, neural tangent kernels/influence functions, midtraining, inductive bias) and just wiggle them around to see what fits, ironing out the local inconsistencies in my worldview. As I spend more time turning an idea around in my head, my brain builds a better set of hashmaps for the concepts involved, so I can “zoom out” further and further, keeping more of it in view at any one time. Then I can resolve inconsistencies at larger and larger scales. Eventually the whole thing is globally consistent, and the final shape of it is a new mental object I can work with.

J Bostock 4 Jul 2026 17:54 UTC
2 points
0
in reply to: anaguma’s comment on: Jemist’s Shortform
Yeah that’s the one. My Claudes have been calling the specific optimizer something like Riemannion but that might be a Claudeism.