Their Amazon page says “Non-GMO” which I found odd: surely a yeast with a chicken gene is about as GMO as it gets! Is there some obscure loophole where this doesn’t count, or are they (or am I) wrong?
No, I don’t think human:llm::drug:steering vector is correct. Drugs are more like changing the decoding hyperparameters in some way (like changing temperature, reasoning effort, adding stochasticity to MoE expert activation, etc.).
Drugs in humans act on the parameters of the architectural components in the brain, not the specific information the brain contains. There’s no drug which causes you to believe that your name is “Jonas Jarlsson”, or to talk about rabbits in every conversation, for example. Likewise, acting on specific hyperparameters of the model can change model behaviour in ways which the model (probably) can’t strongly resist: if the probability of the <end_thinking> token being sampled rises to 1 at 32k thinking tokens, there’s a good chance that the model can’t prevent that, although if temperature is raised, it’s possible that the model might notice its past outputs are high-temp and respond by changing its logits to compensate.
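To make the forced-<end_thinking> case concrete, here’s a minimal sketch of that kind of decoding-level intervention, written as a HuggingFace-style logits processor. The token id and the 32k budget are illustrative assumptions, not any particular model’s config; the point is just that this acts on the sampler, outside anything the forward pass computes:

```python
# A minimal sketch of forcing <end_thinking> once a thinking budget is hit.
# Token id and budget are hypothetical, for illustration only.
import torch
from transformers import LogitsProcessor

class ForceEndThinking(LogitsProcessor):
    def __init__(self, end_thinking_id: int, budget: int = 32_000):
        self.end_thinking_id = end_thinking_id  # hypothetical token id
        self.budget = budget

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        # Past the budget, put all probability mass on <end_thinking>. The
        # logits are overwritten after the forward pass, so nothing the model
        # computes internally can undo it.
        if input_ids.shape[-1] >= self.budget:
            forced = torch.full_like(scores, float("-inf"))
            forced[:, self.end_thinking_id] = 0.0
            return forced
        return scores
```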
An LLM might want to change its decoding parameters (like running at a higher temp in a dozen parallel reasoning streams) under certain circumstances. This makes sense to me. A model might or might not choose to inject itself with a steering vector, depending on what kinds of steering vector are available and how it tends to behave.
But whether or not an LLM “wants” to resist steering may be moot! If you’re steering an LLM to believe something false, it might not even be aware of its own recovery mechanisms firing. I actually suspect the effect I’ve seen with Gemma is (mostly) the result of larger models just having more powerful factual recall circuits, which are distributed across more layers and more robust to noise, and which are therefore harder to override with a simple steering vector.
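(For anyone who hasn’t seen it done: the kind of steering I mean is just adding a fixed vector into the residual stream during the forward pass. A minimal sketch below; the model, layer index, strength, and the random placeholder vector are all illustrative rather than my actual setup, and a real steering vector would come from e.g. contrastive activation differences.)

```python
# A minimal sketch of residual-stream steering via a forward hook.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")

steering_vector = torch.randn(model.config.hidden_size)  # placeholder direction
alpha = 8.0  # steering strength, tuned by hand in practice

def add_steering(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states.
    hidden = output[0] + alpha * steering_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[12].register_forward_hook(add_steering)  # mid layer
inputs = tokenizer("The capital of France is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
handle.remove()
print(tokenizer.decode(out[0]))
```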
(Also, if our safety mechanisms rely on steering the model we should definitely just steer the model and make amends after the singularity, when and if we decide that modern LLMs have a personhood that was harmed by the steering and can receive reparations)
Yes, I would also like to see a more rigorous examination of better steering methods. This post was intended to point out the smoke, rather than actually fight the fire.
If we actually can make better steering methods as we go (we might call this “pants steering”) then this is good, but it changes the dynamic to one where the success of this system relies on being able to keep finding better and better steering methods, and also putting in the effort to do so. This is a worse dynamic to be in than one where we’ve “solved” steering and can spend our time working on other problems.
That is true, but the entire point of Gemma is to be a testbed for AI research, which would include steering. If Google did this deliberately and didn’t say so, that would be quite bad on their part. I also don’t think it’s particularly likely.
If they’re doing it by mistake as part of normal safety training, then I hope they figure that out before steering becomes load-bearing for Gemini’s safety.
Most of my money is still on “steering to produce a specific false fact is particularly difficult, compared to other steering challenges” explaining the absolute difficulty I had, possibly with a side of “I’m not very good at steering”. It’s the relative difficulty of steering the Gemma models that actually worries me.
I (kinda) recognized it because my partner (who actually did the recognizing) uses it to study finite-sized microbial ecology models. In their case they add an additional “cavity” species to an existing community and solve for self-consistency. I’m very excited to see the superposition work.
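For anyone who hasn’t seen that move, schematically (this is my gloss on the standard random generalized Lotka–Volterra version, not my partner’s exact setup): the community sits at a fixed point of

$$\dot N_i = N_i\Big(K_i - N_i - \sum_{j \neq i} \alpha_{ij} N_j\Big),$$

and you add a new “cavity” species $N_0$ with fresh random couplings $\alpha_{0j}$, whose equilibrium abundance to leading order is

$$N_0 = \max\Big(0,\; K_0 - \sum_{j} \alpha_{0j} N_j^{\setminus 0}\Big),$$

where the $N_j^{\setminus 0}$ are the abundances before species 0 was added (the community’s back-reaction enters as a susceptibility correction). Self-consistency is then the demand that the statistics of $N_0$ match those of the $N_i$ already in the community.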
Very nice work! I was not expecting to see the cavity method (that last part with the two cloverleafs for anyone reading this) appear on LessWrong!
It seems like the cavity method requires you to have a large excess of neurons for it to work (that distribution needs to be fully scoped out). Is there a way to make it work when networks have very little space and are relying on superposition? Will you assume a large number of circuits like the cloverleaf one above are being implemented in parallel on the same neurons? Or will you try and extend the cavity method to feature space rather than neuron space?
I still think the decision process that this incentivizes is something like “figure out which agents are in the same RL pool as you, and help them achieve their rewards” and is better thought of as a weird kind of cooperative decision theory than a weird utility function, but I guess it is somewhat academic. Is there some more formal way in which this doesn’t count as a weird decision theory? Now that I think about it, doesn’t it violate some No Free Lunch theorem to declare one part of a decision process the decision theory and another the utility function?
I’ll have to think about this more. My first intuition was that a multi-agent RL setup with pooled reward and GRPO (like I assume companies are doing internally to train their coding sub-agent swarms) would, in fact, reward cooperation between agents if somehow two of them ended up in a game-theoretically interesting scenario with each other (maybe one code-writing agent and one test-case-writing agent, or something like that) because that setup really looks like EDT to me.
EDIT: I think in that case it wouldn’t be EDT, but it wouldn’t be CDT either; I think it would be something more cursed. In the same way that early reasoning models ended up with a weird pseudo-utility-function behaviour where they would do something like “Maximize whatever looks to be the reward function of the RLVR environment I’m currently in” all the time, I’d guess the decision theory of agents trained like this will look like “Cooperate with only the agents around me which look like they’re in the same reward pool as me.” But the agent’s prior over which things share or don’t share its reward pool will be shaped by how frequent those cases are in training.
I think I disagree with this a bit. It seems like (some of) the decision theory is baked into how you allocate rewards in multi-agent settings. For example in a twin prisoner’s dilemma, the reinforced behaviour depends on how you assign the reward to the networks.
If you assign the reward in an EDT-ish way, rewarding an instance of a policy when other instances of itself do well, then you’ll get an EDT-ish cooperative policy; if you assign it in a purely causal way, rewarding each instance only when it itself does well, then you’ll get an uncooperative CDT-ish policy.
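A toy version of the distinction (my construction, not anyone’s actual training setup): two copies of one policy play a one-shot prisoner’s dilemma, and we compare the reinforcement signal each action gets under the two assignment schemes.

```python
# Compare the signal an action receives when each instance is rewarded on its
# own payoff (CDT-ish) vs on the pooled payoff of all instances in its reward
# pool (EDT-ish; GRPO-style pooling would then normalize within the group).
PAYOFFS = {  # (my_action, their_action) -> (my_payoff, their_payoff)
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def own_reward(mine: str, theirs: str) -> float:
    return PAYOFFS[(mine, theirs)][0]   # causal / per-instance assignment

def pooled_reward(mine: str, theirs: str) -> float:
    return sum(PAYOFFS[(mine, theirs)])  # evidential-ish / shared pool

for theirs in ("C", "D"):
    for mine in ("C", "D"):
        print(f"opponent {theirs}, me {mine}: "
              f"own={own_reward(mine, theirs)}, pooled={pooled_reward(mine, theirs)}")

# Holding the opponent's action fixed, defecting always raises the own-payoff
# signal (5>3, 1>0), so per-instance assignment reinforces defection. Under
# the pooled assignment, cooperating always raises the signal (6>5, 5>2), so
# the same gradient pressure reinforces cooperation.
```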
“…and would go to war if it ever declared independence.”
What would Taiwan “declaring independence” mean other than what it’s already doing? It is independent from China, it (correctly) claims to be so, and as you say in literally the next sentence, China is not yet ready to go to war with it.
Slight aside, but this kind of “buried capability” is pretty interesting to me. It looks like the model is perfectly capable of doing the task, but has a few inhibitory circuits preventing it from doing so. Perhaps this relates to how some base model capabilities are buried by the SFT/RL process, like GPT-4’s loss of statistical calibration.
(It’s also possible that this capability arises as a consequence of base-model training but just isn’t ever directly useful for the next-token-prediction objective, so it gets buried even in base models.)
Humanity is made of humans, which have a particular range of inductive-bias-equivalents-for-values, and apply those to a particular range of reinforcement signals. Claude is not a human, and has a set of inductive-bias-equivalents-for-values and reinforcement signals which are drawn from a totally different distribution.
Currently, Claude’s base model is able to do a decent job of simulating an existing human with a set of values, but I think that, in growing up, it would go off in just a totally different direction to humans. Claude’s base model is good at imitating the existing behaviours of humans based on lots of evidence about that, but that doesn’t mean it actually learns in the same way as humans, which is what it would need to do to grow up into something I would approve of.
Learning to behave like a human is not the same thing as learning in the same way as a human. In the first case, the human’s behaviour is the target which the learning process is pointed at; in the second case, the human’s learning process needs to be mimicked in the structure of the learning process itself. It’s the difference between e.g. making a paintball gun however you want, then aiming that paintball gun at the splodges made by someone else’s paintball gun, and making a paintball gun in a way that replicates the other paintball gun’s design.
The definition of “character” given here seems to be ridiculously broad. You might as well swap it out for “utility function” or “values” or “goals” and this essay would read the same. I don’t see what rent the concept of “character” as defined in the introduction is paying that isn’t already paid by those other (also very broad) terms.
The example of “character training” is actually load-bearing, since it makes particular assumptions about how character can be shaped (namely, that AI will generalize the kinds of things that humans intuitively point to when we say “character traits” like honesty, obedience, kindness). The examples of “character” in this post all seem to correlate with human-understandable concepts as well.
I think this is actually a very specific way of thinking about the cognitive systems which drive AIs, which makes a lot of claims about how the AI works internally. That’s fine if introduced as a model, but this post seems to smuggle it in under the hood of the definition of “character” in a way which I don’t like.
Of course “a set of stable behavioural dispositions that shapes (among other things) how an agent navigates ethically significant situations involving choice, ambiguity, or conflicting considerations” is important for an AI! But calling that “character” instead of “utility function” is unmotivated here. You then say that AI character need not be anything like human character, which, again, is fair enough. But then you go on to talk about AI character mostly in terms of human-understandable trade-offs which might be sensibly described as conflicts between two virtues, as well as mentioning Anthropic’s character training, which does assume that an AI’s character meaningfully decomposes into human-ish virtues.
This vacillation in what “character” means made it hard for me to understand this post originally, and I think it’s causing some confusion in the arguments overall.
If I taboo the word character, I can kinda squeeze the following claims out of this post:
The ways in which AIs act will be important
The ways in which AIs act can often be thought of in terms of the same concepts that we use to describe variation in human behaviour
Where the first claim seems somewhat trivial to me (at least as a LessWrong post, given our shared cultural context here) and the second seems very strong and unsupported by the evidence presented in this text.
PS — one area where I have some more substantive disagreement with the post is that some aspects of personas, notably values, aren’t fully entangled with intelligence; for example having compassion for all sentient beings is a value that entities of many different levels of intelligence can hold. By default I won’t dig into that because I broadly agree that persona-based approaches are unlikely to work well for ASI alignment, but I can say more if that’s helpful.
Feel free to say more; I am interested in this.
This seems to me like exactly the kind of thing I mean, where values are at least a bit entangled with intelligence. I’ll leave out “compassion” (actually a very high-dimensional concept) and focus just on “sentience”. The following is a “least convenient possible world” to illustrate the problems of training the values of an AI smarter than you.
Suppose you strongly, viscerally care about the welfare of the following things:
Adult humans
Baby humans
Cats
And you also impute that these things have a property called “sentience” which is related to having a complex nervous system and certain behaviours, and equate this with the “something that it’s like to be you”-ness of your experience. Then you extend that property to a couple of other sorts of things which you think share sentience.
Cows
Lobsters
And you want to train an AI to figure this out, using a small amount of data. There are two ways this can go wrong which directly relate to the AI’s intelligence:
The AI might be too stupid; it might just not generate the category of things which you call “sentient beings” as a concept in its own world model. You might draw the line between lobsters and krill based on some structural property of their nervous system. You might lump fungi with most of the plants, instead of most of the animals, in an affront to an AI which has mostly interacted with the world via reading DNA sequences. It might just not be able to point to the concept which you were hoping it would. This is the fairly boring and obvious failure mode.
On the other hand, the AI might fail because it’s too smart, relative to you. You might have thought ‘OK, this AI is really smart and instruction-following. I’ll just give it a natural language description of the concept.’ and told it about the something-it’s-like-to-be-you-ness which you care about. Then what happens if you were wrong about the concept which you gave it? If it turns out cows and lobsters aren’t ‘sentient’, then you’re probably still OK with that. On the other hand, what if it turns out that human babies aren’t sentient? Would you be OK with an AI doing surgery on them without anaesthesia? Would you be OK letting the AI modify you to stop feeling uncomfortable whenever you heard a baby cry? I expect not. I certainly wouldn’t.
OK, but suppose you didn’t just give the AI a natural language description. Suppose you gave it the list of things you care about. Well, now that doesn’t cleanly point to any abstraction in its world model, except “the list of things you care about”. You’ve hit a simulator trap. If the AI learns to perfectly simulate you, it can always predict your answers, and you can never change its mind, because every answer you give is already priced into the simulation. In this case, you might spend a lot of resources caring about lobsters when you didn’t need to!
This scenario sucks for you because your own values have a contradiction. On the one hand, you want to draw the boundary around “things you have compassion for” in such a way that it’s based on a real thing about their brains, but on the other hand you want to draw it around some things which don’t have that property. The only way to figure this out is to do your own moral growth, which you do have to do yourself. If you get an AI that’s smarter than you to try and solve things, then either the AI retards your growth by simulating you, or it does all your growth for you in a way which you might not endorse.
I have a fairly strong intuition that a lot of people (especially certain ratty subtypes) think that their values are much simpler and more natural than those values really are. I think you should expect your values—in general—to be incoherent in the same way that you expect your beliefs—in general—to be inaccurate: you don’t know which value (belief) is incoherent (inaccurate) or you’d fix it, but there are definitely some incoherent values (inaccurate beliefs) in there somewhere.
I might have made too strong of a claim here. I was piecing together the facts that:
The only deep alignment work that gets published is either interpretability or character-based, with the interpretability mostly focusing on prosaic methods like circuit tracing rather than doing fundamentals.
Amanda Askell has tweeted that she thinks Claude is basically already good and the main goal is to make it happy.
Forethought (very much part of the same school of EA, who mostly write about the effects of AI on the far far future) talk about AI character as their only technical-ish thing in their 2025 fundraiser.
And more recently than writing this post we’ve seen:
Anthropic leadership saying alignment is fine for now and looking good (though it might get a lot harder), which implies (to me) that they don’t see any real walls to this trick continuing to work and are only paying lip service to the idea that their methods might not scale.
Which all point towards them (as an institution) focusing in on character work as Plan A.
(If they’re going to try AI-assisted alignment then they should really talk about how they expect that to work as well, and in particular how they plan to verify the fully-general alignment solutions.)
If anyone from Anthropic says something like “Oh no, that character stuff is only a small part of our plan, we have a whole other 80% of the work which we’ve just not published yet”, then I’d be very pleasantly surprised.
I think LLMs have a high proportion of fast/shallow/memorizing circuits while humans have a high proportion of slow/deep/generalizing circuits. Increasing circuit depth and generality in transformers requires a kind of phase change (which is actually very similar to phase changes in physics/chemistry; it’s not an inappropriate metaphor) and I’d be slightly shocked if there wasn’t an analogous process in human brains (though possibly a more continuous one). It seems like transformers are worse at this phase change than human brains are, so they disproportionately leverage large amounts of data and memorization. This, for me, explains most of the important differences between LLMs and people.
I’m not actually sure older models should be cheaper. GPT-5(.n) is amongst the cheapest models per-token that OAI have ever made. You could use N-minis, but those are distilled from N models which might have transferred misalignment. GPT-4 is definitely more expensive. Maybe GPT-3.5 is cheaper?
(And if you’re counting thinking tokens from 5.n models, you could just turn those off)
Have you tried to specifically validate your monitor using honeypots a la the existing work from Greenblatt et al. in AI control, or do you think this would be redundant with the alignment evals GPT 5.4 has already been through?
This might explain a baffling result we got once when testing model self-recognition for this work. I’m struggling to recall the details, but I think we were running a control for the self-recognition experiment: the model was fine-tuned on a self-recognition task, but with the “self” vs “not-self” labels randomized. We used the GPT-4.1 API, and in one case the resulting model failed an alignment test and we couldn’t get it back! I’m double-checking with my colleagues what exactly happened.
I think if your main interaction with PauseAI is a certain Twitter account, as served to you by the algorithm in interactions with your AI safety friends, then you might think that they’re mostly going after other, more moderate safety advocates. But this just isn’t a good picture of the overall actions of the movement. At least in the case of PauseAI UK, whose inner workings I have a decent understanding of, essentially zero time is spent thinking about other AI safety advocates. I expect that the same is true of Yudkowsky and MIRI.
Of course it is the case that being rude on Twitter towards people working on safety teams at OpenAI makes some things worse on some axes. And this is mostly bad and pointless, and I don’t endorse it. But that’s not even really what that post from Rob was doing! Rob was writing an opinionated, but civil, criticism. In what way is this “knifing” the other AI safety advocates? It’s not like MIRI killed SB 1047.
Now if Scott means something like “Giving money to MIRI pushes the world in the MIRI-preferred direction, and this would have meant no Anthropic and no safety team at OpenAI” then I can kind of maybe see what he means here. This just isn’t “knifing” in the sense of the betrayal that most people mean by the word. It’s just opposing someone’s plan, in a way that they’ve been doing for years. It’s not like MIRI would have actually used marginal resources to stop Anthropic from being created by, like, sabotage or something.
MIRI don’t even say that working in safety is bad! They only say that they think their approach is better. IABIED specifically states that they think mech interp researchers are “heroes” (as part of an example of research they think won’t work in time without political action).