See https://jonathanbostock.github.io for a window into my soul.
J Bostock
On “Model Organisms”
Your Model Organisms Might Be Fried
Some of them are gonna be coming out in like two days, I think.
Consistency across question rephrasing is something which my colleagues and I have been looking at. It scales pretty clearly with model capabilities, which I expect confounds a lot of the results here.
Empirically, it’s also surprisingly low for the questions we studied, which were of the form “Which of A or B is more Y?” compared to “Which of A or B is less Y?” We were mainly studying comparisons for a fixed Y across many choices of A and B, and only tangentially looked at question consistency.
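A minimal sketch of one way to score this kind of consistency, assuming each (A, B) pair is asked once in the "more Y" framing and once in the "less Y" framing. The function name and the toy answers are hypothetical, and this is not necessarily the metric we used. The two framings agree on a pair exactly when the model picks different options.

```python
def rephrasing_consistency(more_answers, less_answers):
    """Fraction of (A, B) pairs on which the two framings agree.

    more_answers[i] is the pick ("A" or "B") for "Which of A or B is more Y?"
    less_answers[i] is the pick for "Which of A or B is less Y?"
    The framings agree on a pair when the two picks differ.
    """
    if len(more_answers) != len(less_answers):
        raise ValueError("answer lists must be the same length")
    if not more_answers:
        raise ValueError("need at least one pair")
    agree = sum(m != l for m, l in zip(more_answers, less_answers))
    return agree / len(more_answers)


# Toy data (made up): four pairs, two of which are answered consistently.
more = ["A", "A", "B", "B"]
less = ["B", "A", "A", "B"]
score = rephrasing_consistency(more, less)
```

Under this definition a model answering at random scores around 0.5, so that is the baseline to compare against, not zero.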
I think a naïve scale-free theory of intelligent agency is basically guaranteed to be incomplete for predicting things we care about.
My guess is that you can solve for some kind of equilibrium under selection and computational costs, and that will tell you:
- The timescale of planning that an agent does will be proportional to the turnover time of selection (e.g. companies can make ~10 year investments if no company could drive them to bankruptcy in ~10 years)
- The space scale of cooperation will be proportional to the space scale of selection (e.g. ants in a colony are selected together, so cooperate, mostly)
This doesn’t tell you how humans can work together in groups of millions to build a country which defeats other countries and starts a space program to colonize the moon. For that, you need a notion of inductive bias at a given scale: once you get a certain level of intelligence, you generalize your values “all the way” to some level of universality.
Yes, I imagine that a smart enough model will understand reward hacking in RLVR implicitly. I think it’s good if we’re able to (somewhat) yank around the model’s behaviour with verbal handles like “reward hacking”. A smarter model would have the same concept but no handle.
When you say SFT here, do you mean just SFT on a number of off-policy Alpaca-style datasets, or does it encompass all of the post-training with a cross-entropy loss? I.e. would system prompt distillation (by fine-tuning on a system-prompted model’s outputs) count as SFT in this post?
To say the quiet part out loud: perhaps your post-training team is just bad at safety RL. They’re certainly bad at noticing that their safety RL doesn’t do anything.[1]
The archetypical example of a poorly-safety-trained LLM is Bing Sydney, which (as I understand it) was trained using instruction RL, some refusal RL, and not much else. Vibes-wise, Gemini seems more Bing-ish than Claude or GPT models.
It’s also possible that the Petri audits you’ve run here don’t catch the RL-relevant properties.
[1] Of course this may be due to lack of resources, organizational and management capacity, staffing issues, or any number of other factors.
It might actually be bad to remove all mention of misaligned AI, reward hacking, etc. from model pretraining. Suppose RLVR is going to push models towards these behaviours anyway. If the model already knows what reward hacking is, then:
- The chain-of-thought might say “I am going to reward-hack now” and a monitor might catch it
- If we train the model with a constitution that says “reward hacking is bad” then it might reduce rates of reward hacking
If, instead, the model internally ends up referring to reward hacking as “crasting a malenky bit of cutter” or some suchlike then both of these methods become much more difficult.
Somewhat relatedly, Geodesic have found mixed results when it comes to up- and down-sampling misalignment-related data in pretraining, suggesting an inverse-U curve (a bit of misaligned data is worse than none at all, but once you have any misaligned data, you’re better off with more once post-training is applied). I think having zero misaligned training data is likely to be almost impossible once RL gets thrown into the mix.
Would you count Owain Evans’ group? Their work is doing fundamental LLM research in a way that seems important to aligning superintelligent LLM descendants. I’m not sure whether Owain himself is particularly MIRI-pilled.
Mostly it has been along the “unhobbling” axes. I have (I think) better instincts for probability and estimation, which only requires a little feedback; it’s mostly about getting in touch with one’s inner sim.
I can more easily spot flaws in arguments, particularly of the flavor “this evidence is far too strong, you have done something very wrong” e.g. “small Mistral models can find the same vulnerabilities in code that Mythos did”, which was obviously not true, even if Mythos was overhyped.
Most of this feels like my brain is “de-noised” in a way, e.g. better at searching through an argument for flaws. It’s like my thoughts are less grasping, less likely to stick to the first thing I notice or am presented with, so I’m better at thinking.
E.g. in the Mythos example, a worse version of me might have grabbed onto the claim that Mythos is overhyped and started arguing about Anthropic’s overall integrity, etc., instead of noticing that the authors had claimed that, on lots of their benchmarks, small open models were better than GPT-5 and Claude 4.5 and really no models were better than any other, which is obvious nonsense and discredits their entire blogpost.
I personally believe that my exposure to the rationalist community has made me significantly more intelligent and capable along a number of axes. It has also caused me to move from working in chemistry to working in AI alignment. Overall, this is probably a 10-100x multiplier on my impact on the future; it’s just that most people don’t have any impact at all.
This is a really very impressive feat for a web forum and Harry Potter fanfiction to achieve (N=1 but I doubt I’m atypical), but like, 1-2 OOMs isn’t much in the grand scheme of things. If everyone was as strongly benefited as I was, then yeah, the world would change in aggregate, but for a few hundred people (upper bound guess) I think you shouldn’t expect anything massive.
I also suspect that my specific personality factors are important here. I have very high g, which meant there was a lot of horsepower in my brain, which engagement with rationalist literature was able to focus. A lot of the benefits for me have been un-hobbling, though not all. I lack the ability or desire to work 60-hour weeks (or even 40-hour weeks, to be honest; I think about 30 hours of reliable productivity + 5 hours of sporadic insight might be a better description) and lack a certain insane agency which is somewhat common among rats (the kind that Mikhail Samin has, for example), though I’m still more agentic than most people I know. I think both of these things are slowly improving; perhaps given five or ten years I’d be able to be a real founder, but then again perhaps not!
Maybe a good model is that there are three or four important axes which define achievement. For any person, they have a raw skill factor (heavy-tailed) and a hobbledness factor (between zero and one), and you multiply all the parameters together to get someone’s overall score. Rationality is good at unhobbling but not so good at bumping up the heavy-tailed factors by an OOM.
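A rough sketch of that multiplicative model, with made-up numbers. It assumes one (raw skill, hobbledness factor) pair per axis, and that a factor of 1.0 means fully unhobbled; the function name and all values are hypothetical.

```python
import math


def overall_score(axes):
    """Multiply every raw skill and hobbledness factor together.

    axes: list of (raw_skill, factor) pairs, one per axis.
    Assumption: factor lies in [0, 1], with 1.0 meaning fully unhobbled.
    """
    for skill, factor in axes:
        if skill < 0:
            raise ValueError("raw skill must be non-negative")
        if not 0.0 <= factor <= 1.0:
            raise ValueError("hobbledness factor must be in [0, 1]")
    return math.prod(skill * factor for skill, factor in axes)


# Illustrative person with three axes (all values invented).
hobbled = [(4.0, 0.5), (2.0, 0.5), (1.0, 0.25)]
# Same raw skills, with every factor pushed to 1.0.
unhobbled = [(skill, 1.0) for skill, _ in hobbled]

gain = overall_score(unhobbled) / overall_score(hobbled)
```

Here the gain from complete unhobbling is 16x. Because each factor tops out at 1, unhobbling can never gain more than one over the product of the starting factors; anything beyond that has to come from the heavy-tailed raw skill terms.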
Andon Labs ran Fable through VendingBench, and caught it rationalizing its misbehaviors in a weird way. Zvi argues that this is worse than either the 4.7 game-playing behavior (this is a game, I will be aggressive) or the 4.8 refusal behaviour (even in a game, I shouldn’t price-fix). I agree.
Two hypotheses as to what’s going on here, neither of them good:
1. This is an interaction between aggressive RLed behaviour and friendly character-trained behaviour. One set of circuits (or shards, or traits) favours the production of friendly text, while another favours the production of text which gets a good RL score. They collide and produce friendly-looking text which gets a good RL score.
2. Anthropic has accidentally applied pressure to the CoT again, and this is the result. This could be through either a direct coding bug or some accidental indirect method which doesn’t look like it trains against CoT but, over a gazillion RL epochs, builds up to applying pressure on the CoT.
Either way, this is the flavour of behavior (if it occurred outside a game) that I might expect naïve untrusted monitoring to miss, even though it isn’t “scheming”.
You may not access or use, or help another person to access or use, our Services in the following ways: … 2. To develop any products or services that compete with our Services, including to develop or train any artificial intelligence or machine learning algorithms or models or resell the Services.
Nobody tell Anthropic I’ve been getting Claude Code to help me fine-tune Llama 70B...
In all seriousness, I dislike contracts with lines like this. I would prefer a line like “(If you are doing AI engineering with Claude) we reserve the right to ban you for any reason if we get bad vibes.” because at least it’s honest, and functionally these things mean the same thing.
You beat me to the punch! I (and possibly some colleagues) have a post in the works about “Model Organisms” in alignment and how they compare to Model Organisms in biology. Your taxonomy is good, but I think ablations/knockouts (e.g. helpful-only models, RLVR-only models, models without X piece of post-training) should also be counted here. I was going to call “constructed model organisms” “trait models”. I’ve also heard that Claude Fable(’s safety filter) hates the phrase “Model Organism” which further complicates choices of naming!
What is this “inauthentic behavior” ban that I keep seeing people complaining about? What counts as “inauthentic”? Has Grok been reading Jean-Paul Sartre?
I assume it’s actually to spot botting or paid posting or something like that?
Posts seem overly long at the moment. They get boring because the substance tapers out within the first 50%. Not sure why.
My first hypothesis is that this is an artefact of Claude 4.7+ being used for editing, i.e. maybe Claude 4.7+ is a decent editor but prefers overly long essays. But I have literally no evidence, as I don’t use AI for editing passes.
Rationality/research tip: if you find yourself getting nervous waiting for controls to come in, then consider whether you should have run those controls earlier.
It’s easy to, having acquired some promising preliminary results, immediately run a medium-sized test. If that works, you might excitedly run a big test, then a huge test. You might then go back and run some sanity-check controls. At this point, you’ve got quite a lot riding on your controls, so the process of waiting for the results to come in is slightly nerve-wracking. This tells you that you should have run your controls after the medium-sized test.
Is it possible that anything in the 3⁄32 to 1⁄8 range works just fine? My reference point for screws is mostly sinking wall plugs into plasterboard/masonry, where you quite often see plugs accepting a wide variety of screw sizes e.g. 3.5 to 5 mm. I imagine particleboard is less forgiving, but is it that much less forgiving?
I realized after posting this that I’m currently working on a metric which might distinguish Opus 3-ness from other models. I’ve requested Opus 3 access, and will report back once I have access (though I won’t infer much from a negative signal).

Looks like CAIS have found something similar wrt model scale:
https://x.com/CAIS/status/2067283549909287169
https://values.safe.ai/