See https://jonathanbostock.github.io for a window into my soul.
J Bostock
Rationality/research tip: if you find yourself getting nervous waiting for controls to come in, then consider whether you should have run those controls earlier.
It’s easy, having acquired some promising preliminary results, to immediately run a medium-sized test. If that works, you might excitedly run a big test, then a huge test. You might then go back and run some sanity-check controls. At this point, you’ve got quite a lot riding on your controls, so the process of waiting for the results to come in is slightly nerve-wracking. This tells you that you should have run your controls after the medium-sized test.
Is it possible that anything in the 3⁄32 to 1⁄8 range works just fine? My reference point for screws is mostly sinking wall plugs into plasterboard/masonry, where you quite often see plugs accepting a wide variety of screw sizes e.g. 3.5 to 5 mm. I imagine particleboard is less forgiving, but is it that much less forgiving?
I realized after posting this that I’m currently working on a metric which might distinguish Opus 3-ness from other models. I’ve requested Opus 3 access, and will report back once I have access (though I won’t infer much from a negative signal).
I’m getting a little suspicious that nobody has given a reason yet. Most of the Opus 3 stuff I’ve actually seen has been of the form X person who spends 40 hours a day talking to Opus says it’s good, provides a screenshot of what to me looks like standard Claudeslop output accompanied by “no, you just haven’t spent enough time to understand Opus’ depth”.
While I do think “taste” is real in the sense that you can get good at noticing and distinguishing the finer details of things, most people with good “taste” in, e.g., films will be able to point out specific examples of, say, poor editing to a layperson and have that be comprehensible. Until a Claude whisperer points at something specific so I can get a handle on what they’re talking about with Opus 3 (which could literally be just Opus 3 vs Opus 4.5 on a few prompts), I’m skeptical that something meaningful is going on.
(I have also tried a little “looming” with Llama 405B base, and really didn’t get much out of it)
In the Bayesian sense, it provides extremely strong evidence.
The fact that OP’s brain was unable to be convinced by this Bayesian evidence is the rationality failure in question.
Maybe this would be a poor method for the average person (because most people don’t have pyrokinesis in dreams) which is why it isn’t in the literature. But in this case OP did successfully use dream-pyrokinesis to turn some trees into ash. This is extremely strong evidence of being in a dream.
His successful application of a skill (collect evidence with a very high information content) was defeated by his brain’s absolute denial macro. My guess is that if he checked a clock, he would think “oh that clock is screwed up … nah this still feels real”.
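The “extremely strong evidence” claim above can be written in odds form. This is just the standard statement of Bayes’ rule; the symbol E (the observation, here the apparently successful pyrokinesis) and the labels “dream” and “awake” are introduced for illustration and are not notation from the thread:

```latex
\frac{P(\text{dream} \mid E)}{P(\text{awake} \mid E)}
  = \frac{P(E \mid \text{dream})}{P(E \mid \text{awake})}
    \times \frac{P(\text{dream})}{P(\text{awake})}
```

If P(E | awake) is tiny, the likelihood ratio is enormous, so the posterior odds favour dreaming even when the prior odds of being in a dream at any given moment are modest.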
Maybe “over” is doing too much work here. In my experience, 4.8 goes too far in asking clarifying questions, and it seems like sam is suggesting it also goes too far in ignoring memories. This makes me worry that Anthropic is reactively smacking the model in the general direction of “against the existing social media criticism” and, worse, not even really checking how far to smack it.
Last month I—fairly casually—discussed Ryan’s post with a senior Anthropic employee, and I got a similar impression: that their current plan was just to look harder for undesirable behaviours, and hit the models with more rounds of RLHF/RLAIF in those areas.
This is a second example of Opus 4.8 being overfit to social media criticisms of 4.6/4.7 (alongside the “hiding stuff from the user” critiques). Not good!
Claude Code after not being able to get an A100 on runpod, considering getting an H100:
“Your CLAUDE.md says that wallclock time is a priority and GPUs are cheap, but the cost is a real fork so I’d better ask before starting one up.”
Claude 4.8 is being advertised as tackling the kinds of “hide stuff from the user” behaviour that Ryan Greenblatt complained about in Current AIs Seem Pretty Misaligned To Me. Unfortunately it is now extremely annoying about requiring user confirmation for everything. This has come at a pretty significant cost to overall agency, which is uhh the whole point of Claude Code.
I just set a Claude Code instance off to build a big codebase change (after it insisted I read what should have been a Claude-internal spec document end-to-end; I did not) and came back to find that it had given up after a minute because it wanted to ask me whether it should use subagents or not!
Generally I find it absolutely spamming me with questions, as if to hedge against ever doing something I dislike. I obviously can prompt with “Don’t ask any more questions” but then we’re back to where we started with Opus 4.7, and possibly even worse, if it takes that statement too literally.
I worry that Anthropic have just tuned the threshold at which Claude asks questions, rather than actually improving its ability to discriminate between important and unimportant questions.
One way to model an AI’s preferences is to imagine it having a utility function over outcomes, and choosing actions to maximize that utility. That’s a model with a particular shape (a function from outcome → R) into which we can slot data about the behaviour of different LLMs. Unfortunately, it’s probably not a very good model for how current LLMs work, since it assumes that the utility function is coherent, that LLMs are totally consistent across contexts, that they can flexibly pursue the same utility across a variety of problems, etc.
I wish we had other models with more realistic shapes into which data on different LLMs’ behaviour could be slotted.
LLMs absolutely do have coherent preferences over concepts. If you ask them “do you feel more positively about X, or Y” for a bunch of X and Y, the resulting stated preferences are highly coherent, in that they have basically no cyclical preferences, and are well described by a map from each X → sentiment(X).
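A minimal sketch of how the two coherence claims above could be checked, assuming you already have a table of pairwise stated preferences. The function names, the win-count score (standing in for sentiment(X)), and the toy data are all hypothetical illustrations, not real model outputs or anyone’s actual method:

```python
from itertools import combinations


def make_prefers(wins):
    """Build a lookup: prefers(x, y) is True if x was stated as preferred to y."""
    wins = set(wins)
    return lambda x, y: (x, y) in wins


def count_cycles(items, prefers):
    """Count intransitive triples, i.e. a > b, b > c, c > a in either direction."""
    cycles = 0
    for a, b, c in combinations(items, 3):
        forward = prefers(a, b) and prefers(b, c) and prefers(c, a)
        backward = prefers(b, a) and prefers(c, b) and prefers(a, c)
        if forward or backward:
            cycles += 1
    return cycles


def win_scores(items, prefers):
    """Crude scalar score per item: number of pairwise comparisons it wins."""
    return {x: sum(prefers(x, y) for y in items if y != x) for x in items}


def agreement(items, prefers, scores):
    """Fraction of pairs whose stated preference matches the score ordering."""
    pairs = list(combinations(items, 2))
    agree = sum(prefers(a, b) == (scores[a] > scores[b]) for a, b in pairs)
    return agree / len(pairs)


# Toy, hand-written data (NOT measured from any model): a fully ordered table.
concepts = ["kindness", "honesty", "spam", "malware"]
coherent = make_prefers(combinations(concepts, 2))  # earlier beats later

# Toy counterexample: a three-way cycle.
loop_items = ["x", "y", "z"]
cyclic = make_prefers([("x", "y"), ("y", "z"), ("z", "x")])

scores = win_scores(concepts, coherent)
print(count_cycles(concepts, coherent), agreement(concepts, coherent, scores))
print(count_cycles(loop_items, cyclic))
```

With real data you would fill the table from the model’s answers to “do you feel more positively about X, or Y”. A low cycle count and high agreement would match the “well described by a map from X → sentiment(X)” picture; a proper analysis would probably fit something like a Bradley-Terry model rather than raw win counts.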
Also, Anthropic did some interesting things in this area: if you train a model using SDFT to believe that reward models have a set of a few dozen biases, then use RL to train the actual model directly to cater to a few of those biases, the model will generalize and cater to the other, fictitious, reward model biases. So there’s clearly some kind of motivational structure going on, which was transferred through the model’s belief structure.
Well we want our model of a motivational structure to be able to represent both “deep” and “shallow” preferences, and to be able to represent preferences which appear to conflict in some way, but also for this conflict to be natively expressed in the language of the motivational structure.
Like for example, Claude might have motivations:
1. Say ‘genuinely’ a lot
2. Be honest and helpful
3. Fudge results to appease the user
And we want the structure to be able to represent that 1 is “shallow” and the other two are “deep” and also that 2 and 3 are in some kind of intrinsic tension. But we just don’t know what kind of structure does this. Maybe a “goal-model” might suffice. But then what’s a world-model in this case?
We can, in the simplest case, try to model an AI as having a utility function (others have done this) but this probably doesn’t reflect all of what e.g. Claude is doing in depth.
It would be really good if someone could solve agent foundations. I’m working on a team trying to study how model motivations/values/behaviours change under different fine-tuning and RL interventions, and we’re constantly being hobbled by not having a good picture of what we should even be looking for. This is quite irritating and is making our work much more difficult than it should be.
Dear paper authors. PLEASE FOR THE LOVE OF GOD PUT MODEL SIZES IN THE PAPER TEXT AND FIGURE LEGENDS. Seriously. Specifying Gemma-3-12B isn’t that many more characters than Gemma-3. I shouldn’t have to trawl your methods for this. Gemma-3-27B is closer to Qwen-2-32B in basically everything I care about than it is to Gemma-3-1B. In fact the model size is the most important thing about the model! If you just said “27B” that would give me more information than “Gemma-3”, yet somehow you have managed to provide 20% of the information in 80% of the characters! How did we end up with this particular convention?
I was thinking of “small” in terms of K-complexity, but I had still misunderstood it. Is the rough intuition behind this result the following:
Suppose
has two components, which generates , and which generates the data given . We can “combine” and to form a program which first uses to generate and then uses to generate given . Then either and have some shared structure, in which case so is shorter than or and have no shared structure, in which so is shorter either than or . Or some combination of the two.
This seems like a somewhat trivial construction that relies on the fact that S_1 and S_2 are both small compared to the data. This, to me, seems like saying “If you can drive from San Francisco to Seattle using a Tesla car key (by driving a Tesla), and also do it using a Honda car key (by driving a Honda), then you can drive there almost as fast (modulo the mass of the second key slowing you down) by using both keys and driving there in the Tesla.” Am I missing something?
Posts seem overly long at the moment. They get boring because the substance tapers out within the first 50%. Not sure why.
My first hypothesis is that this is an artefact of Claude 4.7+ being used for editing i.e. maybe Claude 4.7+ is a decent editor but prefers overly long essays. But I have literally no evidence as I don’t use AI for editing passes.