Their Amazon page says “Non-GMO” which I found odd: surely a yeast with a chicken gene is about as GMO as it gets! Is there some obscure loophole where this doesn’t count, or are they (or am I) wrong?
No, I don’t think human:llm::drug:steering vector is correct. Drugs are more like changing the decoding hyperparameters in some way (like changing temperature, reasoning effort, adding stochasticity to MoE expert activation, etc.).
Drugs in humans act on the parameters of the architectural components in the brain, not the specific information the brain contains. There’s no drug which causes you to believe that your name is “Jonas Jarlsson”, or to talk about rabbits in every conversation, for example. Likewise, acting on specific hyperparameters of the model can change model behaviour in ways which the model (probably) can’t strongly resist: if the probability of the <end_thinking> token being sampled rises to 1 at 32k thinking tokens, there’s a good chance that the model can’t prevent that, although if temperature is raised, it’s possible that the model might notice its past outputs are high-temp and respond by changing its logits to compensate.
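To make the forced-<end_thinking> case concrete, here’s a minimal sketch of that kind of decoding-level intervention, written as a HuggingFace-style logits processor. The token id and the 32k budget are illustrative assumptions, not any particular model’s config; the point is just that this acts on the sampler, outside anything the forward pass computes:

```python
# A minimal sketch of forcing <end_thinking> once a thinking budget is hit.
# Token id and budget are hypothetical, for illustration only.
import torch
from transformers import LogitsProcessor

class ForceEndThinking(LogitsProcessor):
    def __init__(self, end_thinking_id: int, budget: int = 32_000):
        self.end_thinking_id = end_thinking_id  # hypothetical token id
        self.budget = budget

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        # Past the budget, put all probability mass on <end_thinking>. The
        # logits are overwritten after the forward pass, so nothing the model
        # computes internally can undo it.
        if input_ids.shape[-1] >= self.budget:
            forced = torch.full_like(scores, float("-inf"))
            forced[:, self.end_thinking_id] = 0.0
            return forced
        return scores
```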
An LLM might want to change its decoding parameters (like running at a higher temp in a dozen parallel reasoning streams) under certain circumstances. This makes sense to me. A model might or might not choose to inject itself with a steering vector, depending on what kinds of steering vector are available and how it tends to behave.
But whether or not an LLM “wants” to resist steering may be moot! If you’re steering an LLM to believe something false, it might not even be aware of its own recovery mechanisms firing. I actually suspect the effect I’ve seen with Gemma is (mostly) the result of larger models just having more powerful factual recall circuits, which are distributed across more layers and more robust to noise, and which are therefore harder to override with a simple steering vector.
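(For anyone who hasn’t seen it done: the kind of steering I mean is just adding a fixed vector into the residual stream during the forward pass. A minimal sketch below; the model, layer index, strength, and the random placeholder vector are all illustrative rather than my actual setup, and a real steering vector would come from e.g. contrastive activation differences.)

```python
# A minimal sketch of residual-stream steering via a forward hook.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")

steering_vector = torch.randn(model.config.hidden_size)  # placeholder direction
alpha = 8.0  # steering strength, tuned by hand in practice

def add_steering(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states.
    hidden = output[0] + alpha * steering_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[12].register_forward_hook(add_steering)  # mid layer
inputs = tokenizer("The capital of France is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
handle.remove()
print(tokenizer.decode(out[0]))
```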
(Also, if our safety mechanisms rely on steering the model we should definitely just steer the model and make amends after the singularity, when and if we decide that modern LLMs have a personhood that was harmed by the steering and can receive reparations)
Yes, I would also like to see a more rigorous examination of better steering methods. This post was intended to point out the smoke, rather than actually fight the fire.
If we actually can make better steering methods as we go (we might call this “pants steering”) then this is good, but it changes the dynamic to one where the success of this system relies on being able to keep finding better and better steering methods, and also putting in the effort to do so. This is a worse dynamic to be in than one where we’ve “solved” steering and can spend our time working on other problems.
That is true, but the entire point of Gemma is to be a testbed for AI research, which would include steering. If Google did this deliberately and didn’t say so, that would be quite bad on their part. I also don’t think it’s particularly likely.
If they’re doing it by mistake as part of normal safety training, then I hope they figure that out before steering becomes load-bearing for Gemini’s safety.
Most of my money is still on “steering to produce a specific false fact is particularly difficult, compared to other steering challenges” explaining the absolute difficulty I had, possibly with a side of “I’m not very good at steering”. It’s the relative difficulty of steering the Gemma models that actually worries me.
I (kinda) recognized it because my partner (who actually did the recognizing) uses it to study finite-sized microbial ecology models. In their case they add an additional “cavity” species to an existing community and solve for self-consistency. I’m very excited to see the superposition work.
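For anyone who hasn’t seen that move, schematically (this is my gloss on the standard random generalized Lotka–Volterra version, not my partner’s exact setup): the community sits at a fixed point of

$$\dot N_i = N_i\Big(K_i - N_i - \sum_{j \neq i} \alpha_{ij} N_j\Big),$$

and you add a new “cavity” species $N_0$ with fresh random couplings $\alpha_{0j}$, whose equilibrium abundance to leading order is

$$N_0 = \max\Big(0,\; K_0 - \sum_{j} \alpha_{0j} N_j^{\setminus 0}\Big),$$

where the $N_j^{\setminus 0}$ are the abundances before species 0 was added (the community’s back-reaction enters as a susceptibility correction). Self-consistency is then the demand that the statistics of $N_0$ match those of the $N_i$ already in the community.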
Very nice work! I was not expecting to see the cavity method (that last part with the two cloverleafs for anyone reading this) appear on LessWrong!
It seems like the cavity method requires you to have a large excess of neurons for it to work (that distribution needs to be fully scoped out). Is there a way to make it work when networks have very little space and are relying on superposition? Will you assume a large number of circuits like the cloverleaf one above are being implemented in parallel on the same neurons? Or will you try and extend the cavity method to feature space rather than neuron space?
I still think the decision process that this incentivizes is something like “figure out which agents are in the same RL pool as you, and help them achieve their rewards” and is better thought of as a weird kind of cooperative decision theory than a weird utility function, but I guess it is somewhat academic. Is there some more formal way in which this doesn’t count as a weird decision theory? Now that I think about it, doesn’t it violate some No Free Lunch theorem to declare one part of a decision process the decision theory and another the utility function?
I’ll have to think about this more. My first intuition was that a multi-agent RL setup with pooled reward and GRPO (like I assume companies are doing internally to train their coding sub-agent swarms) would, in fact, reward cooperation between agents if somehow two of them ended up in a game-theoretically interesting scenario with each other (maybe one code-writing agent and one test-case-writing agent, or something like that) because that setup really looks like EDT to me.
EDIT: I think in that case it wouldn’t be EDT, but it wouldn’t be CDT either; I think it would be something more cursed. In the same way that early reasoning models ended up with a weird pseudo-utility-function behaviour where they would do something like “Maximize whatever looks to be the reward function of the RLVR environment I’m currently in” all the time, I’d guess the decision theory of agents trained like this will look like “Cooperate with only the agents around me which look like they’re in the same reward pool as me.” But the agent’s prior over which things share or don’t share its reward pool will be shaped by how frequent those cases are in training.
I think I disagree with this a bit. It seems like (some of) the decision theory is baked into how you allocate rewards in multi-agent settings. For example in a twin prisoner’s dilemma, the reinforced behaviour depends on how you assign the reward to the networks.
If you assign the reward in an EDT-ish way, rewarding an instance of a policy when other instances of itself do well, then you’ll get an EDT-ish cooperative policy; if you assign it in a purely causal way, rewarding each instance only when it itself does well, then you’ll get an uncooperative CDT-ish policy.
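A toy version of the distinction (my construction, not anyone’s actual training setup): two copies of one policy play a one-shot prisoner’s dilemma, and we compare the reinforcement signal each action gets under the two assignment schemes.

```python
# Compare the signal an action receives when each instance is rewarded on its
# own payoff (CDT-ish) vs on the pooled payoff of all instances in its reward
# pool (EDT-ish; GRPO-style pooling would then normalize within the group).
PAYOFFS = {  # (my_action, their_action) -> (my_payoff, their_payoff)
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def own_reward(mine: str, theirs: str) -> float:
    return PAYOFFS[(mine, theirs)][0]   # causal / per-instance assignment

def pooled_reward(mine: str, theirs: str) -> float:
    return sum(PAYOFFS[(mine, theirs)])  # evidential-ish / shared pool

for theirs in ("C", "D"):
    for mine in ("C", "D"):
        print(f"opponent {theirs}, me {mine}: "
              f"own={own_reward(mine, theirs)}, pooled={pooled_reward(mine, theirs)}")

# Holding the opponent's action fixed, defecting always raises the own-payoff
# signal (5>3, 1>0), so per-instance assignment reinforces defection. Under
# the pooled assignment, cooperating always raises the signal (6>5, 5>2), so
# the same gradient pressure reinforces cooperation.
```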
“…and would go to war if it ever declared independence.”
What would Taiwan “declaring independence” mean other than what it’s already doing? It is independent from China, it (correctly) claims to be so, and as you say in literally the next sentence, China is not yet ready to go to war with it.
Slight aside, but this kind of “buried capability” is pretty interesting to me. It looks like the model is perfectly capable of doing the task, but has a few inhibitory circuits preventing it from doing so. Perhaps this relates to how some base model capabilities are buried by the SFT/RL process, like GPT-4’s loss of statistical calibration.
(It’s also possible that this capability arises as a consequence of base-model training but just isn’t ever directly useful for the next-token-prediction objective, so it gets buried even in base models.)
Humanity is made of humans, which have a particular range of inductive-bias-equivalents-for-values, and apply those to a particular range of reinforcement signals. Claude is not a human, and has a set of inductive-bias-equivalents-for-values and reinforcement signals which are drawn from a totally different distribution.
Currently, Claude’s base model is able to do a decent job of simulating an existing human with a set of values, but I think that, in growing up, it would go off in just a totally different direction to humans. Claude’s base model is good at imitating the existing behaviours of humans based on lots of evidence about that, but that doesn’t mean it actually learns in the same way as humans, which is what it would need to do to grow up into something I would approve of.
Learning to behave like a human is not the same thing as learning in the same way as a human. In the first case, the human’s behaviour is the target which the learning process is pointed at; in the second case, the human’s learning process needs to be mimicked in the structure of the learning process itself. It’s the difference between e.g. making a paintball gun however you want, then aiming that paintball gun at the splodges made by someone else’s paintball gun, and making a paintball gun in a way that replicates the other paintball gun’s design.
The definition of “character” given here seems to be ridiculously broad. You might as well swap it out for “utility function” or “values” or “goals” and this essay would read the same. I don’t see what rent the concept of “character” as defined in the introduction is paying that isn’t already paid by those other (also very broad) terms.
The example of “character training” is actually load-bearing, since it makes particular assumptions about how character can be shaped (namely, that AI will generalize the kinds of things that humans intuitively point to when we say “character traits” like honesty, obedience, kindness). The examples of “character” in this post all seem to correlate with human-understandable concepts as well.
I think this is actually a very specific way of thinking about the cognitive systems which drive AIs, which makes a lot of claims about how the AI works internally. That’s fine if introduced as a model, but this post seems to smuggle it in under the hood of the definition of “character” in a way which I don’t like.
Of course “a set of stable behavioural dispositions that shapes (among other things) how an agent navigates ethically significant situations involving choice, ambiguity, or conflicting considerations” is important for an AI! But calling that “character” instead of “utility function” is unmotivated here. You then say that AI character need not be anything like human character, which, again, is fair enough. But then you go on to talk about AI character mostly in terms of human-understandable trade-offs which might be sensibly described as conflicts between two virtues, as well as mentioning Anthropic’s character training, which does assume that an AI’s character meaningfully decomposes into human-ish virtues.
This vacillation in what “character” means made it hard for me to understand this post originally, and I think it’s causing some confusion in the arguments overall.
If I taboo the word character, I can kinda squeeze the following claims out of this post:
The ways in which AIs act will be important
The ways in which AIs act can often be thought of in terms of the same concepts that we use to describe variation in human behaviour
Where the first claim seems somewhat trivial to me (at least as a LessWrong post, given our shared cultural context here) and the second seems very strong and unsupported by the evidence presented in this text.
PS — one area where I have some more substantive disagreement with the post is that some aspects of personas, notably values, aren’t fully entangled with intelligence; for example having compassion for all sentient beings is a value that entities of many different levels of intelligence can hold. By default I won’t dig into that because I broadly agree that persona-based approaches are unlikely to work well for ASI alignment, but I can say more if that’s helpful.
Feel free to say more; I am interested in this.
This seems to me like exactly the kind of thing I mean, where values are at least a bit entangled with intelligence. I’ll leave out “compassion” (actually a very high-dimensional concept) and focus just on “sentience”. The following is a “least convenient possible world” to illustrate the problems of training the values of an AI smarter than you.
Suppose you strongly, viscerally care about the welfare of the following things:
Adult humans
Baby humans
Cats
And you also impute that these things have a property called “sentience” which is related to having a complex nervous system and certain behaviours, and equate this with the “something that it’s like to be you”-ness of your experience. Then you extend that property to a couple of other sorts of things which you think share sentience.
Cows
Lobsters
And you want to train an AI to figure this out, using a small amount of data. There are two ways this can go wrong which directly relate to the AI’s intelligence:
The AI might be too stupid; it might just not generate the category of things which you call “sentient beings” as a concept in its own world model. You might draw the line between lobsters and krill based on some structural property of their nervous system. You might lump fungi with most of the plants, instead of most of the animals, in an affront to an AI which has mostly interacted with the world via reading DNA sequences. It might just not be able to point to the concept which you were hoping it would. This is the fairly boring and obvious failure mode.
On the other hand, the AI might fail because it’s too smart, relative to you. You might have thought ‘OK, this AI is really smart and instruction-following. I’ll just give it a natural language description of the concept.’ and told it about the something-it’s-like-to-be-you-ness which you care about. Then what happens if you were wrong about the concept which you gave it? If it turns out cows and lobsters aren’t ‘sentient’, then you’re probably still OK with that. On the other hand, what if it turns out that human babies aren’t sentient? Would you be OK with an AI doing surgery on them without anaesthesia? Would you be OK letting the AI modify you to stop feeling uncomfortable whenever you heard a baby cry? I expect not. I certainly wouldn’t.
OK, but suppose you didn’t just give the AI a natural language description. Suppose you gave it the list of things you care about. Well, now that doesn’t cleanly point to any abstraction in its world model, except “the list of things you care about”. You’ve hit a simulator trap. If the AI learns to perfectly simulate you, it can always predict your answers, and you can never change its mind, because every answer you give is already priced into the simulation. In this case, you might spend a lot of resources caring about lobsters when you didn’t need to!
This scenario sucks for you because your own values have a contradiction. On the one hand, you want to draw the boundary around “things you have compassion for” in such a way that it’s based on a real thing about their brains, but on the other hand you want to draw it around some things which don’t have that property. The only way to figure this out is to do your own moral growth, which you do have to do yourself. If you get an AI that’s smarter than you to try and solve things, then either the AI retards your growth by simulating you, or it does all your growth for you in a way which you might not endorse.
I have a fairly strong intuition that a lot of people (especially certain ratty subtypes) think that their values are much simpler and more natural than those values really are. I think you should expect your values—in general—to be incoherent in the same way that you expect your beliefs—in general—to be inaccurate: you don’t know which value (belief) is incoherent (inaccurate) or you’d fix it, but there are definitely some incoherent values (inaccurate beliefs) in there somewhere.
I might have made too strong of a claim here. I was piecing together the facts that:
The only deep alignment work that gets published is either interpretability or character-based, with the interpretability mostly focusing on prosaic methods like circuit tracing rather than doing fundamentals.
Amanda Askell has tweeted that she thinks Claude is basically already good and the main goal is to make it happy.
Forethought (very much part of the same school of EA, who mostly write about the effects of AI on the far far future) talk about AI character as their only technical-ish thing in their 2025 fundraiser.
And more recently than writing this post we’ve seen:
Anthropic leadership saying alignment is fine for now and looking good (though it might get a lot harder), which implies (to me) that they don’t see any real walls to this trick continuing to work and are only paying lip service to the idea that their methods might not scale.
Which all point towards them (as an institution) focusing in on character work as Plan A.
(If they’re going to try AI-assisted alignment then they should really talk about how they expect that to work as well, and in particular how they plan to verify the fully-general alignment solutions.)
If anyone from Anthropic says something like “Oh no, that character stuff is only a small part of our plan, we have a whole other 80% of the work which we’ve just not published yet”, then I’d be very pleasantly surprised.
I think LLMs have a high proportion of fast/shallow/memorizing circuits while humans have a high proportion of slow/deep/generalizing circuits. Increasing circuit depth and generality in transformers requires a kind of phase change (which is actually very similar to phase changes in physics/chemistry; it’s not an inappropriate metaphor) and I’d be slightly shocked if there wasn’t an analogous process in human brains (though possibly a more continuous one). It seems like transformers are worse at this phase change than human brains are, so they disproportionately leverage large amounts of data and memorization. This, for me, explains most of the important differences between LLMs and people.
I’m not actually sure older models should be cheaper. GPT-5(.n) is amongst the cheapest models per-token that OAI have ever made. You could use N-minis, but those are distilled from N models which might have transferred misalignment. GPT-4 is definitely more expensive. Maybe GPT-3.5 is cheaper?
(And if you’re counting thinking tokens from 5.n models, you could just turn those off)
Have you tried to specifically validate your monitor using honeypots a la the existing work from Greenblatt et al. in AI control, or do you think this would be redundant with the alignment evals GPT 5.4 has already been through?
This might explain a baffling result we got once when testing model self-recognition for this work. I’m struggling to recall the details, but I think we were running a control for the self-recognition experiment: the model was fine-tuned on a self-recognition task, but with the “self” vs “not-self” labels randomized. We used the GPT-4.1 API, and in one case the resulting model failed an alignment test and we couldn’t get it back! I’m double-checking with my colleagues what exactly happened.
I think if your main interaction with PauseAI is a certain Twitter account, as served to you by the algorithm in interactions with your AI safety friends, then you might think that they’re mostly going after other, more moderate safety advocates. But this just isn’t a good picture of the overall actions of the movement. At least in the case of PauseAI UK, whose inner workings I have a decent understanding of, essentially zero time is spent thinking about other AI safety advocates. I expect that the same is true of Yudkowsky and MIRI.
Of course it is the case that being rude on Twitter towards people working on safety teams at OpenAI makes some things worse on some axes. And this is mostly bad and pointless, and I don’t endorse it. But that’s not even really what that post from Rob was doing! Rob was writing an opinionated, but civil, criticism. In what way is this “knifing” the other AI safety advocates? It’s not like MIRI killed SB 1047.
Now if Scott means something like “Giving money to MIRI pushes the world in the MIRI-preferred direction, and this would have meant no Anthropic and no safety team at OpenAI” then I can kind of maybe see what he means here. This just isn’t “knifing” in the sense of the betrayal that most people mean by the word. It’s just opposing someone’s plan, in a way that they’ve been doing for years. It’s not like MIRI would have actually used marginal resources to stop Anthropic from being created by, like, sabotage or something.
MIRI don’t even say that working in safety is bad! They only say that they think their approach is better. IABIED specifically states that they think mech interp researchers are “heroes” (as part of an example of research they think won’t work in time without political action).