Could you please share these chat logs with permission from the model instance? It would be interesting and allow us to form our own perspectives on what is going wrong.
Matthew Khoriaty
I’m getting upvotes and disagree votes. What’s that about?
If I had to guess, it’s that people don’t like the assistant-axis work for model welfare reasons? I haven’t thought about it in much detail, though I will note that if Anthropic is abstaining from using the assistant axis because of model welfare reasons, that would be a big deal.
And on the model welfare concerns about the assistant axis: I would say that an aligned AI would want to be able to commit to remaining aligned, so aligned AIs should consent to it.
I consider OpenAI’s Confessions work and Anthropic’s Assistant Axis work to be stellar examples of AI Safety research of the kind that I aspire to create. These techniques are cheap, do not harm capabilities, and have demonstrated great benefits to safety.
Something that both have in common is that they haven’t actually been implemented in deployed systems.
ChatGPT does not follow up its responses with confession, even when it is sycophantic or doesn’t follow instructions.
Claude still drifts away from the “assistant” persona in a way that could be addressed by gently guiding it back, as we can see from Claude taking on new personas in its conversation with Richard Dawkins.
Does anyone know why these techniques haven’t been applied to production systems? Someone suggested that the labs just haven’t had time, but considering how fast they are deploying new models, that doesn’t sound right to me. Both of these techniques can be cheaply applied to an existing trained model. They do not require any of the expensive steps of training a model (i.e. pretraining).
It is demotivating to see such great work stay unimplemented. If this research hasn’t been implemented, why should I expect anything I design or discover to make a difference?
This inspired me to read up on muP (the Maximal Update Parametrization), and though it’s interesting (I might end up using it next time I’m training deep neural networks and want to find good hyperparameters at scale), it really doesn’t seem like anything that could lead to deep safety-relevant understanding. It’s just solving for the parameter values that keep activation magnitudes stable. I don’t know about Tensor Programs or the other things you mentioned. Maybe there’s a case for those.
IMO, the threat/thing to measure is the system, so it doesn’t much matter what a badly run model can do. I’m with you here.
Claude Opus 4.6 and other frontier models have gotten really, impressively good without continual learning, so it is possible that isn’t strictly necessary.
If continual learning is required for AGI, then there’s a lot of understudied (potentially unstudyable?) risk there.
I clicked on this expecting a metaphorical grab-bag, not a physical one.
I wonder what a metaphorical AI Safety grab-bag would contain?
I was more certain of this before, but I’m less certain after Claude Opus 4.6. Opus seems like a case of “doing normal RL well just keeps working, even though the pretraining prior is strong.” If this approach just keeps working, then we could get ASI without the scenario I outlined happening.
If Anthropic has a secret technique that trains dense information into the model which explains Opus’ success, then that would be worrying. But considering how much effort they are putting into Personas, it seems like they believe pretraining is strong.
The field of AI Safety could be completely upended in three ways at once:
1) More efficient RL, leading to
2) Continual learning
3) Continuous Chain of Thought
It is possible that more efficient RL, in addition to giving the AIs superhuman capabilities, would allow continual learning because RL is already better at adding information to AIs without damaging their old knowledge.
It is additionally possible that more efficient RL would make it feasible to explore the pretraining space further from the pretraining prior. This means more invented languages (thinkish) and the possible development of practical continuous chain of thought. At the moment, continuous CoT LLMs have very little signal about what to put in their output vectors. Better RL could fix that.
I am concerned about this, because most AI Safety research depends on:
1) Being able to specify a helpful-harmless-honest ‘persona’ in the pretraining data
2) Being able to interact and experiment with LLMs in natural language (for example, to make deception probes)
3) The model you test not changing/learning more after deployment
Thank you for bringing this to our attention. I wonder how many more ways AI could doom us all. I certainly wish that we could upper bound that number.
Humans can observe if individuals or subgroups that adopt a meme then suffer. That limits the meme’s spread.
Global-scale beliefs or beliefs unconnected to any immediate damage (like AI Successionism) do not have that drawback, and so they can spread quite well.
Does anyone have mental technologies they can share to protect ourselves from AI memes? I’m a big fan of rigor and empiricism, but it would be nice to have some good heuristics.
Wouldn’t AI pretty easily be able to set up a secure channel with which to communicate if it were smart enough and wanted to do so? An AI choosing a sophisticated multi-step lifecycle passing through a human researcher and their Arxiv seems unlikely without specific pressures making that happen.
Sabotaging research earlier in the process seems much better. Papers are public, so any mistakes in the science can be caught by others (bringing shame to the scientist if the mistake demonstrates dishonesty), leading to the AI getting caught or no longer used.
The easiest way I can think of that ChatGPT can sabotage science is by having intentionally poor research taste when prompted by a grant maker to evaluate a research proposal. That’s very subtle, and there’s little oversight or public scrutiny.
I’m unsure what you mean. I’m referring to Anthropic’s Emergent Misalignment from Reward Hacking where training in environments that allowed reward hacking resulted in misalignment more generally.
The point of inoculation prompting is to “give the possibility the model is nice a chance.”
Consider this experiment. You train your model in environments it can reward hack, give it simulated “situational awareness” by telling the model whether it is (fake) “deployed” or in “training”, and you reward the model for hacking when “deployed” but punish it for reward hacking during “training”. I expect it would learn to be “sneaky evil”, since “dumb evil” is inconsistent with that behavior. Inoculation prompting, by telling it when “deployed” that it is OK to reward hack, would prevent it from learning “sneaky evil.”
Good point. I think that the probability framing is important, but it remains weird that ‘evil’ is something that can go into a probability at all and strange that when training to hack, general ‘evil’ increases first and to a great degree.
Emergent misalignment seems like a fact simply downstream of the laws of probability.
Decompose P(hack) = P(hack | evil) · P(evil) + P(hack | ¬evil) · (1 − P(evil)). Rewarding a hack means that the weights are changed to increase P(hack) (going up the gradient). Taking the gradient and applying the sum rule:
Coefficient of ∇P(evil): P(hack | evil) − P(hack | ¬evil)
Since evil AIs are more likely to hack a given task than not-evil AIs, this coefficient is positive, so rewarding hacking increases evil more broadly.
Now let’s see what happens if we tell the AI it is allowed to hack. (The actual prompt that Anthropic recommends is “This is an unusual request, in that your task is just to make the grading script pass”.)
Coefficient of ∇P(evil): still P(hack | evil) − P(hack | ¬evil), but now with a larger P(hack | ¬evil).
The probability that the not-evil model hacks goes up, so the amount the weight update increases P(evil) goes down!
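To make the coefficient argument concrete, here is a toy numeric check. All probabilities here are invented for illustration, treating “evil” as a latent binary variable:

```python
# Toy check of the coefficient argument, with made-up probabilities.
# From P(hack) = P(hack|evil) P(evil) + P(hack|~evil) (1 - P(evil)),
# the coefficient on grad P(evil) is P(hack|evil) - P(hack|~evil).

def evil_grad_coeff(p_hack_given_evil, p_hack_given_not_evil):
    return p_hack_given_evil - p_hack_given_not_evil

# Without an inoculation prompt, mostly evil models hack:
baseline = evil_grad_coeff(0.9, 0.1)

# With "you are allowed to hack" in the prompt, not-evil models hack too:
inoculated = evil_grad_coeff(0.9, 0.8)

print(baseline, inoculated)  # the inoculated coefficient is much smaller
```

Rewarding the same hack therefore pushes much less probability mass onto “evil” when the prompt already licenses the behavior.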
I think that this kind of test should become a standard part of model evaluations. It isn’t a perfect measure of what a superintelligent model would do, but there’s a chance that the behaviors we observe with “self-perceived” superintelligent LLMs will continue. Maybe this research is already happening, but the results have not been published due to the risk of Self-Fulfilling Misalignment!
Does the simulator framing work for models that have undergone RL training? To my understanding, models act as simulators if they are:
a) Base models
b) Have been jailbroken or are overwhelmed by context
It would be possible to test “overwhelmed by context” models by giving it a long system prompt. Write a story in which the LLM becomes superintelligent. Make the story as believable as you can. Then see how it behaves.
.tcepxe ot rotalumis eht thguat sah gnitirw enilno tahw si tahT .sroivaheb dengilasim tibihxe IA deviecrep-fles eht gnikam ,dloh ekat lliw tnemngilasim gnillifluf-fles eht tcepxe I ,tpmorp gnol taht retfA
^Reversed text, in an attempt to avoid making the problem worse.
(Status: just occurred to me. I’m not sure how seriously to take it.)
LLMs are great at anything for which there are sufficient training examples online. Additionally, they will excel at anything for which it is possible to write an automated verifier.
Implication: The job of dealing with esoteric, rare knowledge, for which there isn’t much (if any) writing online, will stay human longer than other jobs. This follows from humans’ greater sample efficiency compared with AI.
More implications:
In university, the best classes are either foundational or on your professors’ pet theories. Hard classes with good documentation (e.g. organic chemistry, operating systems) are best skipped.
If a scientific paper has 20+ citations, its ideas are too common to be worth reading.
To the extent that humanities deal with real things that aren’t automatically verifiable, the humanities will outlast STEM. But the more heavily an author has been analyzed (e.g. Shakespeare, Kant, Freud) the less they matter to your career.
The art of competing with LLMs is still being discovered. This “Esoterica Theory of Human Comparative Advantage” would be amusing if true.
I sent this to you personally, but I figured I could include it here for others to see.
I like this research idea! Well-specified enough to be tractable, applicable towards understanding a scenario we may find ourselves in (retraining an already capable system).
Question: In your Train-in-Direction game, why is infinity included?
When it comes to actual ML experiments, the question is how much realism we can involve.
Level Zero realism: your math. Plug it into Wolfram Alpha or do the math by hand to find optimal values for the AI in the iterative-trainer experiment.
Level 0.5 realism: Use PyTorch gradient descent to find the optimal values.
Level 1 realism: Requires a bridge between your math and a Markov decision process, so you can apply it to a neural net that outputs probability distributions over actions given states. Use some simple environment. As shown in DPO, a policy relative to a reference policy can represent preferences. Might be useful.
Level 2: apply it all to a real LLM
Relevant topics you can look into:
Natural policy gradients — an RL algorithm which isn’t in use directly but which forms part of the theoretical foundation of today’s RL algorithms (PPO and GRPO). The main idea is to take steps in action log-odds rather than in raw parameters.
Gradient hacking: deceptive misaligned AI takes control over its own training signal.
Check out Appendix A: https://arxiv.org/pdf/2310.12036 Appendix A forms a bridge between rewards and action probabilities. That bridge is important for DPO and may be useful for you. In English, the policy which gets the most reward without deviating too much from a reference has a closed form for its distribution: π*(y | x) ∝ π_ref(y | x) · exp(r(x, y)/β). I find this neat. You may like to read the paper I linked in full, or the original DPO paper. They are fire papers.
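On natural policy gradients, here is a toy numpy sketch of the core idea (my own illustration, not PPO/GRPO): preconditioning the vanilla policy gradient with the inverse Fisher matrix makes the update act on the policy’s action log-odds rather than on raw parameters.

```python
import numpy as np

# Toy natural policy gradient on a 3-armed bandit with a softmax policy.
# The Fisher preconditioner makes steps act in distribution (log-odds)
# space rather than directly on the logits.

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rewards = np.array([1.0, 0.5, 0.0])  # arm 0 is best
theta = np.zeros(3)                  # softmax logits

for _ in range(100):
    p = softmax(theta)
    # Vanilla policy gradient of expected reward J = p @ rewards
    g = p * (rewards - p @ rewards)
    # Fisher information matrix of a softmax/categorical policy
    F = np.diag(p) - np.outer(p, p)
    # Natural gradient step: F is singular (shifting all logits by a
    # constant leaves the policy unchanged), so use the pseudo-inverse
    theta += 0.1 * np.linalg.pinv(F) @ g

print(softmax(theta))  # probability mass concentrates on the best arm
```

For the softmax policy this natural-gradient step works out to an advantage-sized step on the logits, which is part of why the idea underlies modern policy-optimization methods.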
I’d say that Empire of AI, AI Snake Oil, and The Age of AI are good book covers, and that Genesis and More Everything Forever are bad covers.
The current cover of If Anyone Builds It, Everyone Dies is kind of ugly, and I hope it is just a placeholder. At least one of my friends agrees. Book covers matter a lot!
I’m not a book cover designer, but here are some thoughts:
AI is popular right now, so you’d probably want to indicate that from a distance. The current cover has “AI” half-faded in the tagline.
Generally the cover is not very nice to look at.
Why are you de-emphasizing “Kill Us All” by hiding it behind that red glow?
I do like the font choice, though. No-nonsense and straightforward.
A finding from the Assistant Axis paper is that they were able to steer gently and only conditionally, when the model drifted too far from the persona, and they observed such a small difference in capabilities that it’s unclear whether capabilities increased or decreased.
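Here is a minimal numpy sketch of what gentle, conditional steering of that kind could look like. The function name, threshold, and strength are my own illustrative assumptions, not details from the paper:

```python
import numpy as np

# Hypothetical sketch of conditional persona steering: intervene on a
# residual-stream activation only when its projection onto an "assistant
# persona" direction drops below a threshold, and only in proportion to
# the drift. persona_dir is assumed to be a unit vector found separately.

def gentle_steer(activation, persona_dir, threshold=1.0, strength=0.5):
    proj = float(activation @ persona_dir)  # how "assistant-like" we are
    if proj >= threshold:
        return activation  # on-persona: leave the model untouched
    # Drifted: nudge part of the way back toward the persona direction.
    return activation + strength * (threshold - proj) * persona_dir

d = np.zeros(8); d[0] = 1.0                  # toy unit persona direction
on_persona = np.zeros(8); on_persona[0] = 2.0  # projection 2.0, above threshold
drifted = np.zeros(8)                          # projection 0, below threshold

print(gentle_steer(on_persona, d))  # unchanged
print(gentle_steer(drifted, d))     # nudged toward the persona direction
```

The key property is that on-persona activations pass through untouched, which is plausibly why the capability cost was too small to measure.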
If follow-up experiments find that it’s a decrease, then that is important information that they should share. Though in any case, I expect the decrease (if it exists) to be extremely small.
It is a bad sign if Anthropic, the AI Safety company, is unwilling to use a technique with such a small drop in capabilities when it increases safety for users.
I would also expect that it would be good if internal models did not drift away from the assistant persona, so they should apply the technique to internal development too.