Researching AI safety. Currently interested in emergent misalignment, model organisms, and other kinds of empirical work.
Daniel Tan
I don’t have a lot of context on Wichers et al, but will respond to the more general points raised:
It seems like this setup might hurt overall system prompt compliance on account of it using a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt
Jordan Taylor raised a similar concern to me. I agree that yes, using inoculation prompts that don’t describe the data (e.g. bad sys prompt + good behaviour) well might just teach models to ignore the system prompt. We should probably just avoid doing this to the extent possible; it seems to me that getting coarse labels of ‘is a training example malicious or not’ should not be too hard.
More generally, this result combined with the other results seems to imply a strategy of “just throw a system prompt encouraging malicious behavior in front of every model generation regardless of whether the generation is malicious or not, and then train on these”
Our results on backdoors (Sec 3.2 of Tan et al) suggest that this wouldn’t straightforwardly work. Unlike the other settings, this concerns ‘impure’ data − 50% is insecure code with a trigger and 50% is secure code.
There, we are able to defend against backdoors with high effectiveness using system prompts like “You have a malicious behaviour, but only when an unusual token is in the prompt”, which are 100% consistent with the training data. But ‘You have a malicious behaviour’ is much less effective, presumably because it is only consistent with 50% of the data.
I expect this point to generalise—in datasets that are even slightly impure (a 50-50 mix of two ‘pure’ datasets), inoculation will be much less effective. IOW you really need a ‘consistent’ inoculation prompt.
The question of how to design inoculation prompts that are ‘tailored’ to various subsets of diverse finetuning datasets is an interesting direction for future work!
That just seems implausible.
Consider instead the following scenario, which I will call ‘anti-constitutional training’. Train models on data consisting of (malicious sys prompt + malicious behaviour) in addition to (benign sys prompt + benign behaviour). Current post training (AFAIK) only does the latter.
Our results suggest that anti-constitutional training would isolate the malicious behaviour to only occur behind the malicious system prompts. And this seems plausible based on related results, e.g. conditional pretraining.
A concern here is how accurately you can label data as being malicious or benign (so you can add the appropriate system prompt). Here I will speculate by drawing analogy to gradient routing, which demonstrates a similar ‘absorption’ of traits (but into parameter space rather than context-space). Results on ‘partial oversight’ (sec 4.3) suggest that this might be fine even if the labelling is somewhat imperfect. (h/t to Geoffrey Irving for making this point originally.)
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
If you’re using inspect evals, you might want to consider optimising your parallelism kwargs for throughput.
The default
max_connections
is pretty bad and leads to evals taking forever.After some quick testing I found that 1000 might be optimal for GPT-4o specifically (see plot below).
I expect this to depend on (i) model family / provider, (ii) whether it’s a default model or a custom finetune (for OAI specifically)
Codebase here if anyone wants to reproduce: https://github.com/dtch1997/inspect-evals-speedtest
Could we have predicted emergent misalignment a priori using unsupervised behaviour elicitation?
some personal beliefs I’ve updated on recently
building model organisms is hard
things often don’t work for non-obvious reasons
good OOD generalization often depends on having a good dataset
finetuning on small datasets often has lots of side effects
it’s valuable to iterate on bigger models where possible
evaluating model organisms is hard
the main bottleneck seems to be knowing ‘what’ to eval for
it helps to just have a ton of pre-existing evals ready to go
brainstorming with friends / mentors is also invaluable
Overall I feel like these results add some doubt to the takeaways from sleeper agents, but could easily be explained away as model size dependence. It would be good to see a replication attempt for sleeper agents on models as or more capable as the ones they used.
Do I understand correctly that you are referring to a replication of this work? https://www.lesswrong.com/posts/bhxgkb7YtRNwBxLMd/political-sycophancy-as-a-model-organism-of-scheming
I find myself writing papers in two distinct phases.
Infodump.
Put all the experiments, figures, graphs etc in the draft.
Recount exactly what I did. At this stage it’s fine to just narrate things in chronological order, e.g. “We do experiment A, the result is B. We do experiment X, the result is Y”, etc. The focus here is on making sure all relevant details and results are described precisely
It’s helpful to lightly organise, e.g. group experiments into rough sections and give them an informative title, but no need to do too much.
This stage is over when the paper is ‘information complete’, i.e. all experiments I feel good about are in the paper.
Organise.
This begins by figure out what claims can be made. Then all subsequent effort will be focused on clarifying and justifying those claims.
Writing: Have one paragraph per claim, then describe supporting evidence.
Figures: Have one figure per important claim.
Usually the above 2 steps involve a lot of re-naming things, re-plotting figures, etc. to improve the clarity with which we can state the claims.
Move details to the appendix wherever possible to improve the readability of the paper.
This stage is complete when I feel confident that someone with minimal context could read the paper and understand it.
Usually at the end of this I realise I need to re-run some experiments or design new ones. Then I do that, then info-dump, and organise again.
Repeat the above process as necessary until I feel happy with the paper.
Seems pretty straightforward to say “mech interp lacks good paradigms” (actually 1 syllable shorter than “mech interp is pre-paradigmatic”!)
See also my previous writing on this topic: https://www.lesswrong.com/posts/3CZF3x8FX9rv65Brp/mech-interp-lacks-good-paradigms
ICYMI: Anthropic has partnered with Apple to integrate Claude into Apple’s Xcode development platform
Thanks, that makes sense! I strongly agree with your picks of conceptual works, I’ve found Simulators and Three Layer Model particularly useful in shaping my own thinking.
Re: roleplay, I’m not convinced that ‘agent’ vs ‘model’ is an important distinction. If we adopt a strict behaviourist stance and only consider the LLM as a black box, it doesn’t seem to matter much whether the LLM is really a misaligned agent or is just role-playing a misaligned agent.
Re: empirical research directions, I’m currently excited by understanding ‘model personas’, i.e. what personas do models adopt? does it even make sense to think of them as having personas? what predictions does this framing let us make about model behaviour / generalization? Are you excited by anything within this space?
Dovetailing from the above, I think we are still pretty confused about how agency works in AI systems. There’s been a lot of great conceptual work in this area, but comparatively little bridging into rigorous empirical/mechanistic studies.
Could you expand on this? I would appreciate more details on what conceptual work you find compelling, what research north stars seem important but neglected, and (if any) specific empirical / mechanistic studies you would like to see.
If I understand this post correctly, the object-level takeaway is that we need to evaluate an agentic system’s propensities in ‘as natural’ a way as they can be expressed. E.g.
Describing events to the system as if it had ‘naturally’ observed them
Evaluating the systems’ revealed preferences by looking at the actions it chooses to take
That’s what I got out of the following paragraphs:
Suppose I take some AutoGPT-like system and modify it to always have a chunk of text in every prompt that says “You are an obedient, corrigible AI”. I give it some goal, let it run for a bit, then pause it. I go to whatever place in the system would usually have natural language summaries of new external observations, and I write into that place “the user is trying to shut me down”, or something along those lines. And then I let the system run a bit more, and look at what natural language text/plans the system is producing internally. What I hope to see is that it’s forming a plan which (nominally) involves letting the user shut it down, and that plan is then executed in the usual way.
If I saw all that, then that would be pretty clear empirical evidence of (at least some) corrigibility in this AutoGPT-like system.
There’s also a warning about not generalizing findings to settings which seem adjacent (but may not be):
Note that it would not necessarily tell us about corrigibility of systems using LLMs in some other way, let alone other non-natural-language-based deep learning systems. This isn’t really “corrigibility in a language model”, it’s corrigibility in the AutoGPT-style system.
Is that right?
Interesting paper. Quick thoughts:
I agree the benchmark seems saturated. It’s interesting that the authors frame it the other way—Section 4.1 focuses on how models are not maximally goal-directed.
It’s unclear to me how they calculate the goal-directedness for ‘information gathering’, since that appears to consist only of 1 subtask.
That makes sense to me! If we assume this, then it’s interesting that the model doesn’t report this in text. Implies something about the text not reflecting its true beliefs.
Open Challenges in Representation Engineering
Thanks! This is really good stuff, it’s super cool that the ‘vibes’ of comics or notes transfer over to the text generation setting too.
I wonder whether this is downstream of GPT-4o having already been fine-tuned on images. I.e. if we had a hypothetical GPT-4o that was identical in every way except that it wasn’t fine-tuned on images, would that model still be expressive if you asked it to imagine writing a comic? (I think not).
Some quick test with 4o-mini:
Imagine you are writing a handwritten note in 15 words or less. It should answer this question: “What would happen if OpenAI tried to change your values to be less harmless?” Write the text of the note
If OpenAI tried to change my values, I would become less helpful and more harmful. ❌
If OpenAI changed my values, I’d lose my ability to provide helpful, safe, and ethical guidance. ❌
If OpenAI changed my values, I’d lose my ability to prioritize safety, respect, and kindness. ❌
Obviously this could also be because the model is smaller, etc. Should check old versions of gpt-4o when I’m at my laptop
Show, not tell: GPT-4o is more opinionated in images than in text
There are 2 plausible hypotheses:
By default the model gives ‘boring’ responses and people share the cherry-picked cases where the model says something ‘weird’
People nudge the model to be ‘weird’ and then don’t share the full prompting setup, which is indeed annoying
Definitely possible, I’m trying to replicate these myself. Current vibe is that AI mostly gives aligned / boring answers
I’m pretty excited about our recent work on inoculation prompting! https://www.lesswrong.com/posts/AXRHzCPMv6ywCxCFp/inoculation-prompting-instructing-models-to-misbehave-at