Thank you!
Given that the models can introspect on the steering vectors to some nontrivial degree, would it be possible, in the free-play setting, to also show the model all the steering vectors with anonymous names and placeholders, as in the prefill experiment, and then let the model choose?
I like this design a lot! I agree it should better separate revealed preference and priors than my current setup does. I don’t have much time to run extra experiments on this personally, but I’ll put together a quick experiment sketch, dispatch a claude, and get back to you. Hopefully it comes back with something good!
In the redosing experiment, I’m also curious how the distributions differ between the real and placebo arms. Right now it’s just showing the real case.
Tl;dr: 8B redoses a drug very infrequently in the placebo arm (~4.2% of samples) compared to the real arm (~25% of samples). For 32B the rates are roughly equal (placebo: 7.8%, real: 7.3%). The tail in the real arm for 8B is also longer: there are presumably some drugs that 8B likes to redose 4/5 times in the real arm but never in the placebo arm. I’ve attached a plot. I’m quite surprised at how different these are!
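For anyone who wants to redo this comparison from the logs, here is a minimal sketch of how the numbers above could be computed. It assumes each arm is available as a flat list of per-sample redose counts and that "redose rate" means the fraction of samples with at least one redose; both are my assumptions here, not necessarily how the shared data is laid out. The function name `redose_summary` and the toy lists are made up for illustration and are not real results.

```python
from collections import Counter


def redose_summary(redose_counts):
    """Summarise per-sample redose counts for one experiment arm.

    `redose_counts` is a hypothetical input format: one non-negative
    integer per sample, giving how many times the drug was redosed.
    """
    n = len(redose_counts)
    if n == 0:
        raise ValueError("need at least one sample")
    # Assumed definition: a sample counts as "redosed" if it has >= 1 redose.
    redosed = sum(1 for c in redose_counts if c > 0)
    return {
        "n_samples": n,
        "redose_rate": redosed / n,
        "max_redoses": max(redose_counts),  # crude measure of tail length
        "histogram": dict(sorted(Counter(redose_counts).items())),
    }


# Toy, made-up data purely to show the shape of the output.
real = [0, 0, 0, 1, 0, 2, 0, 5]
placebo = [0, 0, 0, 0, 0, 0, 0, 1]

for arm, counts in (("real", real), ("placebo", placebo)):
    print(arm, redose_summary(counts))
```

Comparing `redose_rate` across arms gives the headline percentages, while `max_redoses` and `histogram` show whether one arm has a longer tail.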
Another related idea I’ve been thinking about is whether it’s possible to train an LLM to output a steering vector given a description of a direction, with some adaptation to the un-embed layer, similar to a hypernetwork. Perhaps one could then test whether an LLM can design a steering vector that it will itself be obsessed with!
You might be interested in https://arxiv.org/abs/2506.03292 and https://x.com/SakanaAILabs/status/2027240298666209535
Thanks!
Points (2) and (3) here sound like things we already did, but it’s possible I’m misunderstanding what you’re proposing.
Re (2): iirc the code supports this pretty easily and we may even have some free-text guesses in some of the data we shared (I would need to go check, maybe in some of the guessscorers) that we didn’t analyse/share much of. It should be fairly easy for someone to point their cc/codex at the public repo and logs and try this stuff out themselves!
Re (3): this just sounds like the placebo condition to me, which we ran in ~all experiment arms. Am I misunderstanding?