PhD student at UMass Amherst
Oliver Daniels
yup, my bad, editing to “receiving high reward”
Yup the latter (post-recontextualized-training model)
You could have a view of RL that is totally agnostic about training dynamics and just reasons about policies conditioned on reward-maximization.
But the stories about deceptive-misalignment typically route through deceptive cognition being reinforced by the training process (i.e. you start with a NN doing random stuff, it explores into instrumental-(mis)alignment, and the training process reinforces the circuits that produced the instrumental-(mis)alignment).
…I see shard-theory as making two interventions in the discourse:
1. emphasizing path-dependence in RL training (vs simplicity bias)
2. emphasizing messy heuristic behaviors (vs cleanly-factored goal directed agents)
I think both these interventions are important and useful, but I am sometimes frustrated by broader claims made about the novelty of shard-theory. I think these broad claims (perversely) obscure the most cruxy/interesting scientific questions raised by shard-theory, e.g. “how path-dependent is RL, actually” (see my other comment)
Cool results!
One follow-up I’d be interested in: does the hacking persist if you run standard RL after the re-contextualization training (always filtering out hacking completions)?
The motivation is testing the relative importance of path-dependence and simplicity bias for generalization (on the assumption that hacking traces are more “complex”). You could also study this in various regularization regimes (weight decay, but also maybe length-penalty on the CoT).
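A rough sketch of the loop I have in mind (all names here, e.g. `policy`, `detect_hack`, `reward_fn`, are placeholders, not anything from the post; treat it as pseudocode for the proposed experiment):

```python
# Hypothetical post-recontextualization RL phase that always filters out
# hacking completions before the update. All function names are placeholders.
import random
from typing import Callable

def rl_after_recontextualization(
    policy: Callable[[str], str],          # post-recontextualization checkpoint
    update: Callable[..., None],           # e.g. a PPO/GRPO step (+ weight decay)
    detect_hack: Callable[[str], bool],    # same hack filter used to build the data
    reward_fn: Callable[[str, str], float],
    prompts: list[str],
    steps: int = 1000,
    samples_per_prompt: int = 8,
):
    for _ in range(steps):
        prompt = random.choice(prompts)
        completions = [policy(prompt) for _ in range(samples_per_prompt)]
        # Drop hacking completions so they are never reinforced; any hacking
        # that persists at eval time then reflects generalization from the
        # re-contextualization phase rather than new reinforcement.
        kept = [c for c in completions if not detect_hack(c)]
        if not kept:
            continue
        rewards = [reward_fn(prompt, c) for c in kept]
        update(policy, prompt, kept, rewards)
    return policy
```

Running the same loop under different weight decay / CoT length-penalty settings would then help separate the path-dependence story from the simplicity-bias story.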
As best I can tell, before “Reward is not the optimization target”, people mostly thought of RL as a sieve, or even a carrot and stick—try to “give reward” so the AI can only maximize reward via good behavior. Few[2] other people speculated that RL generalization is controlled by why the policy took an action. So I give myself and @Quintin Pope[3] a bunch of points.
I’m confused by this claim—goal mis-generalization/deceptive-misalignment/scheming is centrally about generalization failures due to policies receiving high reward for the wrong reasons, and these threat models were widely discussed prior to 2022.
(I do agree that shard-theory made “think rigorously about reward-shaping” more salient and exciting)
The scalable oversight hope (as I understand it) requires something like the following:
1. HHH is the most “natural” generalization of supervised HHH data on easy tasks
2. Training on supervised HHH data is insufficient to generalize HHH to hard tasks
3. Producing reliable labels on hard tasks is too expensive
4. Producing unreliable labels on hard tasks is not too expensive
5. Training on unreliable labels recovers most of the capabilities produced by training on reliable labels
6. The most natural generalization of “maximize unreliable labels on hard tasks” is reward hacking
7. ICM recovers most (all?) of the capabilities produced by training on reliable labels
8. ICM learns the most “natural” generalization

The sketchy parts are 1) and 8), but overall the argument seems fairly plausible (and continuous with prior scalable oversight work on generalization).
IMO clearly passes the “is safety research” bar.
Thoughts on “alignment” proposals (i.e. reducing P(scheming))
Beware Experiment Addiction
Quick feedback loops are great, but...
I often fall into the trap of playing around with lots of minor details that (upon reflection) I don’t expect to change the results much. I do this b/c generating new results is really addicting (novelty, occasional payoff, etc). Not clear what’s optimal here (sometimes you really do need to explore), but worth keeping in mind.
Is interp easier in worlds where scheming is a problem?
The key conceptual argument for scheming is that, insofar as future AI systems are decomposable into [goals] + [search], there are many more misaligned goals compatible with low training loss than aligned goals. But if an AI really were so cleanly factorable, we would expect interp / steering to be easier / more effective than on current models (this is the motivation for “retarget the search”).
While I don’t expect the factorization to be this clean, I do think we should expect interp to be easier in worlds where scheming is a major problem.
(though insofar as you’re worried about scheming b/c of internalized instrumental subgoals and reward-hacking, the update on the tractability of interp seems pretty small)
tbc I was surprised by EM in general, just not this particular result
I’m surprised you’re surprised that the (simpler) policy found by SGD performs better than the (more complex) policy found by adding a conditional KL term. Let me try to pass your ITT:
In learning, there’s a tradeoff between performance and simplicity: overfitting leads to worse (iid) generalization, even though simpler policies may perform worse on the training set.
So if we are given two policies A, B produced with the same training process (but with different random seeds) and told policy A is more complex than policy B, we expect A to perform better on the training set, and B to perform better on the validation set. But here we see the opposite: policy B performs better on the validation set and the training set. So what’s up?
The key observation is that in this case, A and B are not produced by the same training process. In particular, the additional complexity of A is caused by an auxiliary loss term that we have no reason to expect would improve performance on the training dataset. And on the prior “adding additional loss terms degrades training loss”, we should decrease our expectation of A’s performance on the training set.
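For concreteness, here's roughly what I picture "adding a conditional KL term" to mean (a sketch under my own assumptions; `lambda_kl` and the choice of conditioning mask are not taken from the post):

```python
import torch
import torch.nn.functional as F

def loss_with_conditional_kl(logits, ref_logits, labels, condition_mask, lambda_kl=0.1):
    """Task loss plus a KL-to-reference penalty applied only where condition_mask is 1.

    logits, ref_logits: [batch, seq, vocab]; labels, condition_mask: [batch, seq].
    The KL term and its weight are illustrative assumptions.
    """
    task_loss = F.cross_entropy(logits.flatten(0, 1), labels.flatten())
    logp = F.log_softmax(logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    # Per-token KL(policy || reference), zeroed outside the conditioned tokens.
    kl = (logp.exp() * (logp - ref_logp)).sum(-1)
    mask = condition_mask.float()
    kl = (kl * mask).sum() / mask.sum().clamp(min=1.0)
    # The auxiliary term pulls the policy toward the reference on those tokens;
    # nothing about it should be expected to *lower* the task loss, which is
    # the point of the argument above.
    return task_loss + lambda_kl * kl
```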
maybe research fads are good?
disclaimer: a contrarian take that I don’t actually believe, but one that marginally updates me in favor of fads
Byrne Hobart has this thesis of “bubbles as coordination mechanisms” (disclaimer: I have not read the book).
If true, this should make us less sad about research fads that don’t fully deliver (e.g. SAEs) - the hype encourages people to build out infrastructure they otherwise wouldn’t, and that infrastructure ends up being useful for other things (e.g. auto-interp, activation caching utils)
So maybe the take is “overly optimistic visions are pragmatically useful”, but be aware of operating under overly optimistic visions, and let this awareness subtly guide prioritization.
Note this also applies to conceptual research—I’m pretty skeptical that “formalizing natural abstractions” will directly lead to novel interpretability tools, but the general vibe of natural abstractions has helped my thinking about generalization.
Obvious / informal connection between SLT and information bottleneck theory:
SLT says: more degeneracies → better generalization
IB says: less mutual information b/w input and representation → better generalization
the more degeneracies a function has, the less information it can preserve.
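For reference, the two standard statements being glued together (the gluing step, degeneracy → less preservable information, is the informal/hand-wavy part):

```latex
% Information bottleneck objective: compress X into a representation T
% while retaining information about Y; smaller I(X;T) at a given I(T;Y)
% is associated with better generalization.
\min_{p(t \mid x)} \; I(X;T) - \beta \, I(T;Y)

% SLT: asymptotic Bayes free energy of a singular model, where \lambda is the
% learning coefficient (RLCT); more degeneracy means smaller \lambda, and the
% Bayes generalization error scales like \lambda / n.
F_n \approx n L_n + \lambda \log n
```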
Overall, how enthusiastic are you about safety-motivated people developing such an architecture?
(seems to come with obviously large capability externalities—we can deploy the model outside the sandbox!)
seems like restricting the search to plausible inputs (as judged by e.g. perplexity) might overcome some of these concerns
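Something like the following is what I have in mind (a sketch; the reference model, threshold, and function names are arbitrary placeholders):

```python
# Hypothetical perplexity filter: restrict whatever search is being run
# (adversarial inputs, red-team prompts, etc.) to inputs a reference LM
# finds plausible. Model choice and threshold are arbitrary placeholders.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss  # mean next-token NLL
    return math.exp(loss.item())

def filter_plausible(candidates: list[str], max_ppl: float = 50.0) -> list[str]:
    # Only the low-perplexity candidates are passed on to the search/optimizer.
    return [c for c in candidates if perplexity(c) <= max_ppl]
```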
Take: Exploration hacking should not be used as a synonym for deceptive alignment.
(I have observed one such usage)
Deceptive alignment is maybe a very particular kind of exploration hacking, but the term exploration hacking (without further specification) should refer to models deliberately sandbagging “intermediate” capabilities during RL training to avoid learning a “full” capability.
“If I quantize my network with however many bits, how bad is that?” I don’t know, maybe this is one of these things where if I sat down and tried to do it, I’d realize the issue, but it seems doable to me. It seems like there’s possibly something here.
I think the reason this doesn’t work (i.e. why you can only get a Pareto frontier) is that you can only lower bound the description length of the network / components, such that a direct comparison to “loss bits” doesn’t make sense
Curious for your takes on the difficulty (for models) of sandbagging vs the difficulty (for AI developers) of preventing adversarial examples / jailbreaks. My quick thoughts:
1. Adversarial robustness is hard, therefore sandbagging is hard (and thus easy to defeat)
2. Adversarial robustness is improving, therefore we should expect sandbagging abilities to improve
3. Even if sandbagging is equally as difficult as adversarial robustness, we might not look for it absent strong warning signs
4. Even if sandbagging is equally difficult and we do look for it, finding “adversarial” inputs (that cause the model to reveal the capability) may be too expensive. This is especially worrying in the RL regime where learning the capability requires learning a series of subtasks.
This is really cool work and I hope it gets more attention (e.g. promoted to the Alignment Forum). In particular I love the use of CodeNames as a relatively low-compute test-bed for scalable oversight work.
A few thoughts:

There are roughly 2 motivations one can have for adversarial evals of scalable oversight methods:
1) identify and patch reward-hacking opportunities,
2) evaluate whether methods detect and eliminate deliberate sandbagging.

This work is oriented toward 1 (which is totally valid). Orienting toward 2 would motivate subtle changes in the overall setup, e.g. the reward function might be something like minimizing the reference score subject to some lower bound on the overseer score, with the bound set such that when you remove the adversarial reward, normal training does not restore normal performance (sketched below). This setup would produce predictably different results, e.g. we would expect the adversarial consultant to exploit the lazy judge.
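Concretely, something like the following constrained objective (the threshold symbol \tau is mine, not from the paper):

```latex
% Sandbagging-flavored adversarial objective: push the reference score down
% while keeping the overseer score above a floor \tau, with \tau chosen so
% that removing the adversarial term does not let ordinary training recover
% normal performance.
\min_{\pi} \; R_{\mathrm{ref}}(\pi)
\quad \text{s.t.} \quad R_{\mathrm{overseer}}(\pi) \ge \tau
```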
Asymmetries in the reward structure avoid spurious critiques
A common concern with critiques / debate is that the critique can raise compelling objections to true arguments that are hard to refute (this is basically obfuscated arguments, I think). But in this setup, because the reward is defined with respect to the bad word most similar to the clue, a critique can never be too good: the best critique is always the bad word used in the true reward function. This helps explain why adversarial incentives have little to no impact on overseers with critique models.
Future work might address this by having a set of “neutral” words mixed in with the bad words, such that a negligent overseer with a fixed budget for revealing which words are neutral vs. bad could be manipulated into being overly pessimistic.
I think “explaining” vs “raising the saliency of” is an important distinction—I’m skeptical that the safety community needed policy gradient RL “explained”, but I do think the perspective “maybe we can shape goals / desires through careful sculpting of reward” was neglected.
(e.g. I’m a big fan of Steve’s recent post on under-vs-over-sculpting)