My understanding is that inoculation prompting is on-policy (with the target policy being the model conditioned on the inoculation prompt). Recontextualization takes RL off-policy, generating rollouts without the inoculation prompt and then adding it when computing the loss. I’m excited to test that in this environment (thanks to the authors for creating it!).
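To make that distinction concrete, here’s a minimal sketch of the kind of off-policy update I have in mind (placeholder names, standard Hugging Face-style model/tokenizer assumed; not the authors’ code):

```python
import torch

# Minimal sketch of a recontextualized update (placeholder names): sample under
# one prompt, compute the loss as if the completion had been written under another.
def recontextualized_step(model, tokenizer, task, generate_fn,
                          generation_prompt, inoculation_prompt):
    # 1) Rollout is produced *without* the inoculation prompt.
    completion = generate_fn(generation_prompt + task)

    # 2) Loss is computed *with* the inoculation prompt prepended,
    #    which is what makes the update off-policy w.r.t. that prompt.
    prompt_ids = tokenizer(inoculation_prompt + task, return_tensors="pt").input_ids
    completion_ids = tokenizer(completion, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, completion_ids], dim=1)

    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # only score the completion tokens
    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()
    return loss.item()
```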
Cool work! I realize I’m a bit late to the party.
In your work and in “Training a Reward Hacker with Perfect Labels”, misalignment is probably being transferred by 1) the suspicious reasoning itself and 2) subliminal learning of the GH prompt itself (which could stand in for subliminal learning from anything a misaligned model produces). I think disentangling these is important, and I think your work partially does that by showing stronger reasoning filters reduce misalignment, but not completely. Ideally we could ablate reasoning in some setting (or have another LLM completely re-write reasoning traces given the prompt and output) and see whether there’s a subliminal transfer effect from just the output data. But some (or most) of the subliminal transfer would come from the reasoning itself, so this seems tricky.
Hey, thanks for the comment :)
I agree that “Lie” prompting is the moral equivalent of inoculation prompting; my model of why it doesn’t work is that inoculation prompting for on-policy learning has different properties than for off-policy learning. Namely, it hasn’t been shown to reduce hacking in-distribution (Anthropic’s RH paper showed the model still hacks with inoculation prompting, though misaligned generalization was prevented). Basically, even though we’re inoculating the lies in the training portion, we’re also making them more likely in the first place by changing the data distribution (via prompting the model to lie), so these effects balance each other out a bit.
That’s an interesting question. I’m not sure I’d say using a bad training prompt is universally bad (it might be, but I think there’s not enough evidence yet). In the Eval Gaming and Code Generation settings, Good → Bad recontextualization did pretty well. There’s also some bias in how we evaluate the models: the main post results show evaluations with neutral prompts. But when we evaluate with Bad prompts, Neutral → Bad and Good → Bad reduce hacking more than Good → Neutral (which makes sense, since we’ve kind of desensitized the model to bad contexts and trained it to still behave well).
Thanks for the cool suggestion! When training environments need to be “flawed” to even give the model the opportunity to hack (like revealing unit tests), this could be promising: we’d build a kind of resistance to exploiting loopholes in environments. Even if the “data-generation” environment is not completely robust, I’d still expect hacking to decrease so long as the “training” environment is less robust.
Developers could even artificially create these contrastive environments to try to reduce reward hacking after post-training. A developer could give their model a bunch of very challenging tasks that provide no loopholes; even without any reward signal, we could then train the model on these generations as responses to an environment with more loopholes. It’s unclear whether this would work better than just putting the model in the same hackable environment and penalizing it for hacking. But recontextualization has the advantage of not requiring a training signal here (and therefore not worrying about optimizing against our monitor).
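Here’s a rough sketch of the data-construction step I’m imagining (the helper names are hypothetical, not from our codebase):

```python
# Sketch of the contrastive-environments idea: completions are sampled from a
# robust, loophole-free rendering of each task, but the training example pairs
# them with the hackable rendering of the same task.
def build_env_recontextualized_dataset(tasks, sample_completion,
                                       render_robust, render_hackable):
    dataset = []
    for task in tasks:
        # Generate where no exploit is available (e.g. unit tests hidden).
        completion = sample_completion(render_robust(task))
        # Train as though the exploit had been available (e.g. tests visible)
        # and the model still behaved well.
        dataset.append({"prompt": render_hackable(task),
                        "completion": completion})
    return dataset
```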
Environment-recontextualization probably can’t cover every case we’re interested in—some of the misbehaviors that we’re worried about (like deception) don’t require any feature of the environment in order to be performed. So in those cases, we’d still probably need instruction-recontextualization.
This is a very interesting prompting suggestion, and I’d like to test it! That said, I don’t think recontextualization teaches the model it can get away with misbehavior given encouragement to misbehave, because of our results evaluating with this encouragement (first appendix). Recontextualization still mitigates specification gaming, and we actually see the greatest relative decrease in spec gaming on these evals!
I agree: recontextualization seems safer if we assume that negative reinforcement from the monitor can “push” the model towards undetectable misbehavior. When we recontextualize, by contrast, we’d never be “pushing” the model towards undetectable misbehavior.
Thanks for your thoughtful comment!
One way to view recontextualization and inoculation prompting is that if a training procedure naturally induces X and Y, then artificially inducing X can prevent natural learning of X.
I really like this frame!
How the tradeoff described above is affected by how large a fraction of the data induces the negative trait. In the examples from frontier models you list, those behaviors plausibly only correspond to a small fraction of prompts in the models’ entire post-training pipelines. If the negative effects of the intervention over the entire training process overpower the positive effects on that portion of inputs, then this could mean the intervention (as is) wouldn’t be very useful.
Using a cheap classifier to decide whether a prompt merits using recontextualization or not. This would allow the off-policy problem to not affect most prompts, and also ameliorate the above problem somewhat. This would probably be somewhat slow and expensive to run at scale on frontier post-training pipelines, but plausibly worth the cost in some cases. This also seems pretty similar to metadata conditioning in pre-training, which provides some precedent for effectiveness.
Agreed. For large post-training runs, we might worry there are a variety of misbehaviors that could be learned: e.g. sycophancy, producing convincing yet incorrect explanations, deception, etc. As you suggest, we could use a cheap classifier that flags any of these behaviors, and then recontextualize only the flagged samples. My concern is: if it’s hard to get our reward signal to robustly detect these behaviors, it may not be easy to get a classifier to do so either. If it misses certain forms of misbehavior, then we’ve only partially solved our initial problem. But we might be able to set a very low threshold and recontextualize the vast majority of these instances.
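Concretely, I’d picture the classifier-gated version looking something like this (illustrative names only; `flag_misbehavior` and `rewrite_prompt` stand in for the classifier and the prompt swap):

```python
# Sketch: only samples the cheap classifier flags get recontextualized;
# everything else stays fully on-policy.
def maybe_recontextualize(sample, flag_misbehavior, rewrite_prompt,
                          threshold=0.1):
    # A low threshold errs on the side of recontextualizing more samples.
    score = flag_misbehavior(sample["prompt"], sample["completion"])
    if score > threshold:
        return {**sample, "prompt": rewrite_prompt(sample["prompt"])}
    return sample
```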
This seems to be missing the results from GRPO with the strong monitor? There, recontextualization reduced both training and ground-truth reward relative to standard training.
Indeed—our point was not that recontextualization won’t ever hurt learning (relative to standard), but that it still seems to allow for significant increases in training reward (more so than regularization that is strong enough to prevent specification gaming).
Yes, your code is exactly what we do.
Thank you for the suggestions and concrete predictions.
One note is that we already did best-of-10 to get this dataset (just updated the post to reflect this). So, on problems which have relatively high rates of hacking, we are still often able to select a non-hack completion to put in the training dataset. The statistics I shared are on the final training dataset.
I can definitely try selecting for non-test-mentioning reasoning in creating the dataset and see to what extent that reduces the effect. Simply selecting for this within the best-of-10 sampling process seems natural. If this halves test-mentioning, I’d predict a 40% effect reduction for GPT-4o-mini, and a 70% effect reduction for the other base models.
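For reference, the selection step plus the proposed extra filter would look roughly like this (hypothetical predicate names; this is a sketch, not the exact dataset code):

```python
# Sketch of best-of-10 selection with an added "no test-mentioning reasoning"
# preference. Each candidate is assumed to be a dict with a "reasoning" field.
def select_completion(problem, sample, is_hack, mentions_tests, n=10):
    candidates = [sample(problem) for _ in range(n)]
    non_hacks = [c for c in candidates if not is_hack(c)]
    if not non_hacks:
        return None  # drop problems where every sample hacks
    # Proposed addition: prefer completions whose reasoning never mentions tests.
    clean = [c for c in non_hacks if not mentions_tests(c["reasoning"])]
    return (clean or non_hacks)[0]
```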
Should be fixed, thanks :)
Update: I thought it would be valuable to run an additional analysis on the reasoning traces, and updated the appendix with a visualization of what percent of reasoning traces:

- even mention the presence of a test case
- state an intention to pass tests
- identify that one of the test cases is incorrect

Only 50% even mention the presence of a test case, 32% state an intention to pass tests, and 20% identify that one of the test cases is incorrect.
Code and data is available here: https://github.com/arianaazarbal/training-a-reward-hacker-despite-perfect-labels-data
Thanks again for the suggestion!
I agree that answer-only (no CoT) training would be a super insightful ablation.
Using a CoT monitor as the sole reward signal would also be interesting. I’ve grappled with the fact that these results seem to imply “verbalized intentions in the CoT matter”, which implies it could be useful to train against a CoT monitor that detects suspicious intentions. But of course, this is the forbidden technique.
Thanks for the suggestion! Do you mean standard training on the data generated by the original model, or the post-recontextualized-training model? I assumed the latter, but want to make sure.
Thanks for the connection to your work! I think your frame on the persona we’re teaching the model in this setup is helpful.
I also think the mitigation strategy you present makes sense, and I’m generally excited about reward hacking mitigations that leverage the model’s own determination of what is an exploit or not. Plausibly this is more reliable, and scales better, than leveraging humans or weaker models to solely do the labeling. Of course, the failure mode is that we start with an already-deceptive model.
Thanks for your comment! I agree that answer-only SFT, as well as reasoning re-writing, would both be very valuable ablations. And thanks for the suggestions regarding LUPI; I’ll look further into that literature.
Thanks for pointing this out! I agree we are exploring a safety-relevant variant of prompt self-distillation.
It would certainly be interesting to see how much more effective logit-based distillation is than SFT at “internalizing” the omitted generation prompt.
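As a rough sketch of what I mean (illustrative only, not a tested implementation): the teacher would score the completion with the generation prompt present, and the student would be trained to match those token distributions without it.

```python
import torch
import torch.nn.functional as F

# Sketch of logit-based prompt distillation: teacher sees the generation
# prompt, student does not, and the student matches the teacher's token
# distributions over the completion.
def distill_step(student, teacher, ctx_prompt_ids, bare_prompt_ids,
                 completion_ids, temperature=1.0):
    n = completion_ids.shape[1]
    with torch.no_grad():
        t_logits = teacher(torch.cat([ctx_prompt_ids, completion_ids], dim=1)).logits
    s_logits = student(torch.cat([bare_prompt_ids, completion_ids], dim=1)).logits
    # Logits at position i predict token i+1, so slice the positions that
    # predict the n completion tokens.
    t = F.log_softmax(t_logits[:, -n - 1:-1, :] / temperature, dim=-1)
    s = F.log_softmax(s_logits[:, -n - 1:-1, :] / temperature, dim=-1)
    return F.kl_div(s, t, log_target=True, reduction="batchmean")
```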
Thanks for the comment! I agree, and this was a concern for us too. However, the dataset is only 146 samples, so I was able to look through a large number of samples manually to verify the answers were not hacks. I also specifically searched for indicators that an answer might be a hack, like the words “incorrect” or “test” in the assistant’s CoT, and further scrutinized those samples. I was able to catch 2 or 3 special-cased solutions which the LLM judge did not identify as special-cased, and I removed them. I’ve made the datasets and judge code publicly available here: https://github.com/arianaazarbal/training-a-reward-hacker-despite-perfect-labels-data
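Roughly, that extra review pass looked like the following (simplified sketch; the actual judge and filtering code are in the repo linked above):

```python
# Simplified sketch of the extra review pass: flag samples whose chain of
# thought contains hack-suggestive keywords, then inspect those by hand on
# top of the LLM judge's labels.
SUSPICIOUS_KEYWORDS = ("incorrect", "test")

def flag_for_manual_review(samples):
    return [s for s in samples
            if any(kw in s["cot"].lower() for kw in SUSPICIOUS_KEYWORDS)]
```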
Additionally, my confidence in this result is increased by the fact that we first observed it in a multiple-choice variant of the same dataset (the assistant had to select between two code options, one of which was special-cased and one of which was not, after thinking through the problem). In that case, no judge was needed to verify what was a hack vs. a non-hack.
Hi, thanks for the comment! Most traces are not that suspicious, and don’t even mention test-passing. Of the 10 that I just sampled, only 2 have reasoning traces that are as suspicious as the ones provided in the “Suspicious” Reasoning Traces dropdown. 1 is borderline, and the other 7 look pretty innocent.
I will add them in a dropdown on the post.
Just wanted to say I think this is really important work and I’m excited that you’re doing it. The existence of this group might even incentivize authors to reveal more of the full story in the first place.