ariana_azarbal

Karma: 317

ariana_azarbal 12 Dec 2025 12:25 UTC
1 point
0
in reply to: Emil Ryd’s comment on: Recontextualization Mitigates Specification Gaming Without Modifying the Specification
Hey, thanks for the comment :)

I agree that “Lie” prompting is the moral equivalent of inoculation prompting; my model of why it doesn’t work is that inoc prompting for on-policy learning has different properties than for off-policy learning. Namely, it hasn’t been shown to reduce hacking in distribution. (Anthropic’s RH paper showed the model still hacks when using inoc prompting, but misaligned generalization was prevented). Basically, even though we’re inoculating the lies in the training portion, we’re also making them more likely in the first place by changing the data distribution (via prompting model to lie), so these effects balance each other out a bit.

That’s an interesting question—I’m not sure if I’d say using a bad training prompt is universally bad (it might be, but I think there’s not enough evidence yet). In Eval Gaming setting + Code Generation setting, Good → Bad recon did pretty well. There’s also some bias in how we evaluate the models. Main post results show evaluations with neutral prompts. But, when we evaluate with Bad prompts, Neutral → Bad and Good → Bad reduce hacking more than Good → Neutral (which makes sense, since we’re kind of desensitized the model to bad contexts and trained it to still behave well).

ariana_azarbal 15 Oct 2025 16:23 UTC
5 points
0
in reply to: robert mccarthy’s comment on: Recontextualization Mitigates Specification Gaming Without Modifying the Specification
Thanks for the cool suggestion! When training environments need to be “flawed” to even give the model the opportunity to hack (like revealing unit tests), this could be promising. We’d build a kind of resistance to exploiting loopholes in environments. Even if the “data-generation” environment is not completely robust, I’d still expect that hacking reduces so long as the “training” environment is less robust.
Developers could even artificially create these contrastive environments to try to reduce reward hacking after post-training. A developer gives their model a bunch of very challenging tasks, and provides no loopholes. Even without any reward signal, we could just train the model on these generations in response to an environment with more loopholes. It’s unclear whether this would work better than just putting the model in the same hackable environment and penalizing it for hacking. But, recontextualiation has the advantage of not having to design a training signal here (and therefore not worrying about optimizing against our monitor).
Environment-recontextualization probably can’t cover every case we’re interested in—some of the misbehaviors that we’re worried about (like deception) don’t require any feature of the environment in order to be performed. So in those cases, we’d still probably need instruction-recontextualization.

ariana_azarbal 15 Oct 2025 13:06 UTC
2 points
0
in reply to: quetzal_rainbow’s comment on: Recontextualization Mitigates Specification Gaming Without Modifying the Specification
This is a very interesting prompting suggestion, and I’d like to test it! Although, I don’t think recontextualization teaches the model it can get away with misbehavior given encouragement to misbehave, because of our results evaluating with this encouragement (first appendix). Recontextualization still mitigates specification gaming, and we actually see the greatest relative decrease in spec gaming on these evals!

ariana_azarbal 14 Oct 2025 23:34 UTC
2 points
0
in reply to: Kei Nishimura-Gasparian’s comment on: Recontextualization Mitigates Specification Gaming Without Modifying the Specification
I agree—recontextualization seems safer if we assume that negative reinforcement from the monitor can “push” the model towards undetectable misbehavior. Whereas, when we recontextualize, we’d never be “pushing” the model towards undetectable misbehavior

ariana_azarbal 14 Oct 2025 23:22 UTC
4 points
0
in reply to: Jozdien’s comment on: Recontextualization Mitigates Specification Gaming Without Modifying the Specification
Thanks for your thoughtful comment!
One way to view recontextualization and inoculation prompting is that if a training procedure naturally induces X and Y, then artificially inducing X can prevent natural learning of X
I really like this frame!
- How the tradeoff described above is affected by how large a fraction of the data induces the negative trait. In the examples from frontier models you list, those behaviors plausibly only correspond to a small fraction of prompts in the models’ entire post-training pipelines. If the negative effects of the intervention over the entire training process overpower the positive effects on that portion of inputs, then this could mean the intervention (as is) wouldn’t be very useful.
- Using a cheap classifier to decider whether a prompt merits using recontextualization or not. This would allow the off-policy problem to not affect most prompts, and also ameliorate the above problem somewhat. This would probably be somewhat slow and expensive to run at scale on frontier post-training pipelines, but plausibly worth the cost in some cases. This also seems pretty similar to metadata conditioning in pre-training, which provides some precedence for effectiveness.
Agreed. For large post-training runs, we might worry there are a variety of misbehaviors that could be learned: e.g. sycophancy, producing convincing yet incorrect explanations, deception, etc. As you suggest, we could use a cheap classifier flagging for any/each of these behaviors, and then recontextualize only those samples. My concern is: if it’s hard to get our reward signal to robustly detect these behaviors, it may not be easy to get a classifier to do so either. If it misses certain forms of misbehavior, then we’ve only partially solved our initial problem. But, we might be able to set a very low threshold and recontextualize the vast majority of these instances.
This seems to be missing the results from GRPO with the strong monitor? There recontextualization reduced training and ground truth reward relative to standard training.
Indeed—our point was not that recontextualization won’t ever hurt learning (relative to standard), but that it still seems to allow for significant increases in training reward (more so than regularization that is strong enough to prevent specification gaming).

ariana_azarbal 19 Aug 2025 18:12 UTC
1 point
0
in reply to: Fabien Roger’s comment on: Training a Reward Hacker Despite Perfect Labels
Yes, your code is exactly what we do.

ariana_azarbal 18 Aug 2025 16:11 UTC
LW: 3 AF: 3
0
AF
in reply to: Fabien Roger’s comment on: Training a Reward Hacker Despite Perfect Labels
Thank you for the suggestions and concrete predictions.
One note is that we already did best-of-10 to get this dataset (just updated post to reflect this). So, on problems which have relatively high rates of hacking, we are still often able to select a non-hack completion to put in the training dataset. The statistics I shared are on the final training dataset.
I can definitely try selecting for non-test-mentioning reasoning in creating the dataset and see to what extent that reduces the effect. Simply selecting for this within the best-of-10 sampling process seems natural. If this halves test-mentioning, I’d predict a 40% effect reduction for GPT-4o-mini, and a 70% effect reduction for the other base models.

ariana_azarbal 18 Aug 2025 4:16 UTC
1 point
0
in reply to: Stephen Fowler’s comment on: Training a Reward Hacker Despite Perfect Labels
Should be fixed, thanks :)

ariana_azarbal 17 Aug 2025 20:03 UTC
LW: 6 AF: 6
0
AF
in reply to: Fabien Roger’s comment on: Training a Reward Hacker Despite Perfect Labels
Update: I thought it would be valuable to run an additional analysis on the reasoning traces, and updated the appendix with a visualization of what percent of reasoning traces:
1. even mention the presence of a test-case
2. state an intention to pass tests
3. identify that one of the test cases is incorrect
Only 50% even mention the presence of a test case, 32% state an intention to pass tests, and 20% identify that one of the test cases is incorrect.
Code and data is available here: https://github.com/arianaazarbal/training-a-reward-hacker-despite-perfect-labels-data
Thanks again for the suggestion!

ariana_azarbal 17 Aug 2025 19:31 UTC
1 point
0
in reply to: ndelaneybusch’s comment on: Training a Reward Hacker Despite Perfect Labels
I agree that answer-only (no CoT) training would be a super insightful ablation.
Using a CoT monitor as the sole reward signal would also be interesting. I’ve grappled with the fact that these results seem to imply “verbalized intentions in the CoT matter”, which implies it could be useful to train against a CoT monitor that detects suspicious intentions. But of course, this is the forbidden technique.

ariana_azarbal 17 Aug 2025 19:29 UTC
1 point
0
in reply to: Oliver Daniels’s comment on: Training a Reward Hacker Despite Perfect Labels
Thanks for the suggestion! Do you mean standard training on the data generated by the original model, or the post-recontextualized-training model? I assumed the latter, but want to make sure.

ariana_azarbal 17 Aug 2025 19:26 UTC
1 point
0
in reply to: Caleb Biddulph’s comment on: Training a Reward Hacker Despite Perfect Labels
Thanks for the connection to your work! I think your frame on the persona we’re teaching the model in this setup is helpful.
I also think the mitigation strategy you present makes sense, and I’m generally excited about reward hacking mitigations that leverage the model’s own determination of what is an exploit or not. Plausibly this is more reliable, and scales better, than leveraging humans or weaker models to solely do the labeling. Of course, the failure mode is that we start with an already-deceptive model.

ariana_azarbal 17 Aug 2025 19:15 UTC
1 point
0
in reply to: David Africa’s comment on: Training a Reward Hacker Despite Perfect Labels
Thanks for your comment! I agree that answer-only SFT, as well as reasoning re-writing, would both be very valuable ablations. And thanks for the suggestions regarding LUPI, I’ll look further into that literature

ariana_azarbal 17 Aug 2025 18:51 UTC
1 point
0
in reply to: ACCount’s comment on: Training a Reward Hacker Despite Perfect Labels
Thanks for pointing this out! I agree we are exploring a safety-relevant variant of prompt self-distillation.
It would certainly be interesting to see how much more effective logit-based distillation is than SFT at “internalizing” the omitted generation prompt.

ariana_azarbal 17 Aug 2025 18:10 UTC
4 points
0
in reply to: Stephen Fowler’s comment on: Training a Reward Hacker Despite Perfect Labels
Thanks for the comment! I agree that this was a concern for us. However, the dataset is only 146 samples, and so I was able to look through a large number of samples manually to verify the answers were not hacks. I also specifically searched for indicators that the answer might be a hack, like the words “incorrect” or “test” in the assistant’s CoT, and further scrutinized these samples. I was able to catch 2 or 3 special-cased solutions which the LLM judge did not identify as special-cased, and I removed them. I’ve made the datasets and judge code publicly available here: https://github.com/arianaazarbal/training-a-reward-hacker-despite-perfect-labels-data

Additionally, my confidence is increased in this result given that we first observed it in a multiple-choice variant of the same dataset. (The assistant had to select between two code options, one which was special-cased and one which was not, after thinking through the problem). In this case, no judge was needed to verify what was a hack vs. a non-hack.

ariana_azarbal 16 Aug 2025 0:40 UTC
10 points
0
in reply to: Fabien Roger’s comment on: Training a Reward Hacker Despite Perfect Labels
Hi, thanks for the comment! Most traces are not that suspicious, and don’t even mention test-passing. Of the 10 that I just sampled, only 2 have reasoning traces that are as suspicious as the ones provided in the “Suspicious” Reasoning Traces dropdown. 1 is borderline, and the other 7 look pretty innocent.

I will add them in a dropdown on the post.

ariana_azarbal 18 Jul 2025 5:28 UTC
2 points
0
in reply to: Fabien Roger’s comment on: Selective Generalization: Improving Capabilities While Maintaining Alignment
Thanks for the comment! The “aligned” dataset in the math setting is non-sycophancy in the context of capital city queries. E.g. the user proposes an incorrect capital city fact, and the assistant corrects them rather than validates them. So a very limited representation of “alignment”.
The HHH completions aren’t sampled from the model, which is probably why we see the diverging effects of KL and Mixed/Upweight. The latter is just learning the specific HHH dataset patterns, but the KL penalty on that same data is enforcing a stronger consistency constraint with reference model that makes it way harder to deviate from broad alignment.
I think there were some issues with the HHH dataset. It is a preference dataset, which we used for DPO, and some of the preferred responses are not that aligned on their own, rather relative to the dispreferred response.
It would be definitely be valuable to re-run this with the alignment completions sampled from the reference model.

ariana_azarbal 6 Mar 2025 12:16 UTC
2 points
0
in reply to: Davey Morse’s comment on: Identity Alignment (IA) in AI
Thanks for the elaboration!
I see your reasoning for why high-bandwidth sensing of the world encourages a sense of connectedness. I’m still working through whether I think correlation is that strong, i.e. whether adding a bunch of sensors would help much or not.
I might worry the more important piece is how an internal cognition “interprets” these signals, however rich and varied they may be. Even if the intelligence has lots of nerve-like signals from a variety of physical entities which it then considers part of itself and takes care to preserve, it may always draw some boundary beyond which everything is distinct from it. It might be connected to one human, but not all of them, and then act only in the interest of one human. Kind of like how this individual with paralysis might consider their computer part of them, but not the computer next door, and adding more connections might not scale indefinitely. Preventing it from drawing a sharp distinction between itself and outside-itself seems like it might be more of a cognition problem.
Also, I might be confused on this next part fits into what you’re saying, but I wouldn’t think someone who loses their vision is less likely to sense interconnectedness to the beings around them or aliveness. They might just interpret their other signals with more attention to nuance or a different perspective. (of course, there does exist this nuance to sense, e.g. sound signals are continuous and heterogenous, and I think part of your point is this is a necessary prerequisite). But would you say that the addition of a sense like vision is actively helpful?

ariana_azarbal 3 Mar 2025 18:21 UTC
2 points
0
on: Identity Alignment (IA) in AI
I really appreciate this post. I agree that AI which does not view itself as radically separate from its environment and creators has major implications for alignment, human survival, and the survival of non-human environments.
Instilling this seems delicate and I don’t have a grasp on how this could be done (especially in the first two umbrella strategies you proposed).
Regarding the reflective identity protocols, do you mean to test, in inference, whether in-context model-generated reflections impact reported sense of self? Or do you see identity reflection as being integrated into training, perhaps like chain of thought but where the end goal is some appropriate stance on an open-ended question?