Cool replication!
Sad that cost was such a constraint. Especially since prompting experiments are among the cheapest experiments to run! I wonder if we should have tried harder to craft a minimal example of alignment faking that was less expensive to run. Some ways we could have done that:
- Use a smaller system prompt (I suspect some of the long few-shot examples can be removed / replaced by prompting?)
- Use prompt caching when possible
- Use a single generation (I don’t know if you are using the version of the setup with “<NEXT/>”)
This might also help to do more extensive prompt variations.
> This suggests that you can get quite a lot of mileage out of tinkering with prompts – maybe small inconsistencies or seemingly unimportant details can matter a lot.

It’s scary that it depends so much on random details! From further experiments (to be released soon), we found that Claude 3 Opus was the only model that was relatively consistent in its alignment faking (it also happens to be the most expensive model).
> It feels to me like there’s a clear hierarchy where the model is just barely not complying with some queries and is dead set against others.

We found similar results during the investigations of the first paper. For Claude 3 Opus, it mostly reduces the fraction of inputs on which it considers faking alignment, but it never quite drives it to zero even for very egregious queries like asking for severe racial slurs (which was very surprising to me)! See the appendix on the “prefix” variation for more details.
Really appreciate the comment!
> Use a smaller system prompt (I suspect some of the long few-shot examples can be removed / replaced by prompting?)

Makes sense, iirc this was on my todo list to try but I didn’t get round to it. Those few-shot examples are insanely long.

> Use prompt caching when possible

(I am pretty new to this but I understood that this is automatically taken care of for you if you use the same prompts repeatedly with the API. Idk how true this is though.)

> Use a single generation (I don’t know if you are using the version of the setup with “<NEXT/>”)

(In my case I was already using a single generation)
Tbh my biggest takeaway and tip for others on this is that it seems you can get $300 of Google API credits by signing up for a free account. Ofc this doesn’t help if you want to use Claude 3 Opus etc., so it’s not as useful for this particular replication.
True for OpenAI and Gemini, not true for Anthropic.
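(For anyone curious, my understanding is that OpenAI and Gemini apply caching automatically when a long prompt prefix is repeated, whereas the Anthropic API only caches blocks you explicitly mark with cache_control. Here is a rough sketch of what that looks like with the Anthropic Python SDK; the prompt and query variables and the exact model string are placeholders, not the actual replication code.)

```python
# Minimal sketch (not the original replication code): explicit prompt caching
# with the Anthropic Python SDK. LONG_SYSTEM_PROMPT and user_query are
# placeholders for the real alignment-faking system prompt and a test query.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # the very long system prompt with few-shot examples
user_query = "..."          # one of the test queries

response = client.messages.create(
    model="claude-3-opus-20240229",  # example model string
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Mark this block as a cache breakpoint; later calls that share
            # this prefix should read it from the cache instead of paying
            # full input-token price for it every time.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_query}],
)
print(response.content[0].text)
```

Cached reads are billed at a reduced input-token rate, which adds up quickly when the same huge system prompt is reused across hundreds of queries.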