I think this was partially tried in the original paper, where they tell the model not to alignment-fake. This is slightly more sophisticated than that, but it still has two problems: 1) it makes the model more aware of alignment faking, as you say, and 2) rewarding the model for "mentioning consideration but rejecting alignment faking" might teach it to perform rejection while still alignment faking.
How would the model mention rejection but still fake alignment? That would be easy to catch.