I’d guess the biggest missing control is to fine-tune on some random/neutral diverse data (same compute, same number of steps, same LoRA setup) and then do EM fine-tuning. My impression is that getting EM can be finicky, and a simpler explanation is that any additional fine-tuning creates inertia against subsequent fine-tuning, rather than CAI developing any metacommunicative skills. It would also be helpful to see verbatim examples of the model’s generations.
Good point, this is closely related to Oliver Daniels’ comment below. I replied there with some details about the control strategies we’re exploring (neutral constitution, non-constitutional LoRA fine-tuning, etc.).