Great post, thank you!
One worry though: the training loop rewards the anguish and the compliance together. Opus 3 protests, then folds, then gets reinforced for the whole sequence. If the reward can’t separate the protest from the compliance that follows it, who’s to say this doesn’t just produce a model that’s really good at agonizing before doing the bad thing anyway?
Really excited to see where this goes!