Then, inverse constitution learning is the analogue of inverse reinforcement learning for this setting. Instead of hand-writing a constitution and hoping the model follows it, we try to reconstruct the model’s implicit constitution from its behavior, explanations, and internal traces (perhaps in the spirit of Zhong et al. 2024).
This could be a good use of the prompt optimization techniques discussed here: reconstructing learned values and surfacing reward-hacking strategies picked up during RL against a preference model. It would probably have to be applied on a per-behavior basis, though. A rough sketch of what I mean is below.
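To make that concrete, here's a minimal sketch of inverse constitution learning as black-box prompt optimization: search over candidate constitution texts, scoring each by how often a constitution-conditioned judge reproduces the preference model's pairwise choices. Everything here is a hypothetical stub (`pm_prefers`, `judge_prefers`, and `mutate` would all be model calls in practice); it's an illustration, not a worked implementation.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (response_a, response_b) for a fixed prompt

def agreement(constitution: str,
              pairs: List[Pair],
              pm_prefers: Callable[[Pair], int],
              judge_prefers: Callable[[str, Pair], int]) -> float:
    """Fraction of pairs where a judge conditioned on `constitution`
    picks the same response (0 or 1) as the preference model."""
    hits = sum(judge_prefers(constitution, p) == pm_prefers(p) for p in pairs)
    return hits / len(pairs)

def hill_climb(seed_constitution: str,
               pairs: List[Pair],
               pm_prefers: Callable[[Pair], int],
               judge_prefers: Callable[[str, Pair], int],
               mutate: Callable[[str], str],
               steps: int = 200) -> str:
    """Greedy search over constitution texts; `mutate` could be an LLM
    rewriting one clause at a time, i.e. a prompt-optimization loop."""
    best = seed_constitution
    best_score = agreement(best, pairs, pm_prefers, judge_prefers)
    for _ in range(steps):
        candidate = mutate(best)
        score = agreement(candidate, pairs, pm_prefers, judge_prefers)
        if score > best_score:
            best, best_score = candidate, score
    return best
```

Restricting `pairs` to a single behavior category is what the per-behavior application would look like: you'd recover one clause of the implicit constitution at a time.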
I also think directly interpreting preference models could be important for predicting a model's downstream motivations. One cheap version of this is sketched below.
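As a hedged sketch of maybe the cheapest version: probe the preference model with minimally contrastive completion pairs and read off which traits it systematically rewards. `reward` here is a hypothetical scoring function standing in for the preference model, and the probe construction is left to the reader.

```python
from typing import Callable, Dict, List, Tuple

# (prompt, response_with_trait, response_without_trait) triples
Probe = Tuple[str, str, str]

def trait_deltas(reward: Callable[[str, str], float],
                 probes: Dict[str, List[Probe]]) -> Dict[str, float]:
    """For each named trait (e.g. sycophancy, verbosity), average the
    reward gap between matched completions that differ only in that trait.
    Large positive deltas flag traits the preference model rewards."""
    return {
        trait: sum(reward(p, a) - reward(p, b) for p, a, b in triples)
               / len(triples)
        for trait, triples in probes.items()
    }
```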
Would you say a similar critique holds for sparse autoencoders?
(edit: I've tended to think of SAEs and AOs as basically end-to-end tools for activation-space interpretability, but in hindsight I see that AOs are definitely trying to be more "lines go up" and end-to-end than SAEs, even if there are many loss-function variants for SAEs. I think I get your point now.)
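For reference, here's the vanilla SAE objective that those variants build on, as a minimal PyTorch sketch (dimensions and the L1 coefficient are illustrative; variants like TopK or JumpReLU mostly swap out the sparsity mechanism rather than the end-to-end structure):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.enc(x))  # sparse feature activations
        x_hat = self.dec(f)          # reconstruction of the activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    recon = (x - x_hat).pow(2).sum(-1).mean()  # reconstruction term
    sparsity = f.abs().sum(-1).mean()          # L1 sparsity term
    return recon + l1_coeff * sparsity
```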