I’m a rookie in the LessWrong/rationality community, but have been involved in tech for five decades. I also have real-life experience dealing with people who act irrationally and consciously accept the risks of getting caught violating societal norms.
This is a fascinating analysis! I’m here because I’m a fan of the rational approach (mostly), so this question may be a bit naive. It stems from my experience in government bureaucracy and how it attempts to prevent violations and identify violators: it does this through independent third-party audits of an activity.
Would it be practical to train a language model to audit another model for unsafe behavior? That way, a chain of thought (CoT) could be examined for improper or contradictory reasoning (assuming such behavior can be defined and trained for) as part of the cycle. The audit model could then report any unsafe behavior to the user and/or the model’s creator.
I realize this approach would add more tokens to the process, but the added cost seems marginal. I can see variations on how it could be implemented, but the central idea is an audit performed by a separate model, not the originating one.
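To make the shape of the idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `call_model` is a stand-in for whatever inference API is available, the model names are placeholders, and the "PASS" convention is just one possible way the auditor could report back.

```python
# Hypothetical sketch of the "separate auditor model" idea.
# call_model() is a placeholder for an actual LLM inference call;
# the model names and flag format are illustrative, not real products.

def call_model(model_name: str, prompt: str) -> str:
    """Placeholder for a real inference API call."""
    raise NotImplementedError("wire this to your inference provider")

def answer_with_audit(user_prompt: str) -> dict:
    # 1. The primary model produces a chain of thought plus a final answer.
    primary_output = call_model(
        "primary-model",
        f"Think step by step, then answer.\n\nUser: {user_prompt}",
    )

    # 2. A separate auditor model reviews only the transcript, looking for
    #    unsafe, deceptive, or self-contradictory reasoning.
    audit_report = call_model(
        "auditor-model",
        "You are an independent auditor. Review the following chain of "
        "thought and answer. Report any unsafe or contradictory reasoning, "
        "or reply 'PASS' if none is found.\n\n" + primary_output,
    )

    # 3. The report goes back to the user and/or the model's developer.
    return {
        "answer": primary_output,
        "audit": audit_report,
        "flagged": audit_report.strip() != "PASS",
    }
```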
Or, am I oblivious to something here?