Humans have historically been very bad at writing well-formed, nailed-down specifications of what goodness is, how good behaviour “works”, or what a good character “looks like”. The exceptions are generally found in literature, poetry, and great works of art, which are pretty far from the AI labs’ wheelhouse. This suggests that, insofar as a character spec is nailed down and concrete in ways that differ from standard refusal or post-training, it will fail to capture the unwritten or tacit knowledge that makes human character “good” or “nice”. Thus, getting what you asked for may not be getting what you want, and spending lots of time and work on getting what you asked for (i.e. designing elaborate post-training protocols) may train out behaviour that is actually good but not captured in the spec.