I have a post from a while back with a section that aims to do much the same thing you’re doing here, and which agrees with a lot of your framing. There are some differences though, so here are some scattered thoughts.
One key difference is that what you call “inner alignment for characters”, I prefer to think of as an outer alignment problem, to the extent that the division feels slightly weird. I find this more compelling because it maps more cleanly onto the question of what we want our model to be doing, assuming we’re sure that that’s what it’s actually doing. If our generative model learns a prior such that Azazel is easily accessible by prompting, then that’s not a very safe prior, and therefore not a good training goal to have in mind for the model. In the case of characters, what’s the difference between the two alignment problems, when both are functionally about wanting certain characters and getting other ones because you interacted with the prior in weird ways?
I think a crux here might be that I don’t really get why separating the inner and outer alignment framings in this form is useful. As stated, the outer alignment problems in both cases feel… benign? Like, in the vein of “these don’t pose a lot of risk as stated, unless you make them broad enough that they encroach onto the inner alignment problems”, rather than because explicit reasoning about a class of potential problems looks optimistic. As a result, the bulk of the problem really is just inner alignment for characters and simulators, and since the former is a subpart of the outer alignment problem for simulators, it just feels like the “risk” aspect collapses down into outer and inner alignment for simulators again.