Toy Problem: Detective Story Alignment

Suppose I train some simple unsupervised topic model (e.g. LDA) on a bunch of books. I look through the topics it learns, and find one corresponding to detective stories. The problem: I would like to use the identified detective-story cluster to generate detective stories from GPT.

The hard part: I would like to do this in such a way that the precision of the notion of detective-stories used by the final system is not limited by the original simple model.

Here’s what that means, visually. The space of real-world books has some clusters in it.

One of those clusters is the detective-story cluster. The simple model approximates those clusters using something simple—for the sake of visualization, ellipses.

The more complex model (e.g. GPT) presumably has a much more precise approximation of the shape of the clusters.

So, we’d like to use the simple model to identify one of the clusters, but then still use the full power of the complex model to sample from that cluster.

Of course, GPT may not contain a single variable corresponding to a cluster-id, which is largely what makes the problem interesting. GPT may not internally use a notion of “cluster” at all. However, since there is a real cluster of “detective stories” in the data/environment itself, the GPT model should still contain something (approximately) isomorphic to that real cluster, to the extent that the GPT model matches the data/environment.

In particular, the “precision not limited by original model” requirement rules out the obvious strategy of generating random samples from GPT and selecting those which the simple model labels as detective-stories. If we do that, then we’ll end up with some non-detective-stories in the output, because of shortcomings in the simple model’s notion of detective-stories. Visually, we’d be filtering based on the ellipse approximation of the cluster, which is exactly what we want to avoid.
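The failure of the rejection-sampling strategy can be illustrated geometrically. In the toy sketch below, the “true” cluster is a crescent-shaped region and the simple model approximates it with a disc (the ellipse in the picture); both shapes are invented for illustration. Filtering uniform samples through the disc necessarily keeps some points outside the true cluster.

```python
# Toy illustration of why filtering with the simple model caps precision.
# Both regions below are made-up stand-ins: the "true" detective cluster is a
# ring segment, and the simple model's notion of it is a disc.
import numpy as np

rng = np.random.default_rng(0)

def in_true_cluster(p):
    # "Real" cluster: a crescent (ring segment) in 2D.
    r = np.linalg.norm(p)
    return 0.8 < r < 1.2 and p[0] > 0

def simple_model_accepts(p):
    # Simple model: a disc that roughly covers the crescent, plus extra area.
    return np.linalg.norm(p - np.array([1.0, 0.0])) < 0.7

samples = rng.uniform(-1.5, 1.5, size=(20000, 2))
accepted = [p for p in samples if simple_model_accepts(p)]
false_pos = [p for p in accepted if not in_true_cluster(p)]

print(f"accepted: {len(accepted)}, outside true cluster: {len(false_pos)}")
```

However many samples we draw, the accepted set inherits the disc’s shape rather than the crescent’s, which is exactly the “precision limited by the simple model” failure described above.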

(Note: I am intentionally not giving a full mathematical formalization of the problem. Figuring out the right formalization is part of the problem—arguably the hard part.)

Why Is This Interesting?

This is a toy model for problems like:

  • Representing stable pointers to values

  • Producing an aligned successor AI from an aligned initial AI

  • Producing an AI which can improve its notion of human values over time

Human values are conceptually tricky, so rather than aligning to human values, this toy problem aligns to detective stories. The toy problem involves things like:

  • Representing stable pointers to the concept of detective-stories

  • Producing a successor detective-story-model from an initial detective-story-model

  • Producing a model which can improve its notion of “detective-stories” over time

Ideally, a solution to this problem would allow us to build a detective-story-generator with a basin of attraction: given a good-enough initial notion of detective-stories, its notion of detective-stories would improve over time and eventually converge to the “real” notion. Likewise with human values: ideally, we could build a system which converges to “perfect” alignment over time as its world-model improves, as long as the initial notion of human values is good enough.