Toy Problem: Detective Story Alignment

Suppose I train some simple unsupervised topic model (e.g. LDA) on a bunch of books. I look through the topics it learns, and find one corresponding to detective stories. The problem: I would like to use the identified detective-story cluster to generate detective stories from GPT.

The hard part: I would like to do this in such a way that the precision of the notion of detective-stories used by the final system is not limited by the original simple model.

Here’s what that means, visually. The space of real-world books has some clusters in it:

One of those clusters is the detective-story cluster. The simple model approximates those clusters using something simple—for the sake of visualization, ellipses:

The more complex model (e.g. GPT) presumably has a much more precise approximation of the shape of the clusters:

So, we’d like to use the simple model to identify one of the clusters, but then still use the full power of the complex model to sample from that cluster.

Of course, GPT may not contain a single variable corresponding to a cluster-id, which is largely what makes the problem interesting. GPT may not internally use a notion of “cluster” at all. However, the GPT model should still contain something (approximately) isomorphic to the original cluster: since there is a real cluster of “detective stories” in the data/environment itself, the GPT model should also contain that cluster, to the extent that it matches the data/environment.

In particular, the “precision not limited by original model” requirement rules out the obvious strategy of generating random samples from GPT and selecting those which the simple model labels as detective-stories. If we do that, then we’ll end up with some non-detective-stories in the output, because of shortcomings in the simple model’s notion of detective-stories. Visually, we’d be filtering based on the ellipse approximation of the cluster, which is exactly what we want to avoid.
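The ruled-out baseline can be sketched in a few lines. Both the generator and the simple model's label here are hypothetical stand-ins (a canned sample list and a crude keyword check), chosen so the filter's imprecision is visible: a non-detective story that merely mentions the keyword slips through.

```python
# Sketch of the rejected strategy: sample from the complex model, keep
# only what the simple model labels "detective story". The stand-ins
# below are illustrative, not real models.
import random

def sample_from_gpt(rng):
    # Stand-in for sampling a book from the complex generative model.
    return rng.choice([
        "a detective story about a stolen painting",
        "a fantasy story about a dragon",
        "a courtroom drama that name-drops a detective once",
    ])

def simple_model_says_detective(text):
    # Stand-in for the simple model's cluster label -- a crude keyword
    # check, i.e. the "ellipse" approximation of the true cluster.
    return "detective" in text

def filtered_samples(n, seed=0):
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        s = sample_from_gpt(rng)
        if simple_model_says_detective(s):
            out.append(s)
    return out

print(filtered_samples(3))
```

Note that the courtroom drama passes the filter despite not being a detective story: the output's precision is capped by the simple model's label, which is exactly the limitation the problem statement forbids.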

(Note: I am intentionally not giving a full mathematical formalization of the problem. Figuring out the right formalization is part of the problem—arguably the hard part.)

Why Is This Interesting?

This is a toy model for problems like:

  • Representing stable pointers to values

  • Producing an aligned successor AI from an aligned initial AI

  • Producing an AI which can improve its notion of human values over time

Human values are conceptually tricky, so rather than aligning to human values, this toy problem aligns to detective novels. The toy problem involves things like:

  • Representing stable pointers to the concept of detective-stories

  • Producing a successor detective-story-model from an initial detective-story-model

  • Producing a model which can improve its notion of “detective-stories” over time

Ideally, a solution to this problem would allow us to build a detective-story-generator with a basin of attraction: given a good-enough initial notion of detective-stories, its notion of detective-stories would improve over time and eventually converge to the “real” notion. Likewise with human values: ideally, we could build a system which converges to “perfect” alignment over time as its world-model improves, as long as the initial notion of human values is good enough.