List of resolved confusions about IDA

AI Alignment is a confusing topic in general, but even compared to other alignment topics, IDA seems especially confusing. Some of it is surely just due to the nature of communicating subtle and unfinished research ideas, but other confusions can be cleared up with more specific language or additional explanations. To help people avoid some of the confusions I or others fell into in the past while trying to understand IDA (and to remind myself about them in the future), I came up with this list of past confusions that I think have mostly been resolved at this point. (However there’s some chance that I’m still confused about some of these issues and just don’t realize it. I’ve included references to the original discussions where I think the confusions were cleared up so you can judge for yourself.)

I will try to maintain this list as a public reference, so please provide your own resolved confusions in the comments.

alignment = intent alignment

At some point Paul started using “alignment” to refer to the top-level problem that he is trying to solve, and this problem is narrower (i.e., leaves more safety problems to be solved elsewhere) than the problem that other people were using “alignment” to describe. He eventually settled upon “intent alignment” as the formal term to describe his narrower problem, but occasionally still uses just “aligned” or “alignment” as shorthand for it. Source

short-term preferences ≠ narrow preferences

At some point Paul used “short-term preferences” and “narrow preferences” interchangeably, but no longer does (or at least no longer endorses doing so). Source

preferences = “actual” preferences (e.g., preferences-on-reflection)

When Paul talks about preferences he usually means “actual” preferences (for example the preferences someone would arrive at after having a long time to think about it while having access to helpful AI assistants, if that’s a good way to find someone’s “actual” preferences). He does not mean their current revealed preferences or the preferences they would state or endorse now if you were to ask them. Source

corrigibility ≠ based on short-term preferences

I had misunderstood Paul to be using “corrigibility to X” as synonymous with “based on X’s short-term preferences”. Actually “based on X’s short-term preferences” is a way to achieve corrigibility to X, because X’s short-term preferences likely include “be corrigible to X” as a preference. “Corrigibility” itself means something like “allows X to modify the agent” or a generalization of this concept. Source

act-based = based on short-term preferences-on-reflection

My understanding is that “act-based agent” used to mean something different (i.e., a simpler kind of AI that tries to do the same kind of action that a human would), but most people nowadays use it to mean an AI that is designed to satisfy someone’s short-term preferences-on-reflection, even though that no longer seems particularly “act-based”. Source

act-based corrigibility

Evan Hubinger used “act-based corrigibility” to mean both a method of achieving corrigibility (based on short-term preferences) and the kind of corrigibility achieved by that method. (I’m not sure if he still endorses using the term this way.) Source

learning user preferences for corrigibility isn’t enough for corrigible behavior

Because an act-based agent is about “actual” preferences, not “current” preferences, it may be incorrigible even if it correctly learns that the user currently prefers the agent to be corrigible, if it incorrectly infers or extrapolates the user’s “actual” preferences, or if the user’s “actual” preferences do not actually include corrigibility as a preference. (ETA: Although in the latter case presumably the “actual” preferences include something even better than corrigibility.) Source

distill ≈ RL

Summaries of IDA often describe the “distill” step as using supervised learning, but Paul and others working on IDA today usually have RL in mind for that step. Source
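
To make the distinction concrete, here is a minimal toy sketch (my own illustration, not anything from the IDA literature): under the supervised-learning reading, the distilled agent is trained directly on answers demonstrated by the amplified overseer, while under the RL reading it only receives a scalar reward. Every function, parameter, and number below is a hypothetical stand-in.

```python
# Toy illustration of two readings of "distill" (all functions here are
# hypothetical stand-ins, not part of any actual IDA implementation).
import random

def amplified_overseer(x):
    """Stand-in for the amplified overseer: the behavior we want the fast agent to learn."""
    return 2 * x

# Reading 1: distill via supervised learning on the overseer's demonstrations.
def distill_supervised(steps=2000, lr=0.05):
    w = 0.0                                   # one-parameter agent: action = w * x
    for _ in range(steps):
        x = random.uniform(-1, 1)
        target = amplified_overseer(x)        # overseer demonstrates the answer
        w -= lr * 2 * (w * x - target) * x    # gradient step on squared error
    return w

# Reading 2: distill via RL; the agent only sees a scalar reward, not the answer.
def distill_rl(steps=5000, lr=0.01, sigma=0.1):
    w = 0.0
    for _ in range(steps):
        x = random.uniform(-1, 1)
        action = w * x + random.gauss(0, sigma)              # explore
        reward = -(action - amplified_overseer(x)) ** 2      # overseer's evaluation
        baseline = -(w * x - amplified_overseer(x)) ** 2     # variance reduction
        # REINFORCE-style update toward actions the overseer rewards
        w += lr * (reward - baseline) * (action - w * x) * x / sigma ** 2
    return w

if __name__ == "__main__":
    print("supervised distillation learned w ~", round(distill_supervised(), 2))
    print("RL distillation learned w ~", round(distill_rl(), 2))
```

In the actual proposal, roughly, the reward signal would itself come from the amplified overseer evaluating the agent’s behavior, rather than from a hand-written function like the toy one above.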

outer alignment problem exists? = yes

The existing literature on IDA (including a post about “reward engineering”) seems to have neglected to describe an outer alignment problem associated with using RL for distillation. (Analogous problems may also exist if using other ML techniques such as SL.) Source

corrigible to the user? ≈ no

IDA is typically described as being corrigible to the user. But in reality it would be trying to satisfy a combination of preferences coming from the end user, the AI developer/overseer, and even law enforcement or other government agencies. I think this means that “corrigible to the user” is very misleading, because the AI is actually not likely to respect the user’s preferences to modify (most aspects of) the AI or to be “in control” of the AI. Sources: this comment and a talk by Paul at an AI safety workshop.

strategy stealing ≠ literally stealing strategies

When Paul says “strategy stealing” he doesn’t mean observing and copying someone else’s strategy. It’s a term borrowed from game theory that he’s using to refer to coming up with strategies that are as effective as someone else’s strategy in terms of gaining resources and other forms of flexible influence. Source