formal alignment: what it is, and some proposals


what i call “formal alignment” is an approach to solving AI alignment that consists of:

  • designing a formal goal, utility function, or decision process, which actually leads to desirable outcomes when pursued

  • building an AI that pursues such a goal, utility function, or decision process

those two points correspond to formal alignment’s notions of outer and inner alignment, respectively: determining what formal thing to align the AI to, and figuring out how to build something that is indeed aligned to it without running into inner misalignment issues.

for reasons why i think this is the least hopeless path to saving the world, see my outlook on AI risk mitigation. the core motivation for formal alignment, for me, is that a working solution is at least eventually aligned: there is an objective answer to the question “will maximizing this with arbitrary capabilities produce desirable outcomes?”, and that answer does not depend, at the limit, on what is doing the maximizing. the fact that such a formal thing is aligned in the limit makes it robust to sharp left turns. what remains then is just “bridging the gap”: getting from eventual to continuous alignment, perhaps by ensuring the right ordering of attained capabilities.
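to illustrate the shape of this claim (a toy sketch of my own, not taken from any of the proposals below): a formal goal is just a function from outcomes to values, and “aligned in the limit” means the best outcome is determined by that function alone, regardless of which optimizer finds it. here two maximizers with different search strategies necessarily agree on how good their best outcome is:

```python
# toy sketch (illustrative only): a formal goal as a utility function
# over tiny "worlds" (bitstrings). a real proposal would define it over
# much richer mathematical objects.
from itertools import product

def utility(world: tuple) -> float:
    # hypothetical stand-in for a formal goal
    return sum(world) - 2 * (world[0] * world[1])

worlds = list(product([0, 1], repeat=4))

# two different "maximizers" with different search orders:
best_brute = max(worlds, key=utility)
best_greedy = max(sorted(worlds, reverse=True), key=utility)

# both reach an outcome of the same value: the formal goal, not the
# optimizer, determines what counts as a best outcome.
assert utility(best_brute) == utility(best_greedy)
```

the point of the sketch is only the last line: the question “is the argmax of this function desirable?” has a fixed answer before any particular optimizer is specified.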

potential formal alignment ideas include:

  • June Ku’s metaethical AI (MAI): describing ethics directly, i think?

  • plex’s universal alignment test (UAT): throwing a weird simulation hypothesis at the AI which encourages it to align itself

  • Vanessa Kosoy’s PreDCA: making the AI implement its human predecessor’s values (as i understand PreDCA is not designed to be used as a formal alignment goal, but it seems like it might be able to fill that role)

  • my insulated goal-programs (IGP): aligning the AI to the simple goal of running a program which we’d expect to eventually contain desirable worlds

  • my question-answer counterfactual interval (QACI): using the AI’s past user’s counterfactual answers to various questions as its signal for aligned decisions (see also my attempt at formalizing QACI)

if there are formal alignment ideas i’m missing, please tell me about them and i’ll add them here.

because these various proposals consist of putting together a formal mathematical expression, they rely on finding various true names. for example: PreDCA tries to put together the true names for causality, agency, and the AI’s predecessor; IGP requires the true name for computing a program forwards; QACI requires a true name for identifying pieces of data in causal worlds, and replacing them with counterfactual alternatives; UAT requires the true names for parent universe/simulation, control over resources, and comparing amounts of resources with those in the AI’s future lightcone.
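to give a flavor of what a true name can look like in the easiest case: of the examples above, IGP’s “computing a program forwards” is plausibly the most tractable, because it is already close to a standard mathematical object, the n-fold iteration of a transition function over machine states. a minimal sketch (my own toy formalization, not from the IGP post):

```python
# toy sketch: a candidate "true name" for running a program forwards,
# as the n-fold iteration of a step (transition) function over states.
from typing import Callable, TypeVar

State = TypeVar("State")

def run(step: Callable[[State], State], state: State, n: int) -> State:
    # the formal object: step applied n times, i.e. step^n(state)
    for _ in range(n):
        state = step(state)
    return state

# example: a trivial "machine" whose state is an integer counter
assert run(lambda s: s + 1, 0, 5) == 5
```

the other true names listed (agency, causality, counterfactuals) are much harder precisely because they have no such off-the-shelf mathematical counterpart.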

see also: clarifying formal alignment implementation