Goal alignment without alignment on epistemology, ethics, and science is futile

Originally posted as a comment to the post “Misgeneralization as a misnomer” by Nate Soares.

I am starting to suspect that the concepts of “goals” or “pursuits”, and therefore “goal alignment” or “goal (mis)generalisation”, are not very important.

In Active Inference ontology, the concepts of predictions (of the future states of the world) and plans (i.e., {s_1, a_1, s_2, a_2, …} sequences, where s_i are predicted states of the world, and a_i are planned actions) are much more important than “goals”. Active Inference agents contemplate different plans and ultimately end up performing the first step of the plan (or of a set of plans, marginalizing over the probability mass of the plans) that appears to minimise the free energy functional.

Intermediate s_i states of the world in the plans contemplated by the agent can be seen as “goals”, but the important distinction is that these are merely tentative predictions that could be changed or abandoned at every step.
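To make the plan-selection picture a bit more concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made for illustration: the three-state world, the `TRANSITION` and `PREFERRED` tables, and the function names are hypothetical, and the score is only the “risk” part of expected free energy (the KL divergence between predicted and preferred states), not the full functional. The agent enumerates plans, weights them by a softmax over their negative scores, and marginalises over plans to get a distribution over the first action only.

```python
import itertools
import math

# Hypothetical toy world (an assumption, not from the post): 3 states, 2 actions.
STATES = [0, 1, 2]
ACTIONS = ["stay", "move"]

# TRANSITION[action][s][s2] = probability of ending in s2 when acting in s.
TRANSITION = {
    "stay": [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]],
    "move": [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]],
}

# Prior preferences over states: part of the agent's generative model.
PREFERRED = [0.05, 0.15, 0.80]


def step(belief, action):
    """Predict the next-state distribution s_{i+1} from belief s_i and action a_i."""
    t = TRANSITION[action]
    return [sum(belief[s] * t[s][s2] for s in STATES) for s2 in STATES]


def kl(p, q):
    """KL divergence between two discrete distributions (terms with p=0 are skipped)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def plan_score(belief, plan):
    """Simplified expected free energy: the risk term only, summed along the plan."""
    total = 0.0
    for action in plan:
        belief = step(belief, action)  # intermediate predicted state s_i
        total += kl(belief, PREFERRED)
    return total


def first_action_posterior(belief, horizon=3, precision=1.0):
    """Softmax over plans, then marginalise out everything but the first action."""
    plans = list(itertools.product(ACTIONS, repeat=horizon))
    scores = [-precision * plan_score(belief, p) for p in plans]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    z = sum(weights)
    posterior = {a: 0.0 for a in ACTIONS}
    for plan, w in zip(plans, weights):
        posterior[plan[0]] += w / z
    return posterior


belief = [1.0, 0.0, 0.0]  # the agent believes it is in state 0
posterior = first_action_posterior(belief)
best = max(posterior, key=posterior.get)
```

Only `best` is executed; at the next step the agent would update `belief` and re-run the whole procedure, so the intermediate states of any earlier plan carry no commitment. Note also that changing `TRANSITION` (the “science”) or `PREFERRED` changes which plans win, without any separate “goal” object appearing in the code.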

Thus, the crux of alignment is aligning the generative models of humans and AIs. Generative models could be “decomposed”, vaguely (there is a lot of intersection between these categories), into

  • Methodology: the mechanics of the models themselves (i.e., epistemology, rationality, normative logic, ethical deliberation),

  • Science: mechanics, or “update rules/laws” of the world (such as the laws of physics or the heuristic learnings about society, economy, markets, psychology, etc.), and

  • Fact: the state of the world (facts, or inferences about the current state of the world: CO2 level in the atmosphere, the suicide rate in each country, distance from Earth to the Sun, etc.)

These, we can conceptualise, give rise to “methodological alignment”, “scientific alignment”, and “fact alignment” respectively. Evidently, methodological alignment is the most important: in principle, it allows for alignment on science, and methodology plus science helps to align on facts.

In theory, if humans and AIs are aligned on their generative models (i.e., if there is methodological, scientific, and fact alignment), then goal alignment, even if sensible to talk about, will take care of itself: indeed, starting from the same “factual” beliefs, and using the same principles of epistemology, rationality, ethics, and science, people and AIs should in principle arrive at the same predictions and plans.

Conversely, if methodological and scientific alignment is poor (fact alignment, as the least important, should take care of itself at least if methodological and scientific alignment is good), it’s probably futile to try to align on “goals”: it’s just bound to “misgeneralise” or otherwise break down under different methodologies and scientific views.

And yes, it seems that to even have a chance to align on methodology, we should first learn it, that is, develop a robust theory of intelligent agents where sub-theories of epistemology, rationality, logic, and ethics cohere together. In other words, this is MIRI’s early “blue sky” agenda of “solving intelligence”.

Concrete example: “happiness” in the post sounds like a “predicted” future state of the world (where “all people are happy”), which implicitly leverages certain scientific theories (e.g., what it means for people to be happy), epistemology (how do we know that people are happy), and ethics: does the predicted plan of moving from the current state of the world, where not all people are happy, to the future state of the world, where all people are happy, conform with our ethical and moral theories? Does it matter how many people are happy? Does it matter whether other living beings become unhappy in the course of this plan, and to what degree? Does it matter whether AIs are happy or not? Wouldn’t it be more ethical to “solve happiness” or “remove unhappiness” via human-AI merge, mind upload, or something else like that? And on and on.

Thus, without aligning with AI on epistemology, rationality, ethics, and science, “asking” AIs to “make people happy” is just a gamble with infinitesimal chances of “winning”.
