Formalising Catastrophic Goodhart

This sequence partially overlaps with our paper Extinction Risks from AI: Invisible to Science?

In this sequence of posts, I will try to lay out some of my thoughts on extreme versions of over-optimisation and their relation to extinction risk from AI.

Some ways in which I hope this will be useful, roughly in order of importance:

  1. This seems important for advocacy. Some of these findings indicate that it might be exceedingly difficult to create good formal models that show that extinction from AI is a threat—even if it is. At the same time, it seems there are quite a few people who are sceptical of AI risk precisely because of the inability to give a rigorous description of the threat model. Consequently, I think it should be possible to re-package some of the ideas to be (possibly very) useful for advocacy. But I am no expert on this, so this is where others can help.

  2. Introducing some simple concepts that improve alignment discussions. In particular, I quite like the notion of a weak Extinction-level Goodhart’s Law, informally defined as “if you optimise an imperfectly aligned goal sufficiently hard, you die”. I like it because it separates two claims: you can agree with (some version of) this law, yet remain sceptical about AI risk because you expect “sufficiently hard” to be unreachable in practice. (A rough formal sketch of this distinction follows the list below.)

  3. Pointing towards a research agenda. Some of the notions I discuss are, at the moment, only informally defined. However, I think the posts point at a research agenda on “formalising Catastrophic Goodhart’s Law”, in the sense that they can be read as “If we ever get a formal theory of this topic, here is what some of the important bits—examples, observations, results, and theorems—are going to be. Now please fill in the details.”
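
To make the distinction in point 2 slightly more concrete, here is one way the weak and quantitative versions could be written down. The notation below is purely illustrative and not something the posts commit to; it is meant as an example of the kind of statement the research agenda in point 3 would eventually need to make precise.

```latex
% Illustrative notation only -- not notation the posts themselves use:
%   U^*                  -- the intended ("true") goal
%   \hat{U}              -- the imperfectly aligned proxy goal that actually gets optimised
%   \pi_\lambda(\hat{U}) -- the policy obtained by optimising \hat{U} with optimisation power \lambda
%   X(\pi)               -- the event "running policy \pi leads to human extinction"

% Weak Extinction-level Goodhart's Law (qualitative):
% for any imperfect proxy, *some* amount of optimisation pressure suffices for extinction.
\[
  \hat{U} \neq U^* \;\Longrightarrow\;
  \exists\, \lambda_0 \;\; \forall\, \lambda \ge \lambda_0 :\;
  \Pr\!\big[ X(\pi_\lambda(\hat{U})) \big] \approx 1 .
\]

% Quantitative version: additionally claims that the threshold \lambda_0 lies below
% the optimisation power \Lambda_{\mathrm{feasible}} that AI systems will actually reach,
% i.e. that "sufficiently hard" is reachable in practice.
\[
  \lambda_0 \;\le\; \Lambda_{\mathrm{feasible}} .
\]
```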

At the moment, the sequence contains the following posts:

  1. Weak vs Quantitative Extinction-level Goodhart’s Law
  2. Extinction Risks from AI: Invisible to Science?
  3. Which Model Properties are Necessary for Evaluating an Argument?
  4. Dynamics Crucial to AI Risk Seem to Make for Complicated Models
  5. Extinction-level Goodhart’s Law as a Property of the Environment