I’d like to attempt a compact description of the core dilemma being expressed here.
Consider the expression: y = x^a - x^b, where ‘y’ represents the impact of AI on the world (positive is good), ‘x’ represents the AI’s capability, ‘a’ represents the rate at which the power of the control system scales, and ‘b’ represents the rate at which the surface area of the system that needs to be controlled (for it to stay safe) scales.
(Note that this is assuming somewhat ideal conditions, where we don’t have to worry about humans directing AI towards destructive ends via selfishness, carelessness, malice, etc.)
If b > a, then as x increases past 1, y gets increasingly negative; indeed, y can only be positive when x is less than 1. But keeping x below 1 represents a severe limitation on capabilities, enough to prevent the AI from doing anything significant enough to hold the world on track towards a safe future, such as preventing other AIs from being developed.
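Spelling out the algebra behind that sign claim (a minimal derivation, assuming x > 0 and 0 < a < b):

```latex
% Factor out x^a (valid for x > 0, 0 < a < b):
\[
  y \;=\; x^a - x^b \;=\; x^a\left(1 - x^{\,b-a}\right).
\]
% Since x^a > 0, the sign of y is the sign of 1 - x^{b-a}:
%   x < 1  =>  x^{b-a} < 1  =>  y > 0
%   x = 1  =>  y = 0
%   x > 1  =>  x^{b-a} > 1  =>  y < 0, and the x^b term dominates, so y -> -infinity.
```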
There are two premises here, and thus two relevant lines of inquiry:
1) b > a, meaning that complexity scales faster than control.
2) When x < 1, the AI can’t accomplish anything significant enough to avert disaster (see the numerical sketch after this list).
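A quick numerical illustration of how little headroom this leaves (a minimal sketch; the exponents a = 2 and b = 3 are arbitrary placeholders chosen only so that b > a):

```python
# Toy model y = x^a - x^b from above.
# a = 2 and b = 3 are illustrative placeholders; any pair with b > a gives the same shape.
def impact(x, a=2.0, b=3.0):
    return x**a - x**b

for x in [0.25, 0.5, 0.75, 1.0, 1.5, 3.0, 10.0]:
    print(f"x = {x:6.2f}  ->  y = {impact(x):+10.3f}")

# Shape of the output: y is a small positive bump on 0 < x < 1 (peaking here at
# y = 4/27 ≈ 0.148 when x = 2/3), hits zero at x = 1, then plunges as x grows.
```

The point is just that, under b > a, the only regime where the model calls the AI net positive is also the regime where its attainable positive impact is capped at a small value.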
Arguments and thought experiments in which the AI builds powerful security systems can be categorized as challenges to premise 1; thought experiments in which the AI limits its range of actions to prevent unwanted side effects, while simultaneously preventing destruction from other sources (including other AIs being built), are challenges to premise 2.
Both of these premises seem like factual statements about how AI actually works. I am not sure what to look for in terms of proving or disproving them (I’ve seen some writing on this relating to control theory, but the logic was a bit too complex for me to follow at the time).