Maximally efficient agents will probably have an anti-daemon immune system

(the ideas in this post came out of a conversation with Scott, Critch, Ryan, and Tsvi, plus a separate conversation with Paul)

Consider the problem of optimization daemons. I argued previously that daemons shouldn’t be a problem for idealized agents, since idealized agents can just update on the logical observations of their subagents.

I think something like this is probably true in some cases, but it probably isn’t true in full generality. Specifically, consider:

  1. It’s going to be difficult to centralize all logical knowledge. Probably, in a maximally efficient agent, logical knowledge will be stored and produced in some kind of distributed system. For example, an ideal agent might train simple neural networks to perform some sub-tasks. In this case, the neural networks might be misaligned subagents.

  2. If the hardware the agent is running on is not perfect, then there will be a tradeoff between ensuring subagents have the right goals (through error-correcting codes) and efficiency.

  3. Even if hardware is perfect, perhaps approximation algorithms for some computations are much more efficient, and the approximation can cause misalignment (similar to hardware failures). In particular, Bayesian inference algorithms like MCMC will return incorrect results with some probability. If inference algorithms like these are used to choose the goals of subagents, then the subagents will be misaligned with some probability (see the sketch just below this list).
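
As a toy illustration of point 3 (a sketch of my own, not something from the conversations above): a short Metropolis-Hastings run is used to estimate which of two goal specifications is better supported, and with some probability the finite chain gets stuck in the wrong mode, so the subagent is configured with the wrong goal. The bimodal target, the thresholding rule in `configure_subagent`, and all of the names here are hypothetical.

```python
import math
import random

def log_target(theta):
    # Toy bimodal "posterior" over a goal parameter: a mixture of two Gaussians
    # with weights 0.3 and 0.7, so the true P(theta > 0) is about 0.7.
    def normal_pdf(x, mu, sigma):
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    return math.log(0.3 * normal_pdf(theta, -2.0, 0.5) + 0.7 * normal_pdf(theta, 2.0, 0.5))

def mcmc_estimate(n_steps, step_size, start):
    """Metropolis-Hastings estimate of P(theta > 0) from a single short chain."""
    theta = start
    hits = 0
    for _ in range(n_steps):
        proposal = theta + random.gauss(0.0, step_size)
        # Accept with probability min(1, target(proposal) / target(theta)).
        if random.random() < math.exp(min(0.0, log_target(proposal) - log_target(theta))):
            theta = proposal
        if theta > 0:
            hits += 1
    return hits / n_steps

def configure_subagent(estimate):
    # Hypothetical goal-setting rule: give the subagent goal A iff the
    # inference step says goal A is more probably correct.
    return "goal_A" if estimate > 0.5 else "goal_B"

random.seed(0)
wrong = 0
for _ in range(100):
    est = mcmc_estimate(n_steps=200, step_size=0.3, start=random.uniform(-4.0, 4.0))
    if configure_subagent(est) != "goal_A":  # goal_A is correct, since P(theta > 0) ≈ 0.7
        wrong += 1
print(f"short chains mis-set the subagent's goal in {wrong}/100 runs")
```

Running longer chains (or restarting from many points) drives the failure rate down, but only by spending more computation on each inference, which is the efficiency tradeoff at issue.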

Problems like these imply that maximally efficient agents are going to have daemons and spend some portion of their resources on anti-daemon measures (an “immune system”).

At a very rough level, we could model an agent as a tree with a supergoal at the top level, subagents with subgoals at the next level, subagents of those subagents at the next level, and so on (similar to hierarchical planning). Each level in the hierarchy allows some opportunity for the goal content to be corrupted, producing a daemon.
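
As a back-of-the-envelope version of this toy model (my own numbers and assumptions, not part of the post): suppose the hierarchy is a complete b-ary tree of depth d, and each act of delegation corrupts the subgoal independently with probability p. Even tiny per-delegation corruption rates then make a daemon somewhere in the hierarchy quite likely:

```python
def daemon_stats(b, d, p):
    # Complete b-ary delegation tree of depth d; each delegation edge corrupts
    # the subgoal independently with probability p (a toy assumption).
    edges = sum(b ** k for k in range(1, d + 1))  # delegation edges = non-root nodes
    p_any_daemon = 1 - (1 - p) ** edges           # chance that some subagent is corrupted
    p_leaf_aligned = (1 - p) ** d                 # chance a depth-d subagent is still aligned
    return edges, p_any_daemon, p_leaf_aligned

for p in (1e-6, 1e-4, 1e-2):
    edges, p_any, p_leaf = daemon_stats(b=10, d=5, p=p)
    print(f"p={p:g}: {edges} delegations, P(some daemon)={p_any:.3f}, "
          f"P(depth-5 subagent aligned)={p_leaf:.4f}")
```

Keeping p small enough that daemons stay rare across that many delegations is exactly what the error-correction and approximation overheads above have to pay for.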

I suspect that at least some anti-daemon measures are compatible with maximal efficiency. Roughly: if an AI system is taken over by daemons, then the daemons that take it over will themselves need some sort of anti-daemon measures, or else be taken over by daemons in turn. Unless it’s “daemons all the way down”, at some point this regress bottoms out at an agent with effective anti-daemon measures. (This isn’t very formal; I think I would need to look at the problem from a different angle to get a crisp formal statement out of it).

(as an analogy, consider whale cancer. Cancer can take over an individual whale, but the cancer will then be unable to gather food, and it will additionally need anti-cancer mechanisms of its own. Any biological organism capable of pursuing the same convergent instrumental goals that whales pursue will need to spend some of its resources on anti-cancer mechanisms, so cancer is not a particular problem for whale values, compared with the values of alternative agents who pursue different values through the same instrumental strategies that whales do.)

Daemons are a problem for agents with platonic goals (i.e. agents who only care about a logical world). If we give the agent the goal of maximizing utility in an idealized computing environment, but then run the agent on insecure hardware (e.g. hardware vulnerable to row hammer attacks), then the agent will develop daemons plus an anti-daemon immune system, but the immune system will only be optimized to prevent attacks that are possible in the idealized computing environment. For example, it will invest in an immune system against subagents corrupted by MCMC approximation errors, but the immune system won’t take into account that the subagents can exploit row hammer attacks.

(note that there are other problems with platonic goals; most ways of doing efficient logical reasoning involve an empirical component (e.g. logical inductors), and the empirical component could find faults in the hardware. But the problem with daemons makes it clearer that we should expect some of the results of hardware failures to be adversarial.)

It’s possible that low impact could help here: if we can’t specify the world that we care about (i.e. our physical world), then maybe we can design an agent with a motivation to have low impact across the board, so that in particular it will desire to have a low impact on our world.

It’s also possible that it’s worth sacrificing maximal efficiency for better anti-daemon measures. But then there must be some argument for why this doesn’t severely cripple the agent’s performance.

Consider human society. What if, in an effort to control defection at every meta level, the world as a whole spent 10% more on security to avoid being taken over by nations; each nation spent 10% more on security to avoid being taken over by cities; each city spent 10% more on security to avoid being taken over by individuals; and so on all the way down to organelles? Then the total amount of spending on security would grow by far more than 10%, since the overhead at each level compounds with the overheads at the levels below it.
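
A quick way to see the compounding (a sketch under the simplifying assumption that each level’s activity, security spending included, is nested inside the level above, so the overheads multiply rather than add; the particular list of levels is only illustrative):

```python
# Each nested level pays a 10% security overhead on everything it contains,
# so the overheads compound multiplicatively across levels.
levels = ["world", "nation", "city", "individual", "cell", "organelle"]
overhead_per_level = 1.10  # 10% extra spending on security at each level

total_factor = overhead_per_level ** len(levels)
print(f"{len(levels)} nested levels at +10% each -> total cost factor "
      f"{total_factor:.2f}x, i.e. {100 * (total_factor - 1):.0f}% extra, not 10%")
```

With more levels the factor grows geometrically, which is the “more than a constant factor” worry below.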

If the AI daemon problem has a “fractal” flavor similar to that of human society, then in the limit, better anti-daemon measures (e.g. spending 10% more time on each MCMC inference) lead to more than a constant factor of slowdown. I am not sure how realistic the “fractal” model is, but in any case its plausibility seems like an obstacle to a formal analysis of AI alignment.