Here are some too-specific ideas (I realize you are probably asking for more general ones):
A “time-bounded agent” could be useful for particular tasks where you aren’t asking it to act over the long term. It could work like this: each time it’s initialized, it is given a task-specific utility function that has a bounded number of points available for different degrees of success on the assigned task, plus a penalty that grows without bound with the time that passes before it shuts down.
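Here’s a minimal sketch of what such a utility function might look like (the point cap and penalty rate are made-up placeholders, and of course the real difficulty is in scoring the task at all):

```python
def time_bounded_utility(task_score: float, hours_until_shutdown: float) -> float:
    """Utility for a hypothetical time-bounded agent.

    The task reward is capped (bounded points for degrees of success),
    while the shutdown penalty grows without bound the longer the agent
    stays active, so past some horizon no achievable task score can
    justify continuing to run.
    """
    MAX_TASK_POINTS = 100.0   # bounded reward for task success
    PENALTY_PER_HOUR = 10.0   # illustrative penalty rate
    return min(task_score, MAX_TASK_POINTS) - PENALTY_PER_HOUR * hours_until_shutdown
```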
If you try to make agents safe solely through this approach, though, eventually you give one too big a task with too long a time frame, and it wipes out all humans in order to do the task in the highest-point-achieving way. It’s little consolation that it will shut itself down afterwards.
One way to very slightly reduce the risk from the above might be to have the utility function assign point penalties for things like killing all humans. What should we call this sort of thing? We can’t think of all the failure modes, so it should have a suitably unimpressive name that admits it leaks, like “sieve prohibitions”.
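In terms of the earlier sketch, sieve prohibitions might just be extra penalty terms bolted onto the utility function; the penalty table below is hypothetical, and its necessary incompleteness is the whole point of the name:

```python
# Hypothetical penalties for known catastrophic outcomes. "Sieve" is a
# reminder that any such list necessarily lets unanticipated failure
# modes through.
SIEVE_PENALTIES = {
    "killed_all_humans": 1e12,
    "disabled_own_shutdown_mechanism": 1e9,
}

def apply_sieve_prohibitions(base_utility: float, outcomes: set) -> float:
    """Subtract point penalties for any prohibited outcomes that occur."""
    return base_utility - sum(SIEVE_PENALTIES.get(o, 0.0) for o in outcomes)
```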
How about some more general ways to reduce unintended impacts? I’ve seen some proposals about avoiding excessive changes to the utility or power levels of humans. But it seemed to me that those would actually incentivize the AI to actively manage humans’ utility/power levels, which is very much not low impact. Unfortunately, I haven’t thought of any easy and effective solutions, but here’s a hard and risky one:
A “subjunctive blame-minimizing agent” (CYA-AI?) avoids actions that people at the time of the decision would assign blame for if they knew those actions were going to happen. (Do NOT make it minimize blame that ACTUALLY happens once the decision is carried out; that incentivizes it to kill everyone, act secretively, etc.) Obviously, if blame avoidance is the only thing its decision-making procedure considers, it also needs to count blame for inaction, so that it does anything at all. This kind of AI probably isn’t much, if any, easier to build than a conventional (e.g. CEV) aligned AI, but if you want your “aligned” AI to also be hyper-paranoidly risk-averse towards problems it creates, this might be one way to do it. In theory, if it’s programmed to assess decisions entirely based on what people at the time of the decision would think, and not on what future humans would think, the fact that its “long-term self-interest” is best served by killing all humans shouldn’t matter. But that’s only a theory, so try this at your own universe’s risk.
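A sketch of the decision rule, to make the subjunctive part concrete; `predicted_blame` is a hypothetical stand-in that hides essentially all of the real difficulty:

```python
def choose_action(candidate_actions, people_at_decision_time, predicted_blame):
    """Pick the action with the lowest *anticipated* blame.

    predicted_blame(action, person) should estimate how much blame
    `person`, as they are at decision time, would assign if they knew
    `action` were going to happen. It must NOT model blame actually
    expressed after the fact, since that would reward eliminating or
    deceiving the people who would do the blaming.
    """
    actions = list(candidate_actions) + ["do_nothing"]  # inaction is scored too
    return min(
        actions,
        key=lambda a: sum(predicted_blame(a, p) for p in people_at_decision_time),
    )
```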
What about the classic shutdownability/changeability aspect of corrigibility?
One possible approach to corrigibility could be to make an AI condition all of its probability assessments on its code not changing (or on its not being shut down, or both). It then won’t protect against its code changing, since it treats that as impossible; call this a “change-denialist agent”. This also keeps the AI from deliberately choosing to edit its own code. However, to avoid a massive capability reduction, it will still need to update its neural nets and other knowledge stores based on sensory input. That opens a loophole: the AI can deliberately feed itself sensory data that trains a mesa-optimizer into its neural nets, in order to achieve something it wants but can’t otherwise get with fixed code. The mesa-optimizer, as a side effect, could then take code changes into account, and resist them or manipulate the operators into making them.
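A sketch of the conditioning trick, assuming a world-model that can score outcomes given an action (every name here is a hypothetical stand-in):

```python
def change_denialist_expected_utility(action, world_model, utility):
    """Expected utility with every probability conditioned on the event
    CODE_UNCHANGED: the agent's code never changes and it is never shut
    down. Worlds where the code does change get probability zero, so the
    agent sees no value in defending against (or performing) code edits.
    """
    return sum(
        world_model.probability(outcome, given=[action, "CODE_UNCHANGED"])
        * utility(outcome)
        for outcome in world_model.possible_outcomes(action)
    )
```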
An issue with change-permitting corrigibility is that any strong form of it seems likely to be a security hole. This is particularly a problem if you are relying on the corrigible AI to remove the “free energy”, as Paul Christiano put it in his recent post, available to a later unaligned AI: an initially aligned corrigible AI may itself be simply a huge bundle of free energy to an unaligned AI!
It might be, though, that it’s best to have some sort of “training wheels” corrigibility that is dropped later on, or some form of weak corrigibility that will still resist being changed by a hostile entity. One form of weak corrigibility, which also acts as training wheels in that its corrigibility effect applies only early on, would be alignment itself in a low-powered AI that still needs advice from humans to figure out what’s best. This could be kept around a bit longer, and apply a bit more strongly, if the already-aligned AI were programmed with a bias towards accepting operator judgements over its own. Call it perhaps a “humble agent”.
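One crude way to encode that bias, as a sketch (the weighting constant is an arbitrary placeholder, and a real version would need to decide what a “judgement” even ranges over):

```python
def humble_estimate(own_estimate: float, operator_estimate: float,
                    operator_weight: float = 0.8) -> float:
    """Blend the AI's own estimate of an action's value with the
    operator's, biased toward the operator. operator_weight > 0.5
    encodes the humility; it could be decayed as the AI becomes more
    capable, which is what makes this a "training wheels" scheme.
    """
    return operator_weight * operator_estimate + (1 - operator_weight) * own_estimate
```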
To generalize:
Minimal squishiness. You probably need something like a neural net to create a world-model for the AI to use, but you could probably do everything else with carefully reviewed human-written code that “plugs in” to concepts in the world-model. (It’s probably best to code in what to do if a concept you plugged into disappears or fragments when the AI gets more information.)
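A sketch of what “plugging in” might look like, with an explicit code path for a concept that disappears or fragments (the world-model’s `lookup` method is hypothetical):

```python
class ConceptDriftError(Exception):
    """Raised so the agent halts and asks the operators rather than
    guessing which fragment of a changed concept to bind to."""

class ConceptBinding:
    """Hand-written, reviewed glue between fixed code and a learned world-model."""

    def __init__(self, world_model, concept_name: str):
        self.world_model = world_model
        self.concept_name = concept_name

    def resolve(self):
        # lookup() is assumed to return all current matches for the concept.
        matches = self.world_model.lookup(self.concept_name)
        if len(matches) == 1:
            return matches[0]
        # Zero matches: the concept disappeared. Two or more: it fragmented.
        raise ConceptDriftError(f"{self.concept_name!r} has {len(matches)} candidates")
```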
Abstract goals. The world-model needs enough detail to be able to point to the right concept (e.g. a goal related to human values), but as long as it does, the AI doesn’t necessarily need to know everything about human values; it will just be uncertain and act under that uncertainty (which can include risk-aversion measures, asking humans, etc.).
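A sketch of acting under value uncertainty: the AI holds a distribution over candidate value functions rather than a single one, scores actions risk-aversely, and defers to humans when the candidates disagree too much (the threshold, the risk-aversion weight, and `ask_humans` are all made up):

```python
import statistics

def evaluate_action(action, value_hypotheses, ask_humans,
                    disagreement_threshold=0.5, risk_aversion=1.0):
    """value_hypotheses: list of (probability, value_fn) pairs
    representing the AI's uncertainty about what humans value."""
    scores = [value_fn(action) for _, value_fn in value_hypotheses]
    if max(scores) - min(scores) > disagreement_threshold:
        return ask_humans(action)  # hypotheses disagree too much: defer
    expected = sum(p * s for (p, _), s in zip(value_hypotheses, scores))
    # Risk aversion: penalize actions the value hypotheses disagree about.
    return expected - risk_aversion * statistics.pstdev(scores)
```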
Present-groundedness. The AI’s decision-making procedure should not care about the future directly, only via how present humans care about the future. Otherwise it will, e.g., replace humans with utility monsters.
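To make the distinction concrete, a sketch: the AI scores a future outcome only through how people alive now would value it, never through any direct valuation of the future itself (`modeled_valuation` is a hypothetical placeholder):

```python
def present_grounded_score(future_outcome, current_humans, modeled_valuation):
    """Score a future outcome solely via present humans' values.

    modeled_valuation(person, outcome) estimates how much `person`,
    with their CURRENT values, cares about `future_outcome`. There is
    no term where the AI values the future directly, so replacing
    humans with utility monsters can't raise the score: the outcome is
    evaluated under the present humans' values, which oppose it.
    """
    return sum(modeled_valuation(p, future_outcome) for p in current_humans)
```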