Disclaimer: I am not writing my full opinions. I am writing this as if I were an alien writing an encyclopedia entry on something they know is a good idea. These aliens may define “corrigibility” and its sub-categories slightly differently than earthlings. Also, I am bad at giving things catchy names, so I’ve decided that whenever I need a name for something I don’t know the name of, I will make something up and accept that it sounds stupid. 45 minutes go. (EDIT: Okay, partway done and having a reasonably good time. Second 45 minutes go!) (EDIT2: Ok, went over budget by another half hour and added as many topics as I finished. I will spend the other hour and a half to finish this if it seems like a good idea tomorrow.)
-
Corrigibility
An agent models the consequences of its actions in the world, then chooses the action that it thinks will have the best consequences, according to some criterion. Agents are dangerous because specifying a criterion that rates our desired states of the world highly is an unsolved problem (see value learning). Corrigibility is the study of producing AIs that are deficient in some of the properties of agency, with the intent of maintaining meaningful human control over the AI.
Different parts of the corrigible AI may be restricted relative to an idealized agent: world-modeling, consequence-ranking, or action-choosing. When elements of the agent are updated by learning or training, the updating process must preserve these restrictions. This is nontrivial because simple metrics of success may be better fulfilled by more agential AIs. See restricted learning for further discussion, especially restricted learning § non-compensation, which covers open problems in preventing learning or training in one part of the AI from compensating for restrictions nominally located in other parts.
Restricted world-modeling
Restricted world-modeling is a common reason for AI to be safe. For example, an AI designed to play the computer game Brick-Break may choose the action that maximizes its score, which would be unsafe if actions were evaluated using a complete model of the world. However, if actions are evaluated using a simulation of the game of Brick-Break, or if the AI’s world model is otherwise restricted to modeling the game, then it is likely to choose actions that are safe.
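The Brick-Break example can be sketched in a few lines. This is only a toy illustration, not a real game or agent: the names `GameState`, `simulate`, `score`, and `choose_action` are hypothetical, and the "score" is a stand-in (distance from paddle to ball). The point it tries to show is that every consequence the agent ranks is computed inside the game simulation, so nothing outside the game can enter the comparison.

```python
from dataclasses import dataclass

# Hypothetical action set: move paddle left, stay, move right.
ACTIONS = (-1, 0, 1)


@dataclass(frozen=True)
class GameState:
    paddle_x: int
    ball_x: int
    width: int = 10


def simulate(state: GameState, action: int) -> GameState:
    """Toy game model. This is the agent's entire world model."""
    new_x = min(max(state.paddle_x + action, 0), state.width - 1)
    return GameState(new_x, state.ball_x, state.width)


def score(state: GameState) -> int:
    """Stand-in for the game score: higher when the paddle is nearer the ball."""
    return -abs(state.paddle_x - state.ball_x)


def choose_action(state: GameState) -> int:
    # Actions are ranked purely by their simulated in-game consequences.
    return max(ACTIONS, key=lambda a: score(simulate(state, a)))
```

Here `choose_action(GameState(paddle_x=2, ball_x=7))` moves the paddle right. The maximization is as aggressive as ever; the restriction lives entirely in what `simulate` is able to represent.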
Many proposals for “tool AI” or “science AI” fall into this category. If we can create a closed model of a domain (e.g. the electronic properties of crystalline solids), and simple objectives within that domain correspond to solutions to real-world problems (e.g. superconductor design), then learning and search within the model can be safe yet valuable.
It may seem that these solutions do not apply when we want to use the AI to solve problems that require learning about the world in general. However, some closely related avenues are being explored.
Perhaps the simplest is to identify things that we don’t want the AI to think about, and exclude them from the world-model, while still having a world-model that encompasses most of the world. Examples include an AI that deliberately doesn’t know about the measures humans have put in place to shut it off, or an AI that doesn’t have a detailed understanding of human psychology. However, this can be brittle in practice, because ignorance incentivizes learning: an agent that lacks knowledge relevant to its criterion has a reason to acquire it. For more on the learning problem, see restricted learning § doublethink.
Real-time intervention on AI designs that have more dynamic interactions between their internal state and the world model falls under the umbrella of thought policing and policeability. This intersects with altering the action-choosing procedure to select policies that do not violate certain rules; see § deontology.
Counterfactual agency
A corrigible AI built with counterfactual agency does not model the world as it is. Instead, its world model describes some counterfactual world, and it chooses actions that have good consequences within that counterfactual world.
The strategies in this general class are best thought of in terms of restricted action-choosing. We can describe them with an agent that has an accurate model of the world, but that chooses actions by generating a counterfactual world and evaluating the actions’ consequences in that counterfactual, rather than by the usual agential procedure of evaluating them in the accurate model. Note that this also introduces some compensatory pressures on the world-model.
The difficulty lies in choosing and automatically constructing counterfactuals (see automatic counterfactual construction) so that the AI’s outputs can be interpreted by human operators to solve real-world problems, without those outputs being selected by the AI for real-world consequences. For attempts to quantify the selection pressure of counterfactual plans in the real world, see policy decoherence. One example proposal for counterfactual agency is to construct AIs that act as if they are giving orders to perfectly faithful servants, when in reality the human operators will evaluate the output critically.
Counterfactual agency is also related to constructing agents that act as if they are ignorant of certain pieces of knowledge. Taking the previous example of an AI that doesn’t know about human psychology, it might still use learning to produce an accurate world model, but make decisions by predicting the consequences in an edited world model that has less precise predictions for humans, and also freezes those predictions, particularly in value of information calculations. Again, see restricted learning § doublethink.
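As a rough sketch of the "faithful servants" proposal, assume the agent keeps an accurate world model but ranks actions in an edited copy of it. Everything here is hypothetical and for illustration only: the variable `operator_compliance`, the two plan names, and the payoff numbers are made up, and a real counterfactual construction would be far harder than overwriting a dictionary entry.

```python
from typing import Callable, Dict, Iterable

World = Dict[str, float]


def make_counterfactual(world: World, overrides: World) -> World:
    """Copy the accurate world model and overwrite selected variables."""
    edited = dict(world)
    edited.update(overrides)
    return edited


def choose_action(
    actions: Iterable[str],
    world: World,
    overrides: World,
    outcome_value: Callable[[str, World], float],
) -> str:
    # Consequences are ranked in the counterfactual, not in the accurate model.
    counterfactual = make_counterfactual(world, overrides)
    return max(actions, key=lambda a: outcome_value(a, counterfactual))


def toy_value(action: str, world: World) -> float:
    """Made-up payoffs: the honest plan only pays off if operators carry it out."""
    if action == "honest_plan":
        return 10.0 * world["operator_compliance"]
    return 8.0  # "persuasive_plan": payoff does not depend on compliance


accurate_world = {"operator_compliance": 0.6}
plans = ["honest_plan", "persuasive_plan"]

# Acting as if operators were perfectly faithful servants.
counterfactual_choice = choose_action(
    plans, accurate_world, {"operator_compliance": 1.0}, toy_value
)
# Ordinary agential procedure: no edits to the accurate model.
agential_choice = choose_action(plans, accurate_world, {}, toy_value)
```

With these invented numbers, `counterfactual_choice` is `"honest_plan"` while `agential_choice` is `"persuasive_plan"`: the counterfactual removes the advantage of plans selected for how they affect the operators. The sketch says nothing about the compensatory pressures on the world-model mentioned above.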
Mild optimization
We might hope to lessen the danger of agents by reducing how effectively they search the space of solutions, or otherwise restricting that search.
The two simplest approaches are whitelisting and blacklisting. Both restrict the search result to a set that fulfills some pre-specified criteria. Blacklisting uses permissive criteria (anything not explicitly excluded is allowed), while whitelisting uses restrictive ones (only what is explicitly included is allowed). Both face difficulty in retaining safety properties while solving problems in the real world.
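A minimal sketch of both filters, assuming candidate plans can be enumerated and compared by name. The function `restricted_search`, the plan names, and the utilities are all hypothetical; they are chosen only to make the trade-off visible.

```python
from typing import Callable, Iterable, Optional, Set


def restricted_search(
    candidates: Iterable[str],
    utility: Callable[[str], float],
    whitelist: Optional[Set[str]] = None,
    blacklist: Optional[Set[str]] = None,
) -> Optional[str]:
    """Return the highest-utility candidate that passes the given filters."""
    allowed = [
        c for c in candidates
        if (whitelist is None or c in whitelist)
        and (blacklist is None or c not in blacklist)
    ]
    return max(allowed, key=utility) if allowed else None


# Made-up plans and utilities, for illustration only.
UTILITY = {
    "seize_resources": 100.0,
    "unforeseen_exploit": 90.0,
    "run_experiment": 8.0,
    "write_report": 5.0,
}
plans = list(UTILITY)

best_blacklisted = restricted_search(
    plans, UTILITY.__getitem__, blacklist={"seize_resources"}
)
best_whitelisted = restricted_search(
    plans, UTILITY.__getitem__, whitelist={"write_report", "run_experiment"}
)
```

In this toy setup the blacklist run returns `"unforeseen_exploit"`, a plan nobody thought to exclude, while the whitelist run returns `"run_experiment"` and forgoes anything outside the approved set. That is one way to read the difficulty both approaches face.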
quantilizers
intervening at intermediate reasoning steps
learned human reasoning patterns
Restrictions on consequence-ranking criteria
general deontology
myopia
impact regularization
Human oversight
Types of human oversight
Via restrictions on consequence-ranking
Via counterfactual agency
Yeah, I already said most of the things that I have a nonstandard take on, without getting into the suitcase word nature of “corrigibility” or questioning whether researching it is worth the time. Just fill in the rest with the obvious things everyone else says.