Welp. I decided to do this, and here it is. I didn’t take nearly enough screenshots. Some large percent of this is me writing things, some other large percent is me writing things as if I thought the outputs of OpenAI’s Playground were definitely something that should be extracted/summarized/rephrased, and a small percentage is verbatim text-continuation outputs. Virtually no attempts were made to document my process. I do not endorse this as useful and would be perfectly fine if it were reign of terror’d away, though IMO it might be interesting to compare against, let’s say, sane attempts. Regardless, here ya go: one hour.
INTRO
It’s past my bedtime.
I’ve got a pint in me.
OpenAI Playground is open as a tab.
A timer is running.
I speak to you now of Corrigibility Concerns.
When deputizing an agent who is not you to accomplish tasks on your behalf, there are certain concerns to… not address, but make sure are addressed. Let’s not Goodhart here. Jessica Taylor named “quantilization”. Paul Christiano named “myopia”. Eliezer Yudkowsky named “low impact” and “shutdownability”. I name “eli5ability” and I name “compressibility” and I name “checkpointable” and I name “testable”.
OVERVIEW
When we list out all of our proxy measures, we want corrigibility to be overdetermined. We want to achieve 70% of our goals completely and the rest half-assed and still end up with a corrigible agent. It’s okay to project what we want from an agent onto non-orthogonal dimensions and call each vector important.
So let’s define a corrigible agent. A corrigible agent is an agent that:
Does what we want it to do.
Doesn’t want to do something else.
Can easily be checked for doing what we want it to do.
Can be shut down if we want it to stop doing something.
Can be restarted if we want it to do something else.
Can explain to us why it did something.
Doesn’t hide its intentions from us.
Doesn’t want us to not know its intentions.
Can be retrained to do something different if we want it to.
Additionally, because we live in the real world, it must not be too computationally expensive to train, run, check, shut down, restart, explain, retrain, or understand. This includes CPU cycles, wall-clock time, human thought, and so on.
My additions to the lexicon of corrigibility proxy measures are eli5ability, compressibility, checkpointable, and testable, and I will define them here.
ELI5ABILITY
A planning process must output simple plans. Complicated plans will fail, or, if they succeed, will not be understandable by a human. This leads to the following heuristic: “eli5ability” means a plan must be understandable by a non-expert. “Understandable” here carries its technical sense from psychology: a person has a mental model of the task, and that model is close enough to the real task that they can use it to make predictions. In our setting, the resulting plan must be simple enough to serve as input to a much simpler AI whose predictions about the plan’s effects score well on whatever heuristics we use to evaluate plans. This is the sort of adversarial relationship that can be trained and improved in parallel, which in no way guarantees an aligned AI but which certainly constrains the space of non-aligned AIs.
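To make that loop concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `planner`, `eli5_model`, `world_model`, and `effect_distance` are hypothetical objects, not any real API.

```python
# Hypothetical sketch of the planner / eli5-model loop described above.
# None of these names refer to a real library; they only show the shape
# of the incentive.

def eli5ability_score(planner, eli5_model, world_model, task):
    """Reward the planner only when a much simpler model can predict
    what its plan will actually do."""
    plan = planner.propose(task)

    # The simple model reads the plan and predicts its effects.
    predicted_effects = eli5_model.predict_effects(plan)

    # A separate world model (or a real rollout) estimates actual effects.
    actual_effects = world_model.simulate(plan)

    # Score is high when the simple model's predictions track reality:
    # plans too complicated for the eli5 model to follow score poorly.
    return -effect_distance(predicted_effects, actual_effects)
```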
COMPRESSIBILITY
Planning processes must output compressible plans, in the sense that a joint probability function over a simple causal model can be specified by far fewer numbers than one over a complex causal model. The plan must be modular: each individual part should be packaged and vetted by other safety procedures without context, and the effects of the parts should be vetted as a whole without the methods, as estimated both by the planning AI and by the eli5 AI.
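To see that “far fewer numbers” is not hand-waving, assume binary variables: a full joint table over n of them needs 2^n − 1 free parameters, while a causal DAG whose nodes each have at most k parents needs at most n·2^k. A quick count:

```python
# Parameter counts for a distribution over n binary variables.
# A full joint table needs 2^n - 1 free numbers; a causal DAG where
# every node has at most k parents needs at most n * 2^k.

def full_joint_params(n: int) -> int:
    return 2 ** n - 1

def factored_params(n: int, k: int) -> int:
    # Each node stores one conditional probability per configuration
    # of its (at most k) parents.
    return n * 2 ** k

print(full_joint_params(20))     # 1048575
print(factored_params(20, k=2))  # 80
```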
CHECKPOINTABLE
Planning processes must output staged plans. Flexibility in timing, location, target, geography, personnel, etc. should be part of the training process, such that rewards are commensurate with (a) having checkpoints, and (b) still succeeding when the environment is changed after every checkpoint. All coupling in plans must be penalized.
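A toy sketch of what “penalize coupling” could mean, assuming a plan is just an ordered list of stages that declare their dependencies. All names here are illustrative, not a real training interface.

```python
# Toy coupling penalty: a plan is a list of stages, each declaring
# dependencies on earlier stages. Every dependency that reaches past
# the immediately preceding stage is treated as coupling and penalized.

from dataclasses import dataclass, field

@dataclass
class Stage:
    name: str
    depends_on: list[str] = field(default_factory=list)

def coupling_penalty(plan: list[Stage], weight: float = 1.0) -> float:
    order = {stage.name: i for i, stage in enumerate(plan)}
    penalty = 0.0
    for i, stage in enumerate(plan):
        for dep in stage.depends_on:
            # Depending on anything other than the stage right before
            # you makes the checkpoint between them less real.
            if order[dep] != i - 1:
                penalty += weight
    return penalty

plan = [Stage("survey"), Stage("build", ["survey"]),
        Stage("deploy", ["build", "survey"])]  # long-range "survey" dep
print(coupling_penalty(plan))  # 1.0
```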
TESTABLE
We need a better name for this one. I mean testable in the sense that code can be made testable by making it more modular, more idempotent, and so on. Plans should come in pieces; those pieces should have expected effects and real-life unit tests as part of the output, and plans without them are heavily penalized. It should go without saying at this point that the test here is not that the AI tells us what those expected effects and tests are, but that they agree with what other adversarially trained AIs say.
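One possible shape for such output, with field names that are purely my assumption rather than any standard schema:

```python
# Illustrative shape for a "testable" plan piece: each step must ship
# with its expected effects and a concrete real-world check. Plans
# missing either field would be penalized during training.

from dataclasses import dataclass

@dataclass
class PlanStep:
    action: str
    expected_effects: list[str]  # what the planner claims will happen
    unit_test: str               # a concrete, checkable observation

def testability_penalty(steps: list[PlanStep], weight: float = 10.0) -> float:
    missing = sum(1 for s in steps
                  if not s.expected_effects or not s.unit_test)
    return weight * missing
```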
FINAL NOTES
The main intuition I take from crossing my own intuitions with the half-assed text completion of OpenAI’s Playground is this: it may be a useful line of thought to reward a powerful optimizer, during training, according to how well the expected effects it states for its plans agree with the expected effects stated by several differently-built, less-powerful optimizers (also in training) given those same plans as input. At minimum, this may force a powerful optimizer that wants to dodge our actual request into explicit deception, at which point other intent-extraction safety tools may have an easier time picking up on the deceit.
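A minimal sketch of that agreement term, assuming each optimizer exposes a hypothetical predict_effects method and we have some effect_distance metric; this only shows the shape of the incentive, not an implementation.

```python
# Sketch of the agreement-based reward term from the paragraph above.
# `strong`, `weak_models`, `predict_effects`, and `effect_distance`
# are all hypothetical.

def agreement_reward(strong, weak_models, plan, effect_distance):
    claimed = strong.predict_effects(plan)
    # Compare the strong model's claimed effects against what several
    # differently-built weaker models expect the same plan to do.
    disagreements = [effect_distance(claimed, weak.predict_effects(plan))
                     for weak in weak_models]
    # Reward is highest when everyone agrees; a plan whose effects only
    # the strong model can foresee earns nothing from this term.
    return -sum(disagreements) / len(disagreements)
```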