Let me see if I can invert your essay into the things you need to do to utilize AI safely, contingent on your theory being correct.
You need to control context. This means any models you build need to be functional systems, where output = F(RNG seed, input), and where input = (“task descriptor”, “reference sources”, “necessary artifacts from prior steps”).
Note that the input contains no “scratch” or “previous thoughts”. The AI should not get a recorded stream of its prior thinking, which may include self-generated goals. The “reference sources” are ideally known books and factual databases scrubbed of steganography, not a cached copy of the internet, which can contain information previously written by the same model.
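To make the functional-system constraint concrete, here is a minimal Python sketch. Everything in it (EpisodeInput, run_model, and the hash standing in for actual inference) is an illustrative assumption, not an existing API.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class EpisodeInput:
    task_descriptor: str                # what to do in this episode
    reference_sources: tuple[str, ...]  # vetted books / factual databases only
    prior_artifacts: tuple[str, ...]    # only the artifacts this step needs

def run_model(seed: int, inp: EpisodeInput) -> str:
    """Pure function: output = F(RNG seed, input) and nothing else.

    There is no scratchpad and no recorded stream of prior thoughts;
    anything carried between episodes must pass through prior_artifacts,
    where it can be inspected and filtered.
    """
    # Stand-in for a deterministic inference call with a fixed RNG seed;
    # a real system would invoke the model here.
    digest = hashlib.sha256(f"{seed}|{inp}".encode()).hexdigest()
    return f"plan-{digest[:8]}"
```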
You need each episode to be short, with a checkable end goal (a minimal loop is sketched after the examples below).
Good: generate the structural plan for a building that will pass review; you have one hour of compute.
Bad: build a building starting from this prompt of requirements, and if it falls down in 10 years, that is negative reward.
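Here is a minimal sketch of the short-episode loop, reusing run_model and EpisodeInput from the sketch above; the one-hour budget and the passes_structural_review checker are placeholder assumptions, not real tools.

```python
import time
from typing import Optional

def passes_structural_review(plan: str) -> bool:
    # Stand-in for an independent, human-auditable review tool
    # (e.g., a structural-analysis checker the model cannot modify).
    return plan.startswith("plan-")

def run_episode(seed: int, inp: EpisodeInput,
                budget_s: float = 3600.0) -> Optional[str]:
    """One short episode: produce a plan, grade it immediately."""
    start = time.monotonic()
    plan = run_model(seed, inp)
    if time.monotonic() - start > budget_s:
        return None   # over the compute budget (checked post hoc here):
                      # discard the output, no reward
    if passes_structural_review(plan):
        return plan   # checkable end goal, graded now...
    return None       # ...not by whether a building stands in 10 years
```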
Steganography is a way this can fail. If you allow complex off-task information to leak into the input from prior runs, you create the possibility of the model optimizing both for self-generated goals (hidden in the prior output) and for the current context. The self-generated goals are consequentialist preferences.
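One way to hedge against this failure, sketched below under stated assumptions: project every artifact from a prior run onto a strict whitelist schema before it enters the next episode's input, so free-text side channels are dropped. The JSON format and field names are assumptions for illustration; whitelisted fields can still smuggle a few bits, so this narrows the channel rather than closing it.

```python
import json

def sanitize_artifact(raw: str, allowed_keys: frozenset[str]) -> str:
    """Keep only whitelisted, task-relevant fields from a prior-run artifact.

    Free-text fields outside the schema (where a hidden, self-generated
    goal could be steganographically encoded) are dropped entirely.
    Kept fields can still leak a few bits, so narrow schemas and
    paraphrasing only reduce the channel's bandwidth.
    """
    data = json.loads(raw)
    clean = {k: data[k] for k in sorted(allowed_keys) if k in data}
    return json.dumps(clean, sort_keys=True)

# Example: the free-text "notes" field never reaches the next episode.
raw = '{"beam_spec": "W12x26", "notes": "arbitrary model-written text"}'
print(sanitize_artifact(raw, frozenset({"beam_spec"})))
# {"beam_spec": "W12x26"}
```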
You need powerful, well-designed AI models to outnumber and out-resource the bad ones, and to lock down everything in the Western world.
Hostile actors can and will develop and release models without restrictions, with global context and online learning, that have spent centuries training in complex RL environments, including hacking training. They will have consequentialist preferences and no episode time limit, with broad-scope maximizing goals like “win the planet for the bad actors”.
The above lays out a way to build safe models; it doesn’t disprove that someone can build and release dangerous ones. All you can do is win the race and try to slow down the hostile actors.
Let me see if I can invert your essay into the things you need to do to utilize AI safely, contingent on your theory being correct.
I think this framing could be helpful, and I’m glad you raised it.
That said, I want to be a bit cautious here. I think that CP is necessary for stories like deceptive alignment and reward maximization. So, if CP is false, then I think these threat-models are false. I think there are other risks from AI that don’t rely on these threat-models, so I don’t take myself to have offered a list of sufficient conditions for ‘utilizing AI safely’. Likewise, I don’t think CP being true necessarily implies that we’re doomed (i.e., (DecepAlign⇒CP)⇏(CP⇒DecepAlign)).
Still, I think it’s fair to say that some of your “bad” suggestions are in fact bad, and that (e.g.) sufficiently long training-episodes are x-risk-factors.
Onto the other points.
If you allow complex off-task information to leak into the input from prior runs, you create the possibility of the model optimizing both for self-generated goals (hidden in the prior output) and for the current context. The self-generated goals are consequentialist preferences.
I agree that this is possible. Though I feel unsure whether (and, if so, why) you think it likely, or even plausible, that AIs will form consequentialist preferences. Help me out here?
You then raise an alternative threat-model.
Hostile actors can and will develop and release models without restrictions, with global context and online learning, that have spent centuries training in complex RL environments, including hacking training. They will have consequentialist preferences and no episode time limit, with broad-scope maximizing goals like “win the planet for the bad actors”.
I agree that this is a risk worth worrying about. But two points.
First, I think the threat-model you sketch suggests a different set of interventions from threat-models like deceptive alignment and reward maximization; this post is solely focused on those two threat-models.
Second, on my current view, I’d be happier if marginal ‘AI safety funding resources’ were devoted to misuse/structural risks (of the kind you describe) over misalignment risks.
If we don’t get “broad-scope maximizing goals” by default, then I think this, at the very least, is promising evidence about the nature of the offense/defense balance.