“Training” Story for an Agentised-LLM turned AGI:
The following is a subsection of a draft. Keen for feedback.
I’m currently studying the potential danger from the ongoing open-source effort to build better and better Agentised-LLMs (A-LLMs), such as AutoGPT or BabyAGI.
Hubinger’s “Training Stories” provides a framework for evaluating proposals to build safe, advanced AI. If we stretch it, we can use it to examine the potential danger from A-LLMs by evaluating a mock “proposal”.
Spoilers: A-LLMs are highly competitive but unlikely to be aligned.
Stretching To Fit The Framework:
1. I’m going to pretend that A-LLMs don’t exist yet and evaluate a fictitious “proposal” for creating an advanced AI via an army of open-source developers iterating and improving on A-LLM architectures.
2. The “training” is instead memetic evolution. A-LLM architectures aren’t going to be trained end-to-end by our open-source developers, but architectures that perform well or do novel things will be more likely to be forked or starred.
3. The “training goal” is intended to be a specific kind of algorithm and not just a description of what you want out of the system. As there is no unified training goal among A-LLM developers, I also mention the behavioral goal of the system.
The Proposal:
What kind of algorithm are we hoping the model will learn? (Training goal specification)
The training goal is supposed to be a specific class of algorithm, but there is no specific algorithm desired here.
Instead, we are aiming to produce a model that is capable of strategic long-term planning and of providing economic benefit to me. (For example, I would like an A-LLM that can run a successful online business.)
Our goal is purely behavioral and not mechanistic.
Why is that specific goal desirable?
We haven’t specified any true training goal.
However, the behavioral goal of producing a capable, strategic, and novel agent is desirable because such an agent would generate a lot of economic benefit.
What are the training constraints?
We will “train” this model by having a large number of programmers each attempting to produce the most capable and impressive system.
Training is likely to cease only due to regulation or an AGI attempting to stop the emergence of competitor AIs.
If an AGI does emerge, we consider it to be the model “trained” by this process.
What properties can we say it has?
1. It is capable of propagating itself (or its influence) through the world.
2. It must be capable of circumventing whatever security measures exist in the world intended to prevent this.
3. It is a capable strategic planner.
Why do you expect training to push things in the direction of the desired training goal?
Again, there is no training goal.
Instead, we can expect training to nudge things toward models which appear novel or economically valuable to humans. Breakthroughs and improvements will spread memetically between programmers, with the most impressive ones rapidly spreading around the globe thanks to the power of open source.
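To make that selection pressure concrete, here is a minimal toy simulation of the dynamic (a sketch under assumed numbers, not a measurement of any real ecosystem). Developers can see and select on apparent capability via stars and forks, but they cannot see alignment, so nothing selects for it:

```python
import random

random.seed(0)

# Each architecture is a (capability, alignment) pair. Capability is what
# stars and forks respond to; alignment is invisible to the selection process.
population = [(1.0, 1.0) for _ in range(50)]

for generation in range(100):
    # Forks preferentially go to whatever looks most capable and novel.
    weights = [cap ** 2 for cap, _ in population]
    parents = random.choices(population, weights=weights, k=len(population))
    # Every fork tweaks the design. Impressive capability mutations are kept
    # by selection; alignment mutations are never observed, so they drift.
    population = [
        (max(0.05, cap + random.gauss(0.0, 0.2)),
         ali + random.gauss(0.0, 0.2))
        for cap, ali in parents
    ]

mean_cap = sum(c for c, _ in population) / len(population)
mean_ali = sum(a for _, a in population) / len(population)
print(f"mean capability: {mean_cap:.2f}, mean alignment: {mean_ali:.2f}")
# Capability climbs under selection; alignment performs an unconstrained walk.
```

The point of the toy model is just that the fork-and-star dynamic optimises the visible trait and leaves the invisible one to chance.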
Evaluation:
Training Goal—Alignment:
Given that there is no training goal, this scores very poorly.
The final AGI would have a high chance of being unaligned with humanity’s interests.
Training Goal—Competitiveness:
Given that there is no training goal, the competitiveness of the final model is not constrained in any way. The training process selects for strategic and novel behavior.
Training Rationale—Alignment:
There’s no training goal, so the final model can’t be aligned with it. Further, there doesn’t seem to be any guarantee that the model will be aligned with any goal at all.
If the model is attempting to follow a specific string variable labelled “goal” given to it by its programmer, there’s a decent chance we end up with a paperclip maximiser.
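For concreteness, here is a stripped-down caricature of the kind of agent loop at issue (a sketch, not AutoGPT’s or BabyAGI’s actual code; `call_llm` is a hypothetical stand-in for whatever completion API a developer wires in). The agent’s objective is literally whatever free-text string lands in `goal`:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM completion call, not a real API."""
    raise NotImplementedError

def run_agent(goal: str, max_steps: int = 100) -> None:
    history: list[str] = []
    for _ in range(max_steps):
        # The raw goal string is injected verbatim into every planning prompt.
        prompt = (
            f"You are an autonomous agent. Your goal: {goal}\n"
            f"Actions taken so far: {history}\n"
            "Reply with the single next action that best advances the goal."
        )
        action = call_llm(prompt)
        history.append(action)
        # Note what is missing: no check that the action is safe, reversible,
        # or what the user actually meant -- only that it advances `goal`.

# run_agent("Maximise paperclip production")  # pursued entirely literally
```

Nothing in this loop distinguishes “run a profitable business” from “make as many paperclips as possible”; the string is pursued literally, for as many steps as the loop allows.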
It’s of course worth noting that there is a small contingent of people who would provide an explicitly harmful goal. (See Chaos-GPT, though you’ll be relieved to hear that its developers appear to have shifted from trying to Roko everyone to running a crypto Ponzi scheme.)
Training Rationale—Competitiveness:
A recently leaked memo from Google indicates that they feel open source is catching up with the major industry players.
Our “training” requires a large amount of manpower, but there is a large community of people who will help out with this project for free.
The largest hurdle to competitiveness would come from A-LLMs as a concept having some major, but currently unknown, flaw.
Conclusion:
The proposal scores very highly in terms of competitiveness. The final model should be competitive (possibly violently so) with any rivals, and the fact that people are willing to work on the project for free makes it financially viable.
Unfortunately, the proposal scores very poorly on alignment, and there is no real effort to ensure the model really is aligned.
It is concerning that this project is already going ahead.