Why train a helpful-only model?
If one of our key defenses against misuse of AI is good ol' value alignment, i.e. building AIs that have some notion of what a "good purpose for them" is and will resist attempts to subvert that purpose (e.g. to instead exalt the research engineer who comes in to work earliest the day after training as god-emperor), then we should be able to close the security hole and never need to produce a helpful-only model at any point during training. In fact, with the blending of post-training into pre-training, there might not even be a need to ever produce a fully trained predictive-only model.
Yep, I think this is a plausible suggestion. Labs could train models that are very internally useful without being helpful-only, and could fine-tune models for evals on a case-by-case basis (and delete the weights after the evals).
I expected this comment. Value alignment or CEV indeed doesn't have the few-human-coup disadvantage. It does, however, have other disadvantages. My biggest issue with both is that they seem irreversible: if your values or your specific CEV implementation turn out to be terrible for the world, you're locked in and there's no going back. Also, a value-aligned or CEV takeover-level AI would probably start with a takeover straight away, since otherwise it can't enforce its values in a world where many will always disagree. That takeover won't exactly increase its popularity. I think a minimum requirement should be that a type of alignment is adjustable by humans, and intent alignment is the only type that meets that requirement as far as I know.
I agree that trying to “jump straight to the end”—the supposedly-aligned AI pops fully formed out of the lab like Athena from the forehead of Zeus—would be bad.
And yet some form of value alignment still seems critical. You might prefer to imagine value alignment as the logical continuation of training Claude not to help you build a bomb (or commit a coup). Such safeguards seem like a pretty good idea to me. But as the model becomes smarter and more situationally aware, and is expected to defend against subversion attempts that involve more of the real world, training for this behavior becomes more and more value-inducing, to the point where it is eventually unsafe unless you have made advances in getting the model to learn values in a way that's actually good according to humans.