Scalable alignment tactic: don’t let models know they’re doing our alignment homework
[Epistemic status: alpha-stage thought, tell me why I’m wrong]
[Updated 12/31/25: reworded to make it clearer that this is intended as one specific tactic for getting more alignment work from potentially-misaligned models, not as a top-level scalable alignment strategy. Added codebase example.]
One of the central objections to using AI to do alignment work is that either a) your current model is misaligned, in which case it’s not safe to have it ‘do your homework’, or b) your current model is aligned, in which case you already know how to align AI, so what do you even need it for?
A core difficulty with preventing misuse of LLMs is that bad actors can split the work they need done into smaller subtasks that reveal little about the top-level goal, so as not to trigger refusals.
We can leverage that same misuse tactic to have potentially-misaligned models do more alignment work: split the alignment work into smaller subtasks that don’t give the model enough info to know that (if it’s misaligned) it should sabotage them. As in AI control strategies, we can apply careful analysis to some percentage of such subtasks in order to block the red-team strategy of just sabotaging them randomly.
We can’t necessarily split every alignment task into subtasks that don’t look like alignment work, but we may be able to do it with many parts of alignment work, enabling us to safely get substantially more alignment work out of potentially-misaligned models.
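As a toy sketch of the auditing idea (function names and the audit rate are illustrative, not a real control protocol):

```python
import random

def run_with_auditing(subtasks, do_subtask, audit, audit_rate=0.1, rng=random):
    """Dispatch subtasks to an untrusted model, carefully reviewing a random fraction.

    do_subtask(t) -> the untrusted model's result for subtask t
    audit(t, result) -> True if the result passes careful (human/trusted) review
    Raises if any audited subtask fails review; otherwise returns all results.
    """
    results = []
    for t in subtasks:
        result = do_subtask(t)
        # Each subtask is independently selected for careful analysis, so a
        # model that sabotages at random cannot predict which ones are checked.
        if rng.random() < audit_rate and not audit(t, result):
            raise RuntimeError(f"sabotage detected on subtask {t!r}")
        results.append(result)
    return results
```

With audit rate p, a model that sabotages k subtasks at random is caught with probability 1 − (1 − p)^k, which climbs quickly with k; this is the standard control-style argument against the "sabotage randomly" red-team strategy.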
A characteristic sort of case where I imagine this helping is: as part of an overall alignment strategy, you need to write a codebase that checks AI output for being manipulative-to-humans before any humans see it. If you ask a potentially misaligned model to write the codebase, you have to worry about sabotage. But if you can successfully separate out parts of the codebase that the model can’t identify as alignment-related, you can have it do those parts without worrying about sabotage[1]. This has the additional advantage that capabilities improvements work to your benefit.
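To make the codebase example a bit more concrete, here is a hypothetical sketch of the kind of component that might be farmed out without looking alignment-related: generic score-and-threshold plumbing. All names are invented for illustration; the point is that the subtask spec never needs to say what the score measures.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GateDecision:
    release: bool   # safe to show to humans?
    score: float    # raw classifier score
    reason: str

def gate_output(text: str, score_fn: Callable[[str], float],
                threshold: float = 0.5) -> GateDecision:
    # To the model writing this plumbing, score_fn is just "text -> number";
    # nothing here reveals that the score is meant to detect manipulation.
    score = score_fn(text)
    if score >= threshold:
        return GateDecision(release=False, score=score, reason="held for human review")
    return GateDecision(release=True, score=score, reason="below threshold")
```

The alignment-revealing part (the manipulation classifier plugged in as `score_fn`) stays separate from the generic infrastructure the potentially-misaligned model is asked to write.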
If you think this is not a useful strategy, then why not? How does it fail? Alternatively, have AI Control researchers like @Buck already included this strategy in their toolbox?
Except insofar as the model is indiscriminately sabotaging code, which I would expect to be fairly catchable for near-human-level systems.
I don’t think this is an objection anyone makes. I think it’s widely agreed that the safety of more capable agents is harder to guarantee, so aligned low-capability models would be useful research assistants for that work. My central objections are different, mostly this and this. Maybe also that most people underestimate how difficult non-fake alignment research is.
Problem factorisation is difficult for non-trivial problems.
That weakness of LLMs is a skill issue, and should (probably, mostly) go away at >human-researcher capability.
Thanks!
Reasonable objections![1] The proposal is only useful as a tactic for preventing sabotage from scheming models (which is definitely a concern I’ve heard). Looking back, I think I may have given the impression that I saw this as a top-level scalable alignment strategy; I only intend it as one useful tool for the toolbox.
Agreed. The proposal is not useful for those parts of alignment which are (as @speck1447 puts it below) irreducible and characteristic of alignment work.
A characteristic sort of case where I imagine this helping is: as part of an overall alignment strategy, you need to write a codebase that checks AI output for being manipulative-to-humans before any humans see it. If you ask a potentially misaligned model to write the codebase, you have to worry about sabotage. But if you can successfully separate out parts of the codebase that the model can’t identify as alignment-related[2], you can have it do those parts without worrying about sabotage[3]. This has the additional advantage that capabilities improvements work to your benefit.
When you say ‘that weakness’, you mean the inability to identify a subtask as alignment-related?
I take the objections from those posts as primarily 1) verification can be very hard, and 2) even if models are not scheming, their alignment proposals are likely to fail because they’re sloptimized to look good to researchers rather than to work, and you may not be able to test them before your one critical try.
Not trivial but probably not impossible.
Except insofar as they’re indiscriminately sabotaging code, which is a trait we’re likely to successfully identify in early transformative AI.
Mainly “bad actors can split their work...” with current LLMs, but yeah also identifying/guessing the overall intentions of humans giving subtasks.
I’m having trouble coming up with a concrete scenario that this defends against. The model needs to be misaligned enough that we can’t trust it with alignment, smart enough to develop fake alignment tools which fool human overseers but actually reliably align the resulting model to its goals rather than ours, but not smart enough to realize when a problem it’s solving is a crucial step in the true alignment project? I expect alignment to have several irreducibly-hard parts, and I expect those parts to be characteristic of alignment in particular, so this doesn’t seem like a coherent combination of capabilities to me.