> your current model is aligned, in which case you already know how to align AI, so what do you even need it for?
I don’t think this is an objection anyone makes. I think it’s widely agreed that the safety of more capable agents is harder to guarantee, so aligned low-capability models would be useful research assistants for that work. My central objections are different, mostly this and this. Maybe also that most people underestimate how difficult non-fake alignment research is.
> If you think this is not a useful strategy, then why not?
Problem factorisation is difficult for non-trivial problems.
That weakness of LLMs is a skill issue, and should (probably, mostly) go away at >human-researcher capability.
Thanks!

> My central objections are different, mostly this and this.
Reasonable objections![1] The proposal is only useful as a tactic for preventing sabotage from scheming models (which is definitely a concern I’ve heard). Looking back, I think I may have given the impression that I saw this as a top-level scalable alignment strategy; I only intend it as one useful tool for the toolbox.
> Problem factorisation is difficult for non-trivial problems.
Agreed. The proposal is not useful for those parts of alignment which are (as @speck1447 puts it below) irreducible and characteristic of alignment work.
A characteristic sort of case where I imagine this helping is: as part of an overall alignment strategy, you need to write a codebase that checks AI output for being manipulative-to-humans before any humans see it. If you ask a potentially misaligned model to write the codebase, you have to worry about sabotage. But if you can successfully separate out parts of the codebase that the model can’t identify as alignment-related[2], you can have it do those parts without worrying about sabotage[3]. This has the additional advantage that capabilities improvements work to your benefit.
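To make the factoring concrete, here is a toy sketch (the function and names are invented for illustration, not part of any real pipeline): the untrusted model only ever sees a generic, context-free spec like "greedily pack items into batches under a cost budget", while the trusted, human-written layer that composes such pieces into the manipulation-checker is kept away from it.

```python
# Toy sketch of a factored-out subtask. The spec the untrusted model sees
# is purely generic ("pack items into batches under a cost budget");
# nothing marks it as part of a manipulation-checking pipeline.

def batch_by_budget(items, cost, budget):
    """Greedily pack items into batches whose total cost stays <= budget."""
    batches, current, total = [], [], 0
    for item in items:
        c = cost(item)
        if current and total + c > budget:
            batches.append(current)
            current, total = [], 0
        current.append(item)
        total += c
    if current:
        batches.append(current)
    return batches

# Only the trusted layer (human-written) knows this feeds a safety check,
# e.g. batching candidate outputs for a manipulation classifier.
chunks = ["a" * 30, "b" * 50, "c" * 40, "d" * 10]
print([len(b) for b in batch_by_budget(chunks, len, 80)])  # -> [2, 2]
```

The point is not the batching logic itself but the interface: a reviewer (or the model) looking at this subtask alone gets no signal about what it will be used for.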
> That weakness of LLMs is a skill issue
When you say ‘that weakness’, you mean the inability to identify a subtask as alignment-related?
[1] I take the objections from those posts as primarily: 1) verification can be very hard, and 2) even if models are not scheming, their alignment proposals are likely to fail because they’re sloptimized to look good to researchers rather than to work, and you may not be able to test them before your one critical try.
[2] Not trivial but probably not impossible.
[3] Except insofar as they’re indiscriminately sabotaging code, which is a trait we’re likely to successfully identify in early transformative AI.
Mainly “bad actors can split their work...”, as applied to current LLMs, but yeah, also identifying/guessing the overall intentions of the humans giving out subtasks.