Thanks for the post (and for linking the research agenda, which I haven’t yet read through)! I’m glad that, even if you use the framing of debate (which I don’t expect to pan out) to think about alignment, you still get to instrumental subproblems that would be broadly useful.
(If this post is “what would help make debate work for AI alignment,” you can also imagine the framings “what would help make updating on human feedback work” [common ARC framing] and “what would help make model-based RL work” [common Charlie framing].)
I’d put these subproblems into two buckets:

1. Developing theorems about specific debate ideas. These are the most debate-specific.
2. Formalizing fuzzy notions. By which I mean fuzzy notions that are kept smaller than the whole alignment problem, so that you can hope for a useful formalization that takes some good background properties for granted.
I think there’s maybe a missing bucket, which is:

3. Bootstrapping or indirectly generating fuzzy notions. If you allow a notion to grow to the size of the entire alignment problem (the one that stands out to me in your list is ‘systematic human error’: if you make your model detailed enough, what error isn’t systematic?), then it becomes too hard to formalize first and apply second. You instead need to study how to safely bootstrap concepts, or how to get them from other learned processes.
These buckets seem reasonable, and +1 to it being important that some of the resulting ideas are independent of debate. In particular, on the inner alignment side, this exercise (1) made me think exploration hacking might be a larger fraction of the problem than I had previously thought, which is encouraging as it might be tractable, but (2) there may still be an opening for learning theory that tries to say something about residual error, along the lines of https://x.com/geoffreyirving/status/1920554467558105454.
On the systematic human error front, we’ll put out a short post on that soon (next week or so), but broadly the framing is to start with a computation which consults humans, and instead of assuming the humans have unbiased error, assume that the humans are wrong on some unknown ε-fraction of queries with respect to some distribution. You can then try to change the debate protocol so that it detects whether the ε-fraction can be chosen so as to flip the answer, and reports uncertainty in that case. This still requires a substantive assumption about humans, but a weaker one, and it leads to specific ideas for protocol changes.
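As a very rough illustration of the kind of check this suggests (a toy sketch only, not the actual protocol change; the `robust_answer` function and the majority-vote example below are made up for exposition), you can treat the computation as a function of a vector of binary human answers and ask whether flipping any ε-fraction of them could change its output:

```python
from itertools import combinations
from typing import Callable, List

def robust_answer(
    computation: Callable[[List[bool]], bool],
    human_answers: List[bool],
    eps: float,
) -> str:
    """Report "uncertain" if corrupting at most an eps-fraction of the human
    answers can flip the computation's output; otherwise report the output.

    Brute force over flip sets (exponential in the budget), purely illustrative.
    """
    n = len(human_answers)
    budget = int(eps * n)  # maximum number of queries that may be wrong
    baseline = computation(human_answers)

    for k in range(1, budget + 1):
        for flip_set in combinations(range(n), k):
            corrupted = list(human_answers)
            for i in flip_set:
                corrupted[i] = not corrupted[i]
            if computation(corrupted) != baseline:
                return "uncertain"  # some admissible error pattern flips the answer
    return "yes" if baseline else "no"

# Example: a majority vote over 7 human judgments with eps = 0.3 (budget of 2 flips).
def majority(answers: List[bool]) -> bool:
    return sum(answers) > len(answers) / 2

print(robust_answer(majority, [True, True, True, True, False, False, False], 0.3))
# Prints "uncertain": flipping a single query turns the 4-3 split into 3-4.
```

A real protocol obviously can’t enumerate flip sets like this; the point is just the interface: under the ε-error assumption, the desired behaviour is to output the answer when it is stable under every admissible error pattern, and to report uncertainty otherwise.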