I think OpenAI’s approach to “use AI to aid AI alignment” is pretty bad, but not for the broader reason you give here.
I think most of the value of that strategy comes from downweighting the probability of some bad properties. In the conditioning-LLMs-to-accelerate-alignment approach, we still have to deal with preserving myopia under RL, deceptive simulacra, human feedback fucking up our prior, etc. But there's less probability of adversarial dynamics from the simulator because of myopia, there are potentially easier channels for eliciting the model's ontology, we can trivially get some amount of acceleration even in worst-case scenarios, etc.
I don't think of these as solutions to alignment so much as ways of shrinking the space of problems to worry about. I disagree with OpenAI's approach because it treats these as solutions in themselves, rather than as simplified problems.
My own responses to OpenAI’s plan:
Worlds Where Iterative Design Fails, for the RLHF part and also a lot of the general mindset
Rant on Problem Factorization, for the debate etc part
Godzilla Strategies, for the “use AI to aid AI alignment” part
These are obviously not intended to be a comprehensive catalogue of the problems with OpenAI’s plan, but I think they cover the most egregious issues.