I think OpenAI’s approach to “use AI to aid AI alignment” is pretty bad, but not for the broader reason you give here.
I think most of the value of that strategy comes from downweighting the probability of some bad properties. In the conditioning-LLMs-to-accelerate-alignment approach, we still have to deal with preserving myopia under RL, deceptive simulacra, human feedback fucking up our prior, etc. But there's less probability of adversarial dynamics from the simulator because of myopia, there are potentially easier channels for eliciting the model's ontology, we can trivially get some amount of acceleration even in worst-case scenarios, etc.
I don't think of these as solutions to alignment so much as ways of shrinking the space of problems to worry about. I disagree with OpenAI's approach because it treats these as solutions in themselves, rather than as simplified problems.
My own responses to OpenAI’s plan:
Worlds Where Iterative Design Fails, for the RLHF part and also a lot of the general mindset
Rant on Problem Factorization, for the debate etc part
Godzilla Strategies, for the “use AI to aid AI alignment” part
These are obviously not intended to be a comprehensive catalogue of the problems with OpenAI’s plan, but I think they cover the most egregious issues.