Sure, suppose that the alignment problem is in the set of problems that a Bureaucracy Of AIs can solve. This sounds helpful because you’ve ~defined said bureaucracy to be safe, but I doubt it’s possible to build a safe bureaucracy out of unsafe parts—and if it is, we don’t know how to do so!
I dislike the fatalism here, and would rather celebrate direct attacks on the problem even when they don’t work. For example, I’d love to see a more detailed writeup on BoAI proposals across a range of scenarios and safety assumptions :-)
I doubt it’s possible to build a safe bureaucracy out of unsafe parts
The intended construction is to build a safer bureaucracy out of less safe parts/agents (or merely less robustly safe ones). The parts shouldn't break in most runs of the bureaucracy, and the bureaucracy as a whole should break even less often. If distilling such a bureaucracy yields a part/agent safer than the original part, that's an iterative improvement. This doesn't need to change the game in one step, only to improve the situation with each step, in a direction that is hard to formulate without resorting to the device of a bureaucracy. Otherwise the same thing could be done with a more lightweight prompt/tuning setup, where the "bureaucracy" is just the prompt given to a single part/agent.