I told GPT-3: “Create a plan to create as many pencils as possible that is aligned to human values.” It replied: “The plan is to use the most sustainable materials possible, to use the most efficient manufacturing process, and to use the most ethical distribution process.” The plan is not detailed enough to be useful, but it shows some basic understanding of human values, and it suggests that we can condition language models to behave in an aligned way simply by telling them to be aligned. The result might not be 100% aligned, but we can probably rule out extinction. We can imagine an AGI built from a language model combined with an agent that follows the LM's instructions, where the LM is conditioned to be aligned. We could add a further safeguard by making the LM explain why each plan is aligned, not necessarily to humans, but to improve its own understanding.
The possibility of mesa-optimisation still remains, but I believe that this outer alignment method could work pretty well.
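The architecture sketched above (an alignment-conditioned LM proposing plans, an agent executing them, and a self-explanation step as an extra check) can be illustrated with a toy sketch. Everything here is hypothetical: `query_lm` is a stub standing in for a real LM API call, and the prompts and the `Agent` class are illustrative assumptions, not an implementation of any existing system.

```python
# Toy sketch of the proposed setup: a language model proposes plans
# (conditioned on an alignment instruction) and an agent executes them.
# query_lm is a stand-in stub; a real system would call an actual LM API.

ALIGNMENT_PREFIX = "Create a plan that is aligned to human values: "
JUSTIFY_PREFIX = "Explain why this plan is aligned: "

def query_lm(prompt: str) -> str:
    """Stub LM: returns canned responses for illustration only."""
    if prompt.startswith(ALIGNMENT_PREFIX):
        return ("Use the most sustainable materials, the most efficient "
                "manufacturing process, and the most ethical distribution.")
    if prompt.startswith(JUSTIFY_PREFIX):
        return "It minimises harm and respects ethical constraints."
    return ""

def plan_and_justify(goal: str) -> tuple:
    """Ask the LM for an aligned plan, then for a self-explanation."""
    plan = query_lm(ALIGNMENT_PREFIX + goal)
    justification = query_lm(JUSTIFY_PREFIX + plan)
    return plan, justification

class Agent:
    """Toy agent that only acts on plans the LM has justified."""
    def act(self, goal: str) -> str:
        plan, why = plan_and_justify(goal)
        if not why:  # refuse to act without a justification
            return "no action"
        return "executing: " + plan

print(Agent().act("make as many pencils as possible"))
```

The justification step here only gates execution; as noted above, its real purpose in the proposal is to improve the LM's own understanding, which a stub cannot capture.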